How to connect the Set Role operator to the Apply Model operator

m_gholami1991 Member Posts: 17 Contributor I
edited December 2018 in Help

Hi,
I have two questions and would appreciate your guidance.

1- I have a dataset of 5000 samples without labels. I also have a second dataset of 100 labeled samples, none of which appear in the 5000-sample dataset. Is it acceptable to remove the labels from the 100 samples, cluster them with several clustering algorithms, then add the labels back and check how many samples each algorithm clustered correctly? And then, if an algorithm reaches a high clustering accuracy, to cluster the 5000 samples with that same algorithm?


2- I ran the scenario from my first question in RapidMiner, but I do not know how to connect two of the operators. Does anyone know how to connect the Set Role operator and the Apply Model operator together? I am sending the related file and hope you can help me.
Also, the dataset is available at the link below:

https://drive.google.com/drive/folders/1t2qEnc7K35IHKfDVvG2dqEHZ_lNHZBis

Answers

  • David_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 296 RM Research

    Hi,

    Regarding your first question:

    Even if your approach is theoretically valid, there is no need to remove the label before clustering. If you set the role of that attribute to "label", the clustering algorithm will ignore it.

    But if you have labels, why not use a supervised learning algorithm to directly train a model that predicts this label?

    You can then apply this model to your second data set (where you don't have the label). The only potential issue I see there is that the training size
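
The supervised route described here (train on the small labeled set, then apply the model to the unlabeled set) can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the rows are made up, the function names are hypothetical, and a simple nearest-centroid classifier stands in for whatever learner one would actually pick.

```python
from collections import defaultdict
from math import dist


def fit_nearest_centroid(rows, labels):
    """Return {label: centroid} computed from the labeled training rows."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in grouped.items()
    }


def apply_model(model, rows):
    """Predict, for each row, the label whose centroid is closest."""
    return [min(model, key=lambda label: dist(model[label], row)) for row in rows]


# Made-up stand-in for the small labeled data set (two numeric attributes).
labeled_rows = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels = ["One", "One", "Two", "Two"]

# Made-up stand-in for the large unlabeled data set.
unlabeled_rows = [(0.9, 1.1), (4.9, 5.3)]

model = fit_nearest_centroid(labeled_rows, labels)
predictions = apply_model(model, unlabeled_rows)
print(predictions)
```

The point of the sketch is the split of responsibilities: the label is only needed when the model is fitted, not when it is applied.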

    To connect the two operators, left-click one of the ports you want to connect, then move to the other port and click again (see this tutorial video for an example: https://youtu.be/ophGqpUexKI?t=2m14s).

    Best,
    David

  • m_gholami1991 Member Posts: 17 Contributor I

    hi

    About the first solution you suggested: if you give a dataset to a supervised learning algorithm like Decision Tree, you must specify the label column in the algorithm's input. Thus, for the 5000 unlabeled samples, it is not possible to use supervised algorithms. I want to measure the accuracy on the 170 labeled samples with a clustering algorithm like K-means and then, if the accuracy is high, cluster the 5000 samples with the same algorithm.
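
One common way to put a number on "how well does a clustering agree with known labels" is to give each cluster the label that occurs most often inside it and count the matches. The sketch below assumes made-up cluster assignments and labels, and the function names are hypothetical; it is not taken from the process in this thread.

```python
from collections import Counter, defaultdict


def majority_mapping(cluster_ids, labels):
    """Map each cluster to the label that occurs most often inside it."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


def clustering_accuracy(cluster_ids, labels):
    """Share of samples whose cluster's majority label equals their own label."""
    mapping = majority_mapping(cluster_ids, labels)
    hits = sum(mapping[c] == l for c, l in zip(cluster_ids, labels))
    return hits / len(labels)


# Made-up example: six samples, three clusters, three label values.
cluster_ids = ["cluster_0", "cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_2"]
labels = ["One", "One", "Two", "Two", "Two", "Three"]

mapping = majority_mapping(cluster_ids, labels)
accuracy = clustering_accuracy(cluster_ids, labels)
print(mapping)
print(round(accuracy, 3))
```

Note that this measure tends to rise as the number of clusters grows, so it is only fair to compare algorithms that produce the same number of clusters.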

    About the second solution you suggested: as you can see, the input of the Apply Model operator needs a model, and when I connect the exa port of the Set Role operator to the mod port of Apply Model, an error is shown. I need both operators, but I cannot connect them.

    Best regards,
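
The error described here is a type mismatch: an exa port carries an example set, while a mod port expects a model, which has to come from a learner such as Decision Tree. As far as I know, Apply Model takes the example set on its separate unlabelled-data (unl) input. A hypothetical Python analogue (all class and function names below are invented for illustration and are not RapidMiner APIs) shows the same idea:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Row = Sequence[float]


@dataclass
class ExampleSet:
    """Stand-in for what an exa port carries: rows of data."""
    rows: List[Row]


@dataclass
class Model:
    """Stand-in for what a mod port carries: something that can predict."""
    predict: Callable[[Row], str]


def apply_model(model: Model, examples: ExampleSet) -> List[str]:
    """Hypothetical analogue of Apply Model: needs a model AND an example set."""
    if not isinstance(model, Model):
        raise TypeError("the model input needs a Model, got " + type(model).__name__)
    return [model.predict(row) for row in examples.rows]


data = ExampleSet(rows=[(1.0,), (3.0,)])
# Invented one-rule model, standing in for the output of a learner.
threshold_model = Model(predict=lambda row: "Two" if row[0] > 2.0 else "One")

print(apply_model(threshold_model, data))  # model on the model input: works

try:
    apply_model(data, data)  # example set on the model input: rejected
except TypeError as err:
    print("error:", err)
```

In other words, Set Role and Apply Model are not joined directly on the model port; a learner sits between the data and the mod input.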

    Mina

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @m_gholami1991, hi @David_A,

    Sorry, @m_gholami1991, I come with questions rather than answers:

    I played with your data and built a "classic process" with a Decision Tree.

    The resulting model is the following:

    Decision_Tree.png

    or in another form:

    Decision_Tree_2.png

    If I understand correctly, the model is not able to predict (label = One)? However, when the model is applied to the training set (output of a Cross Validation), (label = One) is predicted by the model in some cases:

    Decision_Tree_3.png

    Another case that is not intuitive to me is the following: according to the model, (label = Two) is predicted only if Marital > 2.5; however, there are cases where (label = Two) is predicted with Marital <= 2.5 (Marital = 2) with a confidence of 1:

    Decision_Tree_4.png

    Can you enlighten me on these cases, which are not intuitive to me?

    Thank you for your answers,

    Regards,

    Lionel

    NB: The process:

    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]

  • m_gholami1991 Member Posts: 17 Contributor I

    Hi @lionelderkrikor,

    Thanks for your attention. But, you know, this dataset is only a sample.

    Please look at the picture; I want to explain the different steps.

    Step 1: The 100 labeled samples are read in (with the Label column deleted), normalized, and then clustered on the number of attributes specified with the Select by Weights operator.

    Step 2: Four evaluation criteria are applied to each feature, and finally the features are ranked according to their importance.

    Step 3: After clustering has finished, as you know, a new column is added to the feature columns that shows which cluster each sample belongs to. After that, with the Map operator, we can specify a matching between the names of the clusters and the priorities. (The priorities are the same labels that were already given to the samples.) After that, we can use a decision tree to model the output. (Many people tell me that there is no need for a decision tree at this stage at all and that using it is wrong.)

    Step 4: The 100 labeled samples are read in again and, with the help of the Apply Model operator, applied to the decision tree; the label column is then compared with the clustering results, and the final accuracy is determined by the Performance operator.

    My question relates to Step 3. Is the use of the decision tree wrong? And if the connection is wrong, which operator should be used?

    Main.jpg 146.4K
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @m_gholami1991,

    I have a question: how do you establish the correspondence between the cluster results (cluster_0, cluster_1, cluster_2) and the label values (priority = One/Two/Three)?
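
One possible answer to this correspondence question, when each cluster should map to exactly one label value, is to try every one-to-one assignment and keep the one with the most agreements. The sketch below uses made-up data and a hypothetical function name, and assumes there are no more clusters than label values; with three clusters the brute-force search over six assignments is cheap.

```python
from itertools import permutations


def best_one_to_one_mapping(cluster_ids, labels):
    """Try every one-to-one cluster -> label assignment; keep the best one."""
    clusters = sorted(set(cluster_ids))
    label_values = sorted(set(labels))
    best_mapping, best_hits = None, -1
    for perm in permutations(label_values, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == l for c, l in zip(cluster_ids, labels))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return best_mapping, best_hits / len(labels)


# Made-up example with three clusters and three priorities.
cluster_ids = ["cluster_0", "cluster_0", "cluster_1", "cluster_1", "cluster_2", "cluster_2"]
labels = ["Two", "Two", "One", "Three", "Three", "Three"]

mapping, accuracy = best_one_to_one_mapping(cluster_ids, labels)
print(mapping)
print(round(accuracy, 3))
```

For many clusters the number of permutations explodes; an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the cluster/label count matrix would be the usual replacement.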

    To answer your question: a priori, I don't know whether "the use of Decision Tree is wrong". I recommend that you follow the "classic methodology", that is, perform a Cross Validation with several models and select the best-performing one...

    Regards,

    Lionel

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi again @m_gholami1991,

    OK, after reading your process again, I understood its "philosophy" and what you want to do (excuse me, but here in France it's late in the evening and I am less efficient...).

    Indeed, you want to compare your clustering results to your labelled data, don't you? So there is no need for a Decision Tree.

    So you can take inspiration from this sample process:

    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]

    However, I would try to establish the correspondences between the clustering results (cluster_0, cluster_1, cluster_2) and your labelled data (Priority = One/Two/Three) "manually" at the final step (if such correspondences exist).

    NB: For example, with your sample data, the correspondences are not obvious...
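
A simple aid for establishing such a correspondence by hand is a cross-table of label values against cluster ids: if a clear mapping exists, each row has one dominant cell. The following sketch builds such a table in plain Python; the data is made up to resemble a case where the mapping is not obvious, and the function name is hypothetical.

```python
from collections import Counter


def contingency_table(cluster_ids, labels):
    """Count how often each label value falls into each cluster."""
    counts = Counter(zip(labels, cluster_ids))
    rows = sorted(set(labels))
    cols = sorted(set(cluster_ids))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


# Made-up example where no label value is concentrated in one cluster.
labels = ["One", "One", "Two", "Two", "Three", "Three"]
cluster_ids = ["cluster_0", "cluster_1", "cluster_0", "cluster_1", "cluster_2", "cluster_0"]

rows, cols, table = contingency_table(cluster_ids, labels)

# Print the table with label values as rows and clusters as columns.
print("label".ljust(8) + "".join(c.rjust(11) for c in cols))
for r, counts_row in zip(rows, table):
    print(r.ljust(8) + "".join(str(n).rjust(11) for n in counts_row))
```

When the counts in a row are spread evenly across the columns, as in this made-up example, the clusters do not line up with the labels.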

    1. Labelled data :


    Decision_Tree_5.png

    2. Clustering results :


    Decision_Tree_6.png

    I hope it helps,

    Regards,

    Lionel

  • m_gholami1991 Member Posts: 17 Contributor I

    Hi @lionelderkrikor,

    Yes, you know exactly what I mean :)

    I copied the code you provided and looked at the process. You marked the priority column in the first Read operator.

    But, you know, I think it is better not to mark this column in the first Read operator, because the clustering algorithm may take this column into account for clustering. For that reason, in Step 4, I read the dataset in again and selected this column there.

    On the other hand, if you run my XML file, select this column in the first Read operator, and disable the operators (Step 3: Set Role and Decision Tree | Step 4: TrainData_WithLabel (Read operator), Normalize and Apply Model), an error appears at the Performance operator stage: "Input ExampleSet does not have a label".

  • m_gholami1991 Member Posts: 17 Contributor I

    hi

    Is there anyone who can help me? I really need your help. My thesis presentation is very close. Please....

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @m_gholami1991,

    Here is a working process with the Decision Tree model:

    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize_Dataset" width="90" x="313" y="34"/>
    [...]
    <parameter key="attributes" value="Age|DisabilitySeverity|Edu|IsCityOrNot|Job|Marital|Money|Mostamari|PoshtNobat|Sex|TedadMalolDarKhanevade|TedadMaloliatHarFard"/>
    [...]
    <operator activated="true" class="denormalize" compatibility="9.0.001" expanded="true" height="82" name="De-Normalize" width="90" x="514" y="85"/>
    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize_TestData" width="90" x="1720" y="289"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize_TrainingData" width="90" x="313" y="238"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize" width="90" x="1117" y="340">
    [...]

    I hope it helps,

    Regards,

    Lionel

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @m_gholami1991,

    And here is a simplified process without the Decision Tree:

    As I said in a previous post, the data are just clustered (after performing feature selection) and then simply compared to the labeled data.
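
The overall idea (normalize, cluster the labeled samples without using the label, then reuse the same normalization and cluster centers on the unlabeled samples) can be sketched in plain Python as follows. This is a simplified illustration, not the RapidMiner process itself: the rows are made up, the function names are hypothetical, feature selection is left out, and a tiny k-means with fixed starting centroids stands in for the real operator.

```python
from math import dist


def fit_min_max(rows):
    """Learn per-column (min, max) on one data set so it can be reused on another."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalize(rows, ranges):
    """Scale each column with the stored ranges; constant columns become 0."""
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges))
        for row in rows
    ]


def assign(rows, centroids):
    """Index of the nearest centroid for every row."""
    return [min(range(len(centroids)), key=lambda i: dist(centroids[i], row)) for row in rows]


def k_means(rows, k, iterations=20):
    """Plain Lloyd iterations; the first k rows serve as deterministic starting centroids."""
    centroids = [tuple(r) for r in rows[:k]]
    for _ in range(iterations):
        clusters = assign(rows, centroids)
        for i in range(k):
            members = [r for r, c in zip(rows, clusters) if c == i]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


# Made-up stand-ins: the label column is deliberately not part of the rows.
labeled_rows = [(20, 1), (60, 9), (22, 2), (58, 8)]
unlabeled_rows = [(25, 1), (55, 9)]

ranges = fit_min_max(labeled_rows)
centroids = k_means(normalize(labeled_rows, ranges), k=2)

labeled_clusters = assign(normalize(labeled_rows, ranges), centroids)
unlabeled_clusters = assign(normalize(unlabeled_rows, ranges), centroids)
print(labeled_clusters, unlabeled_clusters)
```

The cluster ids of the labeled rows can then be compared with their known labels, while the unlabeled rows receive cluster ids from the very same centroids. Real k-means results depend on the starting centroids, so several runs are usually compared.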

    The process:

    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize_Dataset" width="90" x="447" y="34"/>
    [...]
    <parameter key="attributes" value="Age|DisabilitySeverity|Edu|IsCityOrNot|Job|Marital|Money|Mostamari|PoshtNobat|Sex|TedadMalolDarKhanevade|TedadMaloliatHarFard"/>
    [...]
    <operator activated="true" class="denormalize" compatibility="9.0.001" expanded="true" height="82" name="De-Normalize" width="90" x="581" y="136"/>
    [...]
    <parameter key="4" value="Sex.true.integer.attribute"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize_TrainingData" width="90" x="313" y="238"/>
    [...]
    <operator activated="true" class="normalize" compatibility="9.0.001" expanded="true" height="103" name="Normalize" width="90" x="1117" y="340">
    [...]

    I hope it helps, too.

    Regards,

    Lionel
