"Clustering and Normalization"

hgwelec Member Posts: 31 Guru
edited May 2019 in Help
Dear All,

I have a dataset which consists of 20 numeric variables.


I would like to apply a z-score transformation to all variables. I use the normalization node, and everything is fine up to this point.


The problem now is that i want to de-normalize values of all 20 fields to the original values so that cluster values make sense.

1) Is there a node to do this for all 20 fields?
2) If not, can someone provide an example of how to do it for a single field only?



Thanks!

Answers

  • steffen Member Posts: 347 Maven
    Hello

    The only hint I can give you is to use AttributeConstruction. Unfortunately you have to include the mean and stdev manually.

    regards,

    Steffen
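A minimal sketch of the single-field case from question 2, written in Python rather than as a RapidMiner process (that choice, and all names below, are assumptions for illustration). It shows that a z-score is undone by multiplying by the standard deviation and adding the mean back. The sample standard deviation is used here; whichever variant the normalization step used must also be used for the inverse.

```python
import statistics

def zscore_params(values):
    """Return (mean, sample standard deviation) of one numeric field."""
    return statistics.mean(values), statistics.stdev(values)

def normalize(values, mean, std):
    """z = (x - mean) / std for every value."""
    return [(v - mean) / std for v in values]

def denormalize(z_values, mean, std):
    """Inverse of the z-score: x = z * std + mean."""
    return [z * std + mean for z in z_values]

# Hypothetical values for a single field.
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
mean, std = zscore_params(raw)
z = normalize(raw, mean, std)
restored = denormalize(z, mean, std)
```

In an AttributeConstruction-style expression this corresponds to `z * stdev + mean`, with the two constants typed in by hand, which is the manual step mentioned above.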

  • haddock Member Posts: 849 Maven
    Hi,

    The nice thing about RM is that you can do things in many different ways...


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="csv_file" value="bla"/>

    <parameter key="filename" value="bla"/>

    <parameter key="remove_double_attributes" value="false"/>

    <parameter key="skip_features_with_name" value="att[0-9]*"/>

    <parameter key="replace_what" value="_from_ES2"/>
    <parameter key="apply_on_special" value="false"/>

    Bit of a mess, because normalization seems to hit objects even if you store them away, but it does the job... I think.

    PS Can someone prod Ingo towards his PM box here, thanks.


  • haddock Member Posts: 849 Maven
    Silly me :-\ if I tick "create view" on the normalization operator I don't need to write and read back the CSV, like this..


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_process" value="false"/>

    <parameter key="return_preprocessing_model" value="true"/>
    <parameter key="create_view" value="true"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="remove_double_attributes" value="false"/>

    <parameter key="skip_features_with_name" value="att[0-9]*"/>

    <parameter key="replace_what" value="_from_ES2"/>
    <parameter key="apply_on_special" value="false"/>

  • hgwelec Member Posts: 31 Guru
    Hello, and thanks for the reply.


    However, I do not understand the example given: where is the DE-normalization happening for every attribute?



    Thanks again!

  • hgwelec Member Posts: 31 Guru
    Haddock,


    Really interesting method:)

    The problem is though that the clustering output still does not show you the DE-normalized values such as in:


    Cluster 0 :
    attr1 : x
    attr2 : y
    attr3 : z


    with x,y,z being DE-normalized


    Perhaps a DE-normalize operator would be useful!?
  • haddock Member Posts: 849 Maven
    Hi,

    The original problem was...
    hgwelec wrote: "The problem now is that i want to de-normalize values of all 20 fields to the original values"
    The method shows the original values, or do you not agree?
    hgwelec wrote: "The problem is though that the clustering output still does not show you the DE-normalized"
    If by "DE-normalized" you mean something other than the "original" values then perhaps so, but that was not the question.

    In short, I disagree that a de-normalizer operator is necessary, because you can always just keep the originals!

  • hgwelec Member Posts: 31 Guru
    Hi again Haddock,

    First of all : ***Thanks for your help*** i do not mean to sound rude :-)

    However the *full* quote was :
    The problem now is that i want to de-normalize values of all 20 fields to the original values so that cluster values make sense

    Notice that the last part says : "so that cluster values make sense"

    Unfortunately this is not the case with your solution. Again i do not want to appear rude i am just giving my opinion that perhaps an operator would prove helpful. Just trying to add my 2 cents...

    Thanks!
  • haddock Member Posts: 849 Maven
    Hi,

    I'm always amused by posts that start "i do not mean to sound rude".

    Versions one and two of the code did the job. Did you run them? Version three was only put in to make things clearer for you. Something got flipped and the clusters got lost. So I'll edit version three out.

    Maybe you'll want to edit your last post as well.


  • hgwelec Member Posts: 31 Guru
    haddock wrote: "I'm always amused by posts that start 'i do not mean to sound rude'."
    Great! Now on with the problem.
    haddock wrote: "Versions one and two of the code did the job. Did you run them?"
    No they didn't, they did the job the way you perceived it / Yes i did run all of them
    haddock wrote: "Version three was only put in to make things clearer for you. Something got flipped and the clusters got lost. So I'll edit version three out."
    So that means that there can be an output like the one i explained? To have the numbers in the cluster model prior the normalization? I sure would like to see how this is possible because this is actually what i wanted originally.
    haddock wrote: "Maybe you'll want to edit your last post as well."
    Sure if you explain why should i, no problem!
  • haddock Member Posts: 849 Maven
    hgwelec wrote: "No they didn't, they did the job the way you perceived it / Yes i did run all of them"
    Excellent, in which case you can explain in what way the original values are not tied to the clusters.
    hgwelec wrote: "So that means that there can be an output like the one i explained? To have the numbers in the cluster model prior the normalization? I sure would like to see how this is possible because this is actually what i wanted originally."
    No, it means exactly what it says: I tried to clarify my solution by adding better titles to the operators, and things stopped working.
    hgwelec wrote, in reply to "Maybe you'll want to edit your last post as well.": "Sure if you explain why should i, no problem!"
    Because you are wrong. Do you disagree that if you normalise numbers and then de-normalise them you should end up with the numbers you started with? De-normalising can be effected just by keeping the originals, which is what my solution does, and I'm sorry you can't understand that.

  • hgwelec Member Posts: 31 Guru
    Haddock,

    The point is that your solution does NOT output a ***Clustering Model window*** with de-normalized values! The sequence should be the following

    1) Get unnormalized values
    2) Normalize them
    3) run clustering model using normalized values
    4) show the CLUSTERING MODEL'S RESULTS DENORMALIZED. I do *not* want for every row it's associated de-normalized value!!

    Your solution does not do step (4) , It writes each de-normalized values to a table! Do you understand the difference Haddock??

    Please try to understand what is sought here..

    From what i can tell (as steffen said) there is no way to do this automatically in RM. If someone else can help on this, please do so


    Thanks!
  • haddock Member Posts: 849 Maven
    hgwelec wrote: "4) show the CLUSTERING MODEL'S RESULTS DENORMALIZED. I do *not* want for every row it's associated de-normalized value!!"
    Please explain this term, and how we were meant to guess it from your original question. Let me remind you of what it actually was.....
    hgwelec wrote: "The problem now is that i want to de-normalize values of all 20 fields to the original values so that cluster values make sense."
    A word of advice, when you can't see over the top of the hole you are digging, stop digging.


  • keith Member Posts: 157 Guru
    If I understand what hgwelec is asking for, he wants to be able to express the centroid values of each cluster in the scale of the original data.

    He's not talking about having an ExampleSet that contains both the raw values and normalized values for each data point. He wants to describe the clusters in the data's natural scale. This would help, for example, in explaining what the clusters are to other people, or even just to better interpret the model himself.

    If my reading of the problem is correct, then the following discussion may be helpful...

    You'd need to know the mean and standard deviation of each attribute in the original data to convert the normalized centroid values to original scale values (i.e. "denormalize"). While RM computes the sum and std dev as part of the meta data view of an ExampleSet, I'm not sure there's a way to get to those values. If you're reading data from a database, you might be able to have a second DatabaseExampleSource with a query that returns the mean and std dev for each attribute.

    Once you have the mean and std dev, you need to get the centroid values into an example set. I haven't worked with clustering models, so I don't know how this would be done in RM. But once you have both the mean+stddev and the centroid values, you can probably use one of the Join operators to match up the clusters with their mean+stdev, and then use AttributeConstruction (as steffen mentioned in the first reply to this thread) to build the centroid values on the original data's scale.

    Hopefully this doesn't add further confusion to the situation...

    Keith
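The conversion described above can be sketched outside RapidMiner as follows. This is an illustration under stated assumptions, not a RapidMiner process: the centroid values and the per-attribute mean and standard deviation are assumed to be already available as plain Python dictionaries, and every name and number below is hypothetical.

```python
def denormalize_centroids(centroids, stats):
    """Map z-scored centroid values back to the original scale.

    centroids: {cluster_name: {attribute: z_value}}
    stats:     {attribute: (mean, std)} taken from the ORIGINAL data
    """
    return {
        cluster: {
            attr: z * stats[attr][1] + stats[attr][0]
            for attr, z in values.items()
        }
        for cluster, values in centroids.items()
    }

# Hypothetical per-attribute statistics and normalized centroids.
stats = {"att1": (50.0, 10.0), "att2": (3.0, 0.5)}
centroids = {
    "cluster_0": {"att1": 1.5, "att2": -1.0},
    "cluster_1": {"att1": -0.5, "att2": 2.0},
}
original_scale = denormalize_centroids(centroids, stats)
```

Because the function loops over whatever attributes are present, the same call covers 20 or 60 attributes without one hand-written expression per attribute.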
  • haddock Member Posts: 849 Maven
    Nice one Keith,

    Now that I do understand, and curiously he'll still need the original/raw data
    keith wrote: "While RM computes the sum and std dev as part of the meta data view of an ExampleSet, I'm not sure there's a way to get to those values."
    I think this does the necessary.


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="1"/>

    <parameter key="attribute_name" value="att1"/>
    <parameter key="window_width" value="100"/>
    <parameter key="result_position" value="start"/>

    <parameter key="replace_what" value="\(|\)"/>

    <parameter key="old_name" value="moving_averageatt1"/>
    <parameter key="new_name" value="avg_att1"/>

    <parameter key="attribute_name" value="att1"/>
    <parameter key="window_width" value="100"/>
    <parameter key="aggregation_function" value="standard_deviation"/>
    <parameter key="result_position" value="start"/>

    <parameter key="replace_what" value="\(|\)"/>

    <parameter key="old_name" value="moving_averageatt1"/>
    <parameter key="new_name" value="stddev_att1"/>





    and this works out the average for each cluster - just added a change of role on the cluster and an OLAP operator to my original offering.


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_process" value="false"/>

    <parameter key="return_preprocessing_model" value="true"/>
    <parameter key="create_view" value="true"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="remove_double_attributes" value="false"/>

    <parameter key="skip_features_with_name" value="att[0-9]*"/>

    <parameter key="replace_what" value="_from_ES2"/>
    <parameter key="apply_on_special" value="false"/>

    <parameter key="name" value="cluster"/>

    <parameter key="att1" value="average"/>
    <parameter key="att2" value="average"/>
    <parameter key="att3" value="average"/>
    <parameter key="att4" value="average"/>
    <parameter key="att5" value="average"/>
    <parameter key="att6" value="average"/>
    <parameter key="att7" value="average"/>
    <parameter key="att8" value="average"/>
    <parameter key="att9" value="average"/>
    <parameter key="att10" value="average"/>
    <parameter key="att11" value="average"/>
    <parameter key="att12" value="average"/>
    <parameter key="att13" value="average"/>
    <parameter key="att14" value="average"/>
    <parameter key="att15" value="average"/>
    <parameter key="att16" value="average"/>
    <parameter key="att17" value="average"/>
    <parameter key="att18" value="average"/>
    <parameter key="att19" value="average"/>
    <parameter key="att20" value="average"/>

    <parameter key="group_by_attributes" value="cluster"/>

    Thanks again for bringing clarity to the question, how we were meant to get that from the original question remains a mystery to me.

  • hgwelec Member Posts: 31 Guru
    @keith,

    This is what i am talking about and steffen understood what i meant right from my 1st post.


    So by using attribute construction it can be done but imagine building new attributes for 60 input variables! so the question is whether some node can be used to calculate all this information for all -say- 60 attributes and i guess this cannot happen (?) as steffen originally said.

    @haddock

    It appears that you still don't get it but may be i am wrong...can you do the same example that you last posted for 60 input variables? How much time will it take you to do it? Let alone also having to do a log transformation to each of 60 variables to fix their skewed distributions...
  • steffen Member Posts: 347 Maven
    Hello
    hgwelec wrote:

    @keith,
    This is what i am talking about and steffen understood what i meant right from my 1st post.
    I'd like to see myself in such a glorious light, but sorry: I did understand it exactly as haddock did until keith made your point clear.

    @haddock:
    I did not know the operator MovingAverage yet ... really nice. However, it seems the calculation of stdev is messed up, isn't it ?

    @hgwelec:
    The second process of haddock does exactly what you want. He was able to calculate the cluster centroids for the denormalized (ie. not normalized) values and hence the denormalized cluster centers (this is only correct if the cluster centroids of the cluster operator are calculated as mean .. which is correct for KMeans). The issue of scalability remains, but: Either you add an entry for each attribute in the aggregation operator manually OR you use a loop .... in JAVA, which means hacking an operator yourself. I do not see another option.

    Again we have faced an example of the law of leaky abstraction ...

    kind regards,

    Steffen

    PS: the process of haddock is ok, but I did not check the calculation of the values by an example (just to be sure) .. my head is a little fuzzy today...
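The equivalence mentioned above (for mean-based centroids such as KMeans) can be checked numerically. The sketch below is in Python, not RapidMiner, and uses made-up values and cluster labels; it assumes the centroid of a cluster is the mean of its members. Since the z-score is a linear transformation, averaging the original values per cluster gives the same result as denormalizing the averaged z-scores.

```python
import statistics
from collections import defaultdict

# Hypothetical single attribute and hypothetical cluster assignments.
raw = [2.0, 4.0, 6.0, 20.0, 22.0, 24.0]
labels = [0, 0, 0, 1, 1, 1]

mean = statistics.mean(raw)
std = statistics.pstdev(raw)  # population std; any fixed std works here
z = [(v - mean) / std for v in raw]

def cluster_means(values, labels):
    """Average the values of each cluster (mean-based centroid)."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: statistics.mean(vs) for label, vs in groups.items()}

# Route 1: label the original rows and average them directly.
raw_means = cluster_means(raw, labels)

# Route 2: average the normalized rows, then undo the z-score.
z_centroids = cluster_means(z, labels)
denormalized = {label: c * std + mean for label, c in z_centroids.items()}
```

Both routes agree up to floating-point rounding. The argument does not carry over to centers that are not means, which is why the KMeans caveat matters.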
  • haddock Member Posts: 849 Maven
    Greets Steff!
    steffen wrote: "I did not know the operator MovingAverage yet ... really nice. However, it seems the calculation of stdev is messed up, isn't it ?"
    Needs checking - but if you think so, that'll do for me. You'll probably understand if I say that my interest in this thread has waned somewhat ;D

    Reminds me of an old Oxford philosophy exam story.....

    Is this a question?

    Yes, if this is an answer.




  • keith Member Posts: 157 Guru
    haddock wrote:

    Nice one Keith,

    Now that I do understand, and curiously he'll still need the original/raw data

    I think this does the necessary.

    Ah, clever. Using the moving average to create a window that spans the entire dataset, and calculating the mean/stdev. Wouldn't have thought to approach it that way.

    and this works out the average for each cluster - just added a change of role on the cluster and an OLAP operator to my original offering.

    Also a smarter approach to the problem than I would have thought of. I was fixated on trying to access the centroid values and convert them back to the original, non-normalized scale. Instead, you're labelling all the original data rows with the cluster, and calculating the means directly. Clever...

    If it was possible to access the centroid values directly and apply the mean/stdev calculations from your first code sample, that would probably be a more scalable solution than joining the data to itself and computing the sum/stdev across the entire data set (depends on how many rows he's dealing with). It would also (I think) handle the case where the cluster centers are calculated by something other than mean (as steffen alludes to). But what you presented certainly solves the problem as presented. Thanks, I learned something today.

    haddock wrote: "Thanks again for bringing clarity to the question, how we were meant to get that from the original question remains a mystery to me."
    That's what's great about having a forum where you get many eyeballs looking at a question. For example, to me, when I read:

    The problem now is that i want to de-normalize values of all 20 fields to the original values so that cluster values make sense.
    ... it was pretty quickly apparent that, even if he didn't have the terminology quite right, he was talking about data that describe the clusters ("cluster values" a.k.a. centroids), and meant "original scale" rather than "original values". But I never would have come up with the solution haddock did.

    Despite the frustrations expressed on this thread, this forum is still a friendlier place for earnest newbies (which I was not that long ago) to learn RapidMiner than the R-help list is for R, and is one of the many things I think is great about RM.

    Keith


  • haddock Member Posts: 849 Maven
    Hello Keith!

    Both you and Steffen come out of this episode as very solid citizens who deserve the respect you get, so many thanks to you both on behalf of all Rapido heads.
    keith wrote: "Despite the frustrations expressed on this thread, this forum is still a friendlier place for earnest newbies (which I was not that long ago) to learn RapidMiner than the R-help list is for R, and is one of the many things I think is great about RM."
    I've learnt from two sources, Ralf's most excellent course, and trying to answer the puzzles set right here, so absolutely spot on, my friend, spot on.

    :)

  • Shubha Member Posts: 139 Guru
    Hi,
    hgwelec wrote: "So by using attribute construction it can be done but imagine building new attributes for 60 input variables! so the question is whether some node can be used to calculate all this information for all -say- 60 attributes"


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>
    <parameter key="remove_from_process" value="false"/>

    <parameter key="return_preprocessing_model" value="true"/>
    <parameter key="create_view" value="true"/>

    <parameter key="name" value="original"/>
    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="remove_double_attributes" value="false"/>

    <parameter key="skip_features_with_name" value="att[0-9]*"/>

    <parameter key="replace_what" value="_from_ES2"/>
    <parameter key="apply_on_special" value="false"/>

    <parameter key="name" value="cluster"/>

    <parameter key="attribute" value="cluster"/>

    <parameter key="condition_class" value="attribute_value_filter"/>
    <parameter key="parameter_string" value="cluster=%{loop_value}"/>

    <parameter key="condition_class" value="attribute_name_filter"/>
    <parameter key="parameter_string" value="att.*"/>
    <parameter key="apply_on_special" value="true"/>

    <parameter key="attribute_name" value="Centroid_%{loop_value}"/>
    <parameter key="aggregation_attributes" value="att_.*"/>
    <parameter key="aggregation_function" value="average"/>
    <parameter key="keep_all" value="false"/>

    <parameter key="remove_double_attributes" value="false"/>



    This process is simply haddock's, with the Aggregation operator replaced by a set of operators at the end. Also, as pointed out, the same averaging approach cannot be used if, say, you are dealing with KMedoids.
  • Shubha Member Posts: 139 Guru
    keith wrote: "If it was possible to access the centroid values directly and apply the mean/stdev calculations from your first code sample, that would probably be a more scalable solution than joining the data to itself and computing the sum/stdev across the entire data set (depends on how many rows he's dealing with). It would also (I think) handle the case where the cluster centers are calculated by something other than mean (as steffen alludes to)."

    Below is a tricky (in fact, a very tricky) way of extracting the centroid values directly from the model.


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="k" value="3"/>

    <parameter key="result_file" value="Z:\Clus.csv"/>

    <parameter key="filename" value="Z:\clus.csv"/>
    <parameter key="read_attribute_names" value="false"/>
    <parameter key="column_separators" value=";\s*"/>
    <parameter key="trim_lines" value="true"/>

    <parameter key="condition_class" value="attribute_value_filter"/>
    <parameter key="parameter_string" value="att1=.*\t.*|Cluster \d"/>

    <parameter key="attributes" value="att1"/>
    <parameter key="split_pattern" value=" "/>

    <parameter key="mid" value="if(att1_2>1,1,att1_2)"/>

    <parameter key="attribute_name" value="mid"/>
    <parameter key="keep_original_attribute" value="false"/>

    <parameter key="condition_class" value="attribute_value_filter"/>
    <parameter key="parameter_string" value="att1_1=.*\t.*"/>

    <parameter key="attributes" value="att1_1"/>
    <parameter key="split_pattern" value=":\t"/>

    <parameter key="condition_class" value="attribute_name_filter"/>
    <parameter key="parameter_string" value="att1_2"/>
    <parameter key="invert_filter" value="true"/>

    <parameter key="old_name" value="att1_1_2"/>
    <parameter key="new_name" value="centroid"/>

    <parameter key="old_name" value="cumulative(mid)"/>
    <parameter key="new_name" value="cluster_num"/>

    <parameter key="group_attribute" value="cluster_num"/>
    <parameter key="index_attribute" value="att1_1_1"/>
    <parameter key="consider_weights" value="false"/>



    A Note:
    1. This method can be applied even for KMedoids. In other words, it also sidesteps the issue of "What if the cluster centers are not the mean?".
    2. The centroid values are accurate to three decimal places, because they are read as-is from the "Text View" of the model. If the "Text View" gave, say, five digits after the decimal point, the resulting ExampleSet would contain five as well.


    Best,
    Shubha Karanth
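The same idea, reading centroid values out of the model's text output, can be sketched in a few lines of Python. The layout assumed here (a `Cluster N` header line followed by `name:<TAB>value` lines) is inferred only from the filter and split patterns in the process above and is not a confirmed file format; the sample text and the function name are hypothetical.

```python
import re

def parse_centroid_text(text):
    """Parse an assumed 'Cluster N' / 'name:<TAB>value' text layout.

    Returns {cluster_number: {attribute_name: value}}.
    """
    centroids = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"Cluster (\d+)", line)
        if header:
            # Start collecting values for a new cluster.
            current = int(header.group(1))
            centroids[current] = {}
        elif current is not None and ":\t" in line:
            name, value = line.split(":\t", 1)
            centroids[current][name] = float(value)
    return centroids

# Hypothetical text with three decimals per value.
sample = (
    "Cluster 0:\n"
    "att1:\t0.123\n"
    "att2:\t-1.500\n"
    "Cluster 1:\n"
    "att1:\t-0.456\n"
    "att2:\t0.750\n"
)
parsed = parse_centroid_text(sample)
```

As in note 2 above, the parsed values can only be as precise as the digits printed in the text.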
  • haddock Member Posts: 849 Maven
    Hi Shubha,

    I think there is a problem with your first example, because it only covers the case where there are two clusters, and with the second there is no data by the time of the first split, so I'm not sure why it is here at all . Bemused readers should run to the break, like this ( I've just removed the drive letter and put in a break )...


    <parameter key="target_function" value="random"/>
    <parameter key="number_of_attributes" value="20"/>

    <parameter key="k" value="3"/>

    <parameter key="result_file" value="Clus.csv"/>

    <parameter key="filename" value="clus.csv"/>
    <parameter key="read_attribute_names" value="false"/>
    <parameter key="column_separators" value=";\s*"/>
    <parameter key="trim_lines" value="true"/>

    <parameter key="condition_class" value="attribute_value_filter"/>
    <parameter key="parameter_string" value="att1=.*\t.*|Cluster \d"/>

    <parameter key="attributes" value="att1"/>
    <parameter key="split_pattern" value=" "/>

    <parameter key="mid" value="if(att1_2>1,1,att1_2)"/>

    <parameter key="attribute_name" value="mid"/>
    <parameter key="keep_original_attribute" value="false"/>

    <parameter key="condition_class" value="attribute_value_filter"/>
    <parameter key="parameter_string" value="att1_1=.*\t.*"/>

    <parameter key="attributes" value="att1_1"/>
    <parameter key="split_pattern" value=":\t"/>

    <parameter key="condition_class" value="attribute_name_filter"/>
    <parameter key="parameter_string" value="att1_2"/>
    <parameter key="invert_filter" value="true"/>

    <parameter key="old_name" value="att1_1_2"/>
    <parameter key="new_name" value="centroid"/>

    <parameter key="old_name" value="cumulative(mid)"/>
    <parameter key="new_name" value="cluster_num"/>

    <parameter key="group_attribute" value="cluster_num"/>
    <parameter key="index_attribute" value="att1_1_1"/>
    <parameter key="consider_weights" value="false"/>


    Perhaps you could explain what I've missed ? ;D

    Good weekend!
  • hgwelec Member Posts: 31 Guru
    It appears that the way I described my problem was not the right one.

    Other users have said that my terminology was not correct. I have no reason to think otherwise, and I have to agree: it wasn't.

    But the essence of discussions in this forum is both to solve our problems *and* to draw some insights into how RM can become better. Even though JAVA code could be a solution (when the dataset contains MANY attributes), for users who do not have the necessary programming skills the problem cannot be easily fixed.

    Since normalization prior to any clustering process is usually required, perhaps a De-Normalize node would prove to be very useful.


    Many Thanks!
  • haddock Member Posts: 849 Maven
    And I still disagree!