Understanding mixed euclidean distance calculation for polynomial and nominal attributes

alourenco Member Posts: 5 Contributor II
edited December 2018 in Product Feedback - Resolved

Hi!

I'm aware of some previous posts about how the mixed euclidean distance is calculated. My understanding is that for numeric attributes it is the standard euclidean calculation, whereas for nominal attributes a distance of 1 is counted if the two values are not the same.
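
Written as a formula, that understanding would look roughly like the following. This is only a sketch of the rule as described in this post, not a statement of how the operator is implemented; $N$ (the numeric attributes), $C$ (the nominal attributes) and $\delta$ are notation introduced here for illustration.

```latex
d(x, y) = \sqrt{\sum_{i \in N} (x_i - y_i)^2 + \sum_{j \in C} \delta(x_j, y_j)},
\qquad
\delta(a, b) =
\begin{cases}
0 & \text{if } a = b \\
1 & \text{otherwise}
\end{cases}
```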

However, I cannot make sense of the results I am getting for a simple example where I have polynomial and nominal attributes (which I expected would be treated the same way).

The data is as follows:

REQUEST EXAMPLE

1 10 7 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

REFERENCE EXAMPLES

1 1 2 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 15 4 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 15 4 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The first column is the row id, the second column is a class attribute (ignored in the calculation), the third and fourth columns are polynomial and the rest are binomial.

The output is:

1.0 1.0 0.0
1.0 2.0 1.4142135623730951
1.0 3.0 1.4142135623730951

How can the distance between the request example and the first of the reference examples be zero? Most likely, it is a very obvious calculation but I cannot see it...
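
For reference, here is a minimal Python sketch of the distances that the rule described above would give for this data. It assumes the third and fourth columns are compared as nominal values (0 if equal, 1 otherwise) and that the id and class columns are skipped. `mixed_euclidean` is a hypothetical helper written for this post, not a RapidMiner function.

```python
import math

def mixed_euclidean(a, b):
    """Distance over nominal values only: each mismatch contributes 1."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return math.sqrt(mismatches)

# The binomial columns are FALSE in every row shown, so they never differ.
ALL_FALSE = ["FALSE"] * 24

# Third and fourth columns of each row, kept as nominal (string) values.
request = ["7", "5"] + ALL_FALSE
references = {
    1: ["2", "8"] + ALL_FALSE,
    2: ["4", "5"] + ALL_FALSE,
    3: ["4", "5"] + ALL_FALSE,
}

distances = {ref_id: mixed_euclidean(request, row) for ref_id, row in references.items()}
for ref_id, d in distances.items():
    print(ref_id, d)
```

Under that reading, reference 1 would come out at sqrt(2) ≈ 1.414 and references 2 and 3 at 1.0, which is not what the operator reports.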

I would appreciate some help!

My thanks!


Declined·Last Updated

No activity or votes since Oct 2018. Please comment and cc sgenzer if this should be reopened. RM-3793

Comments

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Can you post your XML---it is hard to see how you have your operator configured, and it could be something in the parameter setting (e.g., only looking at nominal and not numerical attributes, etc).

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    Sure! Many thanks for the prompt response.

    Mosyafa
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    So I can't see your original data here, but I created a simple test process along the lines you explained. And everything seems to be working normally here. Take a look at this process:

    [process XML not preserved]

    This seems to be working as expected. The record that is a duplicate shows a distance of zero. The ones that have differences in the 3 numerical attributes are being calculated in the expected way. And the one record that has the same numerical values but 3 different categorical attributes has a distance value of sqrt(3) as expected.
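
As a quick check of that last figure under the usual mixed Euclidean rule (squared differences for numerical attributes, 1 for each mismatching categorical attribute), with the 3 numerical attributes equal and 3 categorical attributes different:

```latex
d = \sqrt{\underbrace{0 + 0 + 0}_{\text{numerical}} + \underbrace{1 + 1 + 1}_{\text{categorical}}} = \sqrt{3} \approx 1.732
```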

    So here are a few ideas for you to troubleshoot in your own setup:

    • Are you sure you have set the role of the id field to ID in RapidMiner? If not, that will affect the outcome.
    • Are you sure that the fields that are being included in the comparison have the same attribute names? That would also affect the outcome.
    • Make sure all data types are correct (numericals are numeric and categoricals are polynominal).
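
One way to run through this checklist outside RapidMiner is to compare the metadata of the two example sets side by side. Below is a minimal Python sketch, assuming the metadata has been written down by hand as name → (value type, role) mappings. The `compare_schemas` helper and all attribute names in it are hypothetical and only serve as illustration.

```python
def compare_schemas(request, reference):
    """Return a list of readable differences between two schemas.

    Each schema maps attribute name -> (value_type, role).
    """
    problems = []
    for name in sorted(set(request) | set(reference)):
        if name not in request:
            problems.append(f"{name}: missing from request set")
        elif name not in reference:
            problems.append(f"{name}: missing from reference set")
        elif request[name] != reference[name]:
            problems.append(
                f"{name}: request is {request[name]}, reference is {reference[name]}"
            )
    return problems

# Hypothetical metadata, typed in by hand for illustration.
request_schema = {
    "id": ("integer", "id"),
    "att1": ("polynominal", "regular"),
    "att2": ("binominal", "regular"),
}
reference_schema = {
    "id": ("integer", "regular"),      # id role not set
    "att1": ("integer", "regular"),    # value type differs
    "att3": ("binominal", "regular"),  # attribute name differs
}

problems = compare_schemas(request_schema, reference_schema)
for line in problems:
    print(line)
```

An empty result would mean the two sets agree on names, types and roles, so any remaining oddity would have to come from somewhere else.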

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    There are no duplicate examples nor numeric attributes. I am attaching the data to this post.

    I am sure that both sets of examples have exactly the same number of attributes and that the attributes are named the same, have the same type, and are in the same order. The id is labelled as id, the class is a label, imput and grav are nominal attributes and the rest of the attributes are binomial.

    Many thanks!

    data.zip 33.8K
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I am confused---in the dataset you supplied, none of the conditions you specified appear to be true!

    • They do not contain the same number of attributes: "small-request" has 26 attributes but "small-test10" has 28 attributes (code and age are extras)
    • In "small-test10" all attributes are of type integer, while in "small-reference" all attributes are nominal or binominal
    • There is only one example in each dataset, and other than the extra attributes they do appear to be duplicates
    • There is no id field present, only a label

    These discrepancies would certainly explain why you are not getting the expected results. You should harmonize your datasets in terms of number of attributes and data types, correct discrepancies as needed, and try the operator again.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • alourenco Member Posts: 5 Contributor II

    So sorry, I included three datasets instead of two, hence your confusion. I'm attaching the data again (also as CSV files) and some screenshots of the data and the statistics as presented in RM.

    In short, I have one example (small request) that I want to compare against three examples (small reference).

    You are right in that the examples have the same values for all the binomial attributes. However, the values for imput and grav (the two polynomial attribs) are not always the same.

    How can the distance between the request and the reference #1 be zero if they have different values for these attributes?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Yep, I agree, these results are fishy.

    @mschmitz might know something more about what is going on with this cross-distance calculation. It doesn't seem to like those initial polynominal attributes (not the binominal ones). Is this a bug in the implementation of cross-distance? Or is there some other weird effect going on here that is not obvious?

    @sgenzer you might also remember, there was a related problem with cross-distance earlier in the year. Do you know what ever happened with this thread: https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Cross-Distances-operator-Weird-results/m-p/46161

    It looks like it was simply abandoned, but combined with this thread, it makes me think there is likely a problem with this operator...

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,376 RM Data Scientist

    Hi @alourenco, @Telcontar120,

    I've run a few tests and it looks like a bug. I will file a ticket.

    BR,

    Martin

    CC: @sgenzer

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager