Questions about the k-Nearest-Neighbour implementation
I am using the k-Nearest-Neighbour operator to get a model for my example set. However, from the operator description alone I am not totally clear about how the algorithm is implemented. I checked the source code of the operator as well but it's difficult to understand.
1st question:
My example mixes numerical data and nominal data. With numerical data, there is no big problem in understanding the meaning of the term "nearest neighbour". It's different however with nominal data: For instance, there is an attribute with, let's say, 3 different possible nominal values or possibly a missing value: costs = low/medium/high/? (Btw: Is this called a 'polynominal attribute'?)
How does RapidMiner's KNN operator treat this when learning the model? Does it:
- skip such data? (Just ignoring it. The algorithm does not use this attribute for training.)
- use some kind of "binary matching decision" like: "If the current attribute value x_i is exactly the same as the target value x_j, then the neighbour is said to be 'near', whereas if they are different, the neighbour is said to be 'far'."?
- use any other algorithm?
2nd question:
Furthermore: How are missing values being treated (numerical and/or nominal)?
3rd question:
How exactly is the weight that can be applied implemented? In the context of KNN, as far as I know, the distance between x_i and x_j is multiplied by a weight
a / b, where
- a is the correlation coefficient and
- b is the standard deviation.
Is this what is meant by the "weighted_vote" parameter?
Thanks for the clarification.
Answers
If no special weights are applied to the distances, then the KNN implementation actually uses a weight of 1/k (where k is the number of nearest neighbours to be considered) per distance.
Now I understand what it does when the box is checked... (It would be nice if such things were included in the documentation. Thanks.)
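To make the 1/k remark concrete, here is a minimal Python sketch of an unweighted vote in which each of the k neighbours contributes 1/k to the confidence of its class. It is not the RapidMiner source; the function names `unweighted_vote` and `predict` are hypothetical and chosen only for illustration. How the vote changes when "weighted_vote" is checked is not shown, since the thread does not spell it out.

```python
from collections import defaultdict


def unweighted_vote(neighbour_labels):
    """Return per-class confidences when each of the k neighbours counts 1/k.

    Hypothetical helper: illustrates the 1/k weighting described above,
    not the actual KNNClassificationModel code.
    """
    k = len(neighbour_labels)
    confidences = defaultdict(float)
    for label in neighbour_labels:
        confidences[label] += 1.0 / k
    return dict(confidences)


def predict(neighbour_labels):
    """Pick the class with the highest summed confidence."""
    confidences = unweighted_vote(neighbour_labels)
    return max(confidences, key=confidences.get)


# Labels of the k = 4 nearest neighbours of some query example.
labels = ["low", "high", "low", "low"]
confidences = unweighted_vote(labels)
prediction = predict(labels)
```

With equal weights the confidences sum to 1, so they can be read as the fraction of neighbours voting for each class.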
As I found out by looking into the Java code... source: com.rapidminer.operator.learner.lazy.KNNClassificationModel.java Well, this is a typical job for the wiki, because the manual is large enough already, and explaining every parameter in detail means exp(size). The wiki should be written mainly by the community, but... I must admit that I cannot spare the time at the moment either, so I will remain silent....
Do you want to contribute?
greetings
Steffen
[quote=Steffen]
Do you want to contribute?
[/quote]
I just tried to create a wiki page about KNN. (Actually, is Learning a good category?) I seem to be having session problems, and the wiki reacts very slowly. I get thrown out and have to log in again and again. I'll try again later.
Thank you for (at least) trying to contribute to the wiki. I wanted to add an article (or rather some notes) to the wiki myself yesterday evening, but it indeed does not seem to be working correctly. Hence, we will have to check what is wrong. Maybe it is some sort of SourceForge error, but I don't really know ... so, when we come up with a solution (or the error vanishes magically), we would greatly appreciate it if you tried to add your contribution again. Unfortunately, the RapidMiner wiki is in some ways our problem child in the community, since our staff is simply busy with project work and development, and only a few community members have contributed to the wiki so far.
Cheers,
Tobias
[quote=Sourceforge]
Sorry! We could not process your edit due to a loss of session data. Please try again. If it still doesn't work, try logging out and logging back in.[/quote]
And it seems I cannot save my stuff to the server. I stored the article on my hard drive and hope to be able to upload it soon.
Furthermore, wiki formula editing does not work.
Well, we actually have not had the time to see what is wrong there. But we will try to look into this issue as soon as we can. We will keep you up to date ...
Regards,
Tobias
Nominal attributes are internally represented "pointer-style-like", which means a double value points to a certain place where the original nominal value is stored. The EuclidianDistance will simply treat these values like numerical ones, therefore if two nominal attributes are compared and have the same value, their internal double representations will be the same and thus the distance will be 0.
What will be the distance if they are not equal? Can anybody answer this?
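Taking the description above at face value, the following Python sketch shows what would follow from it: if nominal values are stored as double-valued indices and compared like numbers, equal values give a distance of 0, while unequal values give a distance equal to the difference of their indices. This is not the RapidMiner source. The helper names `build_index` and `euclidean` are hypothetical, and the assumption that indices are assigned in order of first appearance is mine, not something the thread confirms.

```python
import math


def build_index(values):
    """Map each distinct nominal value to a float index.

    Assumption: indices follow the order of first appearance.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = float(len(mapping))
    return mapping


def euclidean(a, b):
    """Plain Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical mapping for the "costs" attribute from the question.
costs = build_index(["low", "medium", "high"])

same = euclidean([costs["low"]], [costs["low"]])         # equal values
adjacent = euclidean([costs["low"]], [costs["medium"]])  # indices 0 and 1
far = euclidean([costs["low"]], [costs["high"]])         # indices 0 and 2
```

Under these assumptions, "low" would be closer to "medium" than to "high" only because of the order in which the indices were assigned, which is why a pure match/mismatch measure is often preferred for unordered nominal attributes.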
If I have not misunderstood the code, KNNRegressionModel and KNNClassificationModel always consider all attributes, including the label attribute, to build up the model. I can't see that the label attribute is excluded anywhere in the code that builds the model. This is weird: shouldn't the model explicitly not rely on the label attribute (since it is the dependent variable)?
KNNClassificationModel: I guess you refer to the following lines... (Constructor) In the for-loop the method iterator() of the class Attributes is called implicitly. Here is the code from the Interface "Attributes": ...which excludes all special Attributes, including the label.
I didn't check the KNNRegressionModel, but I guess it will be a similar call.
greetings
Steffen
Yip, Steffen is right. The loop is only performed on the regular attributes, which are delivered by the iterator() method of the class "Attributes". The special attributes (including the label) are of course skipped.
Cheers,
Ingo
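The point made by Steffen and Ingo can be sketched in Python as follows: a container whose default iteration yields only the regular attributes, so a plain for-loop never sees the label. This is an illustrative stand-in, not the RapidMiner "Attributes" interface; the class layout and the method names `add_regular`, `set_special` and `all_attributes` are hypothetical.

```python
class Attributes:
    """Hypothetical container: iterating yields regular attributes only."""

    def __init__(self):
        self._regular = []   # attributes used for the distance calculation
        self._special = {}   # role name -> attribute name, e.g. the label

    def add_regular(self, name):
        self._regular.append(name)

    def set_special(self, role, name):
        self._special[role] = name

    def __iter__(self):
        # Default iteration skips every special attribute, including the label.
        return iter(self._regular)

    def all_attributes(self):
        # Explicit access is needed to see the special attributes as well.
        return self._regular + list(self._special.values())


attributes = Attributes()
attributes.add_regular("costs")
attributes.add_regular("age")
attributes.set_special("label", "risk")

# A plain for-loop, as in the model constructor discussed above,
# only sees the regular attributes.
used_for_distance = [name for name in attributes]
```

Because the exclusion happens inside the iterator, the loop that builds the model contains no visible check for the label, which is what made the code look as if all attributes were used.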