Taking the best probability from several experiments
Hi everyone,
I'm new to data mining and RapidMiner, and I'm having some difficulty figuring out how to set up an experiment my researcher friend has told me about.
I need to classify records, for which I'm using Nearest Neighbor with nominal values. There are seven possible labels for each record, and each of the 19 attributes in the record is a nominal value. I've been told the data is far too noisy to classify into seven distinct sets in one go, so what I should do is try and classify by running a binary split over each label: "Is Label 1, Is Not Label 1", "Is Label 2, Is Not Label 2"... and then using the result with the highest confidence as being the actual label.
E.g. if "Is Label 1" has a confidence of 70% and "Is Label 2" has 90%, I should use Label 2.
I have no idea how to set up this experiment. I don't believe Nearest Neighbor is suitable for this, but I don't know what learner to use. Nor do I know how to set up RapidMiner to run several experiments and choose the best output.
When I asked my friend what I should do, he came back with "I use a very expensive software package with proprietary algorithms, so I'm not sure how you would do it".
Does anyone have any ideas?
Thanks in advance!
Answers
You have a polynominal (i.e. more than two classes) classification problem. Besides directly using a learning scheme which is capable of working with such a polynominal label, there is also the possibility to divide the k-class classification problem into k 2-class classification problems in exactly the way you described here (called 1-vs-all in data mining terminology).
There is a meta learner for this called "Binary2MultiClassLearner" which can be combined with all classification schemes including nearest neighbors. Just wrap the meta learner around your learning scheme like in the following example:
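(A minimal sketch in RapidMiner 4.x process XML; the data file "mydata.aml" and k = 5 are placeholders to adapt to your own setup.)

<operator name="Root" class="Process" expanded="yes">
    <!-- load the example set; "mydata.aml" is a placeholder attribute description file -->
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="mydata.aml"/>
    </operator>
    <!-- the meta learner builds one binary (1-vs-all) model per class -->
    <operator name="Binary2MultiClassLearner" class="Binary2MultiClassLearner" expanded="yes">
        <!-- inner learner: any classification scheme can go here, e.g. k-NN -->
        <operator name="NearestNeighbors" class="NearestNeighbors">
            <parameter key="k" value="5"/>
        </operator>
    </operator>
</operator>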
The meta model will then decide for the label with the highest confidence.
So now you know what you can answer: "I use a great piece of software called RapidMiner for free - and I can even look inside its algorithms; there is nothing proprietary at all. Your call."
All the best,
Ingo
When the process finishes, I don't receive any usable output; I just get tabs of "Label 1 vs all other", "Label 2 vs all other" and such. When I was using cross-validation, I would get a confusion matrix as output, which was very helpful. What do I wrap the Binary2MultiClassLearner in so that it outputs results? Is it also sensible to combine this with feature selection and cross-validation? I've put it back into my previous data flow with feature selection and cross-validation, but I don't know if that makes sense or not!
Thanks ever so much!
Welcome to the world of data mining: there is no definite answer other than "just try it". Usually, for KNN learning, normalization should be applied beforehand, and feature weighting (or at least feature selection) often improves performance drastically. Last but not least, you should tune the parameter "k", i.e. the number of neighbors used. Or try a different learning scheme. Or...
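For instance, a 10-fold cross validation around the meta learner could look roughly like this (only a sketch; operator names follow RapidMiner 4.x, and the normalization step and performance settings are assumptions to adapt):

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="mydata.aml"/>
    </operator>
    <!-- normalize before distance-based learners such as k-NN -->
    <operator name="Normalization" class="Normalization"/>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="number_of_validations" value="10"/>
        <!-- training side: the 1-vs-all meta learner around k-NN -->
        <operator name="Binary2MultiClassLearner" class="Binary2MultiClassLearner" expanded="yes">
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="5"/>
            </operator>
        </operator>
        <!-- test side: apply the model and evaluate it (this produces the confusion matrix) -->
        <operator name="ApplierChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier"/>
            <operator name="Evaluation" class="ClassificationPerformance">
                <parameter key="accuracy" value="true"/>
            </operator>
        </operator>
    </operator>
</operator>

A feature selection wrapper can then be placed around the cross validation, just as in your previous setup.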
Have fun. Cheers,
Ingo
Now to play around!