"Rule induction: correlated variables and cross-validation of individual rules"
Hi,
I have a couple of questions about rule induction in RapidMiner. I am a bit of a novice in data mining.
First of all, the reason I am using rule induction instead of other learning techniques is that it is very important to generate a classifier that can be interpreted by a human. Although a classifier like an SVM might perform well, its model would likely not be of any use to us.
I have a problem where certain variables are always going to be positively correlated (this is known ahead of time and is just the nature of the variables). Although these variables may have different thresholds, if one of them is found to need a certain minimum value for an example to be assigned to a given class, then the other variable should also have a minimum value, not a maximum value. I often get rules where one of these positively correlated variables is given a minimum (x > ..) and the other is given a maximum (y < ..), which clearly indicates overfitting. Is there a way to specify in the dataset that certain variables are positively correlated so that such rules will never be examined?
In my dataset it also only makes sense for certain variables to have thresholds, not ranges. Similarly, whenever I see rules like x > min and x < max, it is a case of overfitting, and I would like to tell the learner not to attempt such combinations, if this is possible.
I was also wondering if there is a way to perform cross-validation on rules independently. We only have a subset of the variables that would be required to build a proper classifier, and we are aware that in many, probably most, cases we cannot classify based on the variables we have. I am, however, interested in the cases where we can classify, that is, the individual rules that provide strong evidence for a given outcome. Cross-validation in RapidMiner, as I have used it, performs poorly because it is based on trying to classify everything. I would like to see how well the best individual rules perform on unseen data, rather than how well the entire rule set performs on unseen data. Is this possible?
Anybody have any suggestions?
Thanks in advance,
Barsh
Answers
Welcome to the forum. Hmm, I am afraid this will not be possible without modifying the code. You could add a new parameter that restricts which types of relations are allowed during rule creation, for example only "greater equals".
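Until such a parameter exists, one possible workaround is to check the induced rules after the fact, outside RapidMiner. The following is only a minimal Python sketch of that check, not RapidMiner code: the `Condition` class, the `ALLOWED` table and the function name are all hypothetical, and it assumes you have exported the rule conditions in some form you can parse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One test inside a rule, e.g. x > 3.5 (hypothetical representation)."""
    attribute: str
    operator: str      # ">" or "<="
    threshold: float


# Hypothetical domain knowledge: x and y are positively correlated and may
# only ever be bounded from below. Attributes not listed are unconstrained.
ALLOWED = {"x": ">", "y": ">"}


def rule_is_admissible(conditions, allowed=ALLOWED):
    """Return True if every condition uses the operator allowed for its attribute.

    This rejects both problem cases from the question: a minimum on x combined
    with a maximum on y, and a range (x > min and x <= max) on one attribute.
    """
    return all(
        allowed.get(c.attribute, c.operator) == c.operator
        for c in conditions
    )
```

A rule such as `[Condition("x", ">", 1.0), Condition("y", "<=", 2.0)]` would be flagged as inadmissible. Note that this only filters rules after learning; it does not stop the learner from spending its search on such combinations.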
I am not sure if I got the point, but I think those could be two different issues. You could, for example, use the operator "Drop Uncertain Predictions" so that only predictions with a strong confidence are considered during cross-validation. This way you would end up with "ok, I have only classified 40% of the data, but for those 40% we have been correct in 98% of the cases..."
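To make the two numbers concrete, here is a small Python sketch of the computation (an illustration only, not what the operator does internally; the function name, the tuple layout and the 0.9 threshold are assumptions):

```python
def coverage_and_accuracy(predictions, min_confidence=0.9):
    """Summarise predictions after dropping the uncertain ones.

    predictions: list of (true_label, predicted_label, confidence) tuples.
    Returns (coverage, accuracy), where coverage is the fraction of examples
    that were kept and accuracy is measured on the kept examples only.
    """
    kept = [(t, p) for t, p, conf in predictions if conf >= min_confidence]
    coverage = len(kept) / len(predictions) if predictions else 0.0
    # Accuracy is undefined when nothing passes the confidence threshold.
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else float("nan")
    return coverage, accuracy
```

Reporting both values together matters: raising the threshold will usually improve the accuracy, but only because fewer examples are being classified.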
Talking about single rules and their performance is in most cases not really sensible, is it? The individual rules have been created on parts of the data set only, and each rule is only applicable to the parts of the data set not already covered by preceding rules. So I would not expect single-rule performance to be a good measure for combined rule sets anyway.
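The difference is easy to see if you count, on held-out data, how often each rule fires on its own versus how often it is the first rule to fire in the ordered list. The Python sketch below is only an illustration under assumptions: rules are given as hypothetical (name, predicate, predicted label) triples and examples as (feature dictionary, true label) pairs.

```python
def evaluate_rules(rules, examples):
    """Count coverage and correct predictions per rule on held-out examples.

    rules: ordered list of (name, predicate, predicted_label); predicate
           takes a feature dict and returns True if the rule fires.
    examples: list of (features, true_label).
    Returns {name: {"standalone": [covered, correct],
                    "in_list": [covered, correct]}}.
    "standalone" ignores rule order; "in_list" counts an example only for
    the first rule that fires, as in an ordered rule set.
    """
    stats = {name: {"standalone": [0, 0], "in_list": [0, 0]}
             for name, _, _ in rules}
    for features, label in examples:
        already_covered = False
        for name, predicate, predicted in rules:
            if not predicate(features):
                continue
            hit = int(predicted == label)
            stats[name]["standalone"][0] += 1
            stats[name]["standalone"][1] += hit
            if not already_covered:
                stats[name]["in_list"][0] += 1
                stats[name]["in_list"][1] += hit
                already_covered = True
    return stats
```

A later rule can look much worse standalone than in the list, because standalone it also fires on examples that an earlier rule was supposed to take care of.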
Cheers,
Ingo