"Balanced sampling for network training?"
chaosbringer
Member, Posts: 21, Maven
Hi,
I have a very imbalanced sample set, e.g. 99% true and 1% false. Is it reasonable to select a balanced subset with a 50/50 distribution for neural network training? My reasoning is that training on the original dataset may bias the model towards the true samples.
Can you suggest some literature that covers this topic, especially for neural networks?
Thank you very much,
chaosbringer
Answers
I recommend asking the question on http://stats.stackexchange.com/
Although we try to answer general questions about data mining here, the number of experts with time to spare is quite low. Nevertheless, it would be great if you posted the link to the question here (if you are going to ask there).
greetings,
steffen
http://stats.stackexchange.com/questions/6254/balanced-sampling-for-network-training
Dikran Marsupial:
Yes, it is reasonable to select a balanced dataset; however, if you do, your model will probably over-predict the minority class in operation (or on the test set). This is easily overcome by using a threshold probability that is not 0.5. The best way to choose the new threshold is to optimise it on a validation sample that has the same class frequencies as encountered in operation (or in the test set).
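The threshold choice described above could be sketched roughly as follows. This is only an illustration: the function names, the choice of F1 as the criterion and the validation numbers are all made up for the example, and in practice the validation sample should keep the class frequencies seen in operation.

```python
def f1_score(labels, preds):
    """F1 for the class encoded as 1 (assumed here to be the rare class)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def pick_threshold(probs, labels, score_fn=f1_score):
    """Return the cut-off on predicted P(class 1) that maximises score_fn."""
    best_t, best_score = 0.5, float("-inf")
    # Only the observed probabilities can change the predictions,
    # so they are the only candidate thresholds worth trying.
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = score_fn(labels, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical validation outputs of a network trained on balanced data:
# it gives fairly high probabilities to many majority-class examples.
val_probs = [0.10, 0.20, 0.35, 0.55, 0.60, 0.62, 0.70, 0.80, 0.90, 0.95]
val_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

threshold, score = pick_threshold(val_probs, val_labels)
print(threshold, score)
```

With these toy numbers the default cut-off of 0.5 would flag seven examples as class 1, while the tuned cut-off keeps only the two real ones. Any other criterion (for instance a misclassification cost) can be passed as `score_fn`.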
Rather than re-sample the data, a better thing to do would be to give different weights to the positive and negative examples in the training criterion. This has the advantage that you use all of the available training data. The reason that a class imbalance leads to difficulties is not the imbalance per se. It is more that you just don't have enough examples from the minority class to adequately represent its underlying distribution. Therefore if you resample rather than re-weight, you are solving the problem by making the distribution of the majority class badly represented as well.
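As a rough sketch of what re-weighting the training criterion means, here is a class-weighted cross-entropy in plain Python. The function name and the weight values are assumptions chosen for illustration (weights inversely proportional to class frequency); it is not taken from any particular library.

```python
import math


def weighted_cross_entropy(probs, labels, w_pos=1.0, w_neg=1.0):
    """Weighted mean of the binary cross-entropy terms.

    probs  -- predicted probability of class 1 for each example
    labels -- 1 or 0 for each example
    w_pos, w_neg -- weight applied to examples of class 1 / class 0
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# 1 rare example and 99 common ones, mirroring the 99%/1% split in the question.
labels = [1] + [0] * 99
# A lazy model that always predicts "almost certainly class 0".
probs = [0.01] * 100

plain = weighted_cross_entropy(probs, labels)
# Assumed weights: each class contributes half of the total weight.
reweighted = weighted_cross_entropy(probs, labels, w_pos=50.0, w_neg=50.0 / 99)
print(round(plain, 4), round(reweighted, 4))
```

The lazy model looks good under the unweighted criterion but is penalised heavily once the single rare example carries as much weight as the 99 common ones, and no training example has been thrown away to get there.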
Some may advise simply using a different threshold rather than reweighting or resampling. The problem with that approach is that with an ANN the hidden layer units are optimised to minimise the training criterion, but the training criterion (e.g. sum-of-squares or cross-entropy) depends on the behaviour of the model away from the decision boundary rather than only near the decision boundary. As a result, hidden layer units may be assigned to tasks that reduce the value of the training criterion but do not help in accurate classification. Using re-weighted training patterns helps here, as it tends to focus attention more on the decision boundary, and so the allocation of hidden layer resources may be better.
For references, a Google Scholar search for "Nitesh Chawla" would be a good start; he has done a fair amount of very solid work on this.
http://www.google.fr/search?q=imbalanced+neural+network
Now all we have to do is work out the right one, or whether there can be a right one ;D
If I understand the post on stackexchange.com correctly, it suggests weighting the samples. I think the operator for this task is "Generate Weights (Stratified)" in RapidMiner.
However, is there a way to weight the samples if the label is numeric? Is this the purpose of the operator "Generate Weight (LPR)"? I don't really understand the use of the operator from its description.
Thank you very much.
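For readers outside RapidMiner, the idea of stratified example weights can be sketched in a few lines of Python. This is an assumption about what such weighting does in general (every class receives the same total weight); it is not the operator's actual implementation, and the function name is made up.

```python
from collections import Counter


def stratified_weights(labels, total_weight=None):
    """Give each example a weight so that every class sums to the same total."""
    counts = Counter(labels)
    if total_weight is None:
        total_weight = float(len(labels))  # keep the overall weight unchanged
    per_class = total_weight / len(counts)
    return [per_class / counts[y] for y in labels]


# Same 99%/1% split as in the original question.
labels = ["true"] * 99 + ["false"]
weights = stratified_weights(labels)

true_total = sum(w for w, y in zip(weights, labels) if y == "true")
false_total = sum(w for w, y in zip(weights, labels) if y == "false")
print(true_total, false_total)
```

Both classes end up with a total weight of about 50, so the single "false" example counts as much as all 99 "true" examples together.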
If your label is numeric, you don't have a classification problem, hence no classes and hence no class imbalance.
If you have true and false, you have no numbers. If true and false are encoded by numbers, you will need to convert the attribute to a nominal one by applying Numerical to Binominal.
Greetings,
Sebastian
A difference matrix may also be useful for preprocessing the data. Dimensional reduction may increase the ability to discriminate between true/false.
A couple of ideas; hopefully you find something that works.