Best way to handle imbalanced data
I would very much appreciate some guidance. Because RMS combines an unusual validation method with data sampling, the test set ends up small, even with hundreds of cases, when the class of interest is not balanced. I have read all of the posts about balancing data in the Community Forum and watched three videos on the subject. The data set I am currently interested in has 91 patients with dementia and 242 controls. When I uploaded the dataset to WEKA and added SMOTE, the AUC increased from 0.725 to 0.898, a substantial improvement. I used WEKA simply because SMOTE was an easy filter to add. Several RM forum postings suggest that upsampling techniques will not affect performance, which was not my experience.
I'm stuck with smaller medical datasets to demonstrate and teach binary classification to clinical students, which results in very small numbers in the confusion matrix. What do you recommend: sampling, bootstrapping, SMOTE, etc.? Truthfully, I have spent most of my time with TurboPrep and AutoModel, so I was unable to figure out how to add SMOTE to the process pipeline. I would appreciate your thoughts.
Best Answers
jacobcybulski Member, University Professor Posts: 391 Unicorn
You need to be careful with SMOTE, especially when you have dramatically unbalanced data or a polynomial label with a daisy chain of SMOTE operators. You may end up with a huge proportion of synthetic data compared to real data, and hence a biased model, especially since you have a very small data set. Also, make sure you use SMOTE (and other resampling methods) for model training only and keep the untouched data for validation, so that your validation partition reflects the population. Alternatively, you may need to hand-recalculate all your performance measures, because a resampled validation partition no longer agrees with your priors.
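As a rough illustration of the "resample the training partition only" advice, here is a minimal sketch in plain Python. It is not the RapidMiner or WEKA SMOTE operator: `smote` and `stratified_split` are hypothetical helpers written for this example, the SMOTE step is a simplified version (interpolation between a minority point and one of its nearest minority neighbours), and the feature values are made up. Only the class counts (91 cases, 242 controls) come from the question.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def stratified_split(y, test_fraction=0.3, rng=None):
    """Hold out the same fraction of each class so the test set keeps the priors."""
    rng = rng or random.Random(0)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Hypothetical toy data: 91 cases (label 1), 242 controls (label 0),
# two invented numeric features per patient.
rng = random.Random(42)
X = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(91)] + \
    [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(242)]
y = [1] * 91 + [0] * 242

train_idx, test_idx = stratified_split(y, rng=rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# Oversample the minority class in the TRAINING partition only.
minority = [x for x, lab in zip(X_train, y_train) if lab == 1]
n_majority = y_train.count(0)
new_points = smote(minority, n_new=n_majority - len(minority), rng=rng)
X_train_bal = X_train + new_points
y_train_bal = y_train + [1] * len(new_points)

print("train (balanced):", y_train_bal.count(1), "vs", y_train_bal.count(0))
print("test (untouched):", y_test.count(1), "vs", y_test.count(0))
```

Note how much of the balanced training set is synthetic here: well over half of the minority rows are interpolated rather than real, which is exactly the proportion to keep an eye on with a data set this small. In a cross-validation setting, the same oversampling step would be repeated inside each training fold.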
varunm1 Moderator, Member Posts: 1,207 Unicorn
Hello @GeezerDoc
Two things from my side.
1. Models validated on sampled datasets sometimes fail miserably on real-world problems, because the class imbalance cannot be eliminated from real-world settings. Applying sampling on the training side only can mitigate this to some extent.
2. The kappa value is a good metric for understanding overall model performance (on balanced or imbalanced datasets).
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
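To make the kappa point concrete, here is a small sketch that computes Cohen's kappa from the four cells of a binary confusion matrix. The function name `cohen_kappa` and the counts are hypothetical, chosen only for illustration (100 held-out patients, 27 cases and 73 controls); they are not results from the dataset in the question.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and
    p_e is the agreement expected by chance from the row/column totals.
    """
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n
    actual_pos, actual_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    p_e = (pred_pos * actual_pos + pred_neg * actual_neg) / (n * n)
    if p_e == 1.0:
        return 0.0  # degenerate case: only one class present and predicted
    return (p_o - p_e) / (1.0 - p_e)

# A model that labels everyone as a control: 73% accuracy, but kappa is 0.
always_control = cohen_kappa(tp=0, fn=27, fp=0, tn=73)

# A model that finds 18 of the 27 cases at the cost of 10 false alarms.
useful_model = cohen_kappa(tp=18, fn=9, fp=10, tn=63)

print("always-control accuracy:", (0 + 73) / 100, "kappa:", always_control)
print("useful model accuracy:", (18 + 63) / 100, "kappa:", round(useful_model, 3))
```

The first model looks acceptable on accuracy alone purely because of the class priors, while kappa shows it does no better than chance; that gap is why kappa is worth reporting next to accuracy on imbalanced data.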
Answers
@yyhuang recently did quite a lot with it; maybe she can jump in with some best practices?