Best way to handle imbalanced data

GeezerDoc · Member · Posts: 5 · Contributor I
I would very much appreciate some guidance. Because RMS combines an unusual validation method with data sampling, the test set ends up small, even with hundreds of cases, when the class of interest is in the minority. I have read all of the posts about balancing data in the Community Forum and watched three videos on the subject. The data set I am currently working with has 91 patients with dementia and 242 controls. When I loaded it into WEKA and added SMOTE, the AUC increased from 0.725 to 0.898, a substantial improvement. I used WEKA simply because SMOTE was an easy filter to add. Several RM forum postings suggest that upsampling techniques will not affect performance, which was not my experience.

I'm limited to smaller medical datasets when demonstrating and teaching binary classification to clinical students, which results in very small numbers in the confusion matrix. What do you recommend: sampling, bootstrapping, SMOTE, or something else? Truthfully, I have spent most of my time with TurboPrep and AutoModel, so I was unable to figure out how to add SMOTE to the process pipeline. I would appreciate your thoughts.

Answers

  • MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor · Posts: 3,368 · RM Data Scientist
    I personally would also opt for some SMOTE-based analysis, though you need to be careful not to trick your validation: the synthetic examples should only go into the training data, never into the data you test on.

    @yyhuang recently did quite a lot with it; maybe she can jump in with some best practices?
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
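
To make the caution about not tricking the validation concrete, here is a minimal sketch in plain Python. It is not RapidMiner's or WEKA's SMOTE operator: `smote` and `stratified_split` are hypothetical helpers written for illustration, the two-feature data is made up, and the 30% hold-out is an assumption. Only the class sizes (91 cases, 242 controls) come from the thread. The point it shows is the ordering: split first, then oversample the training portion only.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours to create n_new synthetic points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def stratified_split(rows, labels, test_fraction, rng):
    """Hold out the same fraction of each class for testing."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += [(rows[i], cls) for i in idx[:n_test]]
        train += [(rows[i], cls) for i in idx[n_test:]]
    return train, test


rng = random.Random(42)
# Made-up data with the thread's class sizes: 91 cases (1), 242 controls (0).
rows = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(91)]
rows += [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(242)]
labels = [1] * 91 + [0] * 242

# Step 1: split BEFORE any oversampling.
train, test = stratified_split(rows, labels, test_fraction=0.3, rng=rng)

# Step 2: oversample the training minority only; the test set stays untouched
# and keeps its natural class imbalance.
train_minority = [x for x, y in train if y == 1]
n_majority = sum(1 for _, y in train if y == 0)
new_points = smote(train_minority, n_new=n_majority - len(train_minority), rng=rng)
balanced_train = train + [(x, 1) for x in new_points]
```

If SMOTE were applied to the full data set before splitting, synthetic points interpolated from test-set patients could land in the training data, which can make the measured AUC look better than it would be on genuinely new patients. With cross-validation the same rule applies inside every fold.
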
  • GeezerDoc · Member · Posts: 5 · Contributor I
    Thanks for that insight. It seems to me that imbalanced data remains a huge challenge for biomedical datasets. There does not seem to be a magic bullet or a firm consensus on the right approach. When teaching machine learning basics, we can warn students that accuracy is misleading and that precision-recall curves may be more informative than ROC AUC. What else should we be telling those new to machine learning?
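
For students, the warning about accuracy can be shown with a few lines of arithmetic. The sketch below uses the class sizes from this thread (91 dementia cases, 242 controls) and a hypothetical "classifier" that labels every patient as a control. The function name `confusion_metrics` and the convention of reporting precision as 0.0 when nothing is predicted positive are assumptions made for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is predicted positive;
    # reporting 0.0 in that case is a convention chosen here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# A model that predicts "control" for all 333 patients:
# no true positives, all 91 dementia cases are missed.
accuracy, precision, recall = confusion_metrics(tp=0, fp=0, fn=91, tn=242)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.727 precision=0.000 recall=0.000
```

An accuracy of about 73% sounds respectable, yet this model never identifies a single patient with dementia, which is why recall and precision on the class of interest are worth reporting alongside it.
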