Show prevalence of largest class in Performance (Classification) and similar operators

Tripartio, Member, Posts: 37, Maven
When doing classification tasks, I normally use the prevalence (frequency) of the largest (modal) class as the naïve benchmark for judging whether a single model is useful. For example, if my label is binary yes and no, with yes comprising 9% of the dataset and no comprising 91%, then I would expect the accuracy of a useful model to exceed 91%. If it does not, the model is no better than naively assigning all predictions to the larger class. The same logic applies to multiple categories (e.g. three or four classes for prediction). For example, if there were three classes A, B and C distributed 30%, 40% and 30%, then the prevalence of the largest class (B) would be 40%.

My request is that the Performance (Classification) and Performance (Binominal Classification) operators add this as an option among the criteria that they output. I am not sure, but I think the formal name for this measure is "prevalence of largest class" (cf. https://en.wikipedia.org/wiki/Prevalence and https://en.wikipedia.org/wiki/Confusion_matrix#Table_of_confusion). Because the calculation is so simple, I hope it would be easy to implement. Having this handy as an output option would be more convenient than pulling out a calculator each time, which is what I have to do now.
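
A minimal Python sketch of the benchmark described in this post, assuming the labels are available as a plain list. The function name `largest_class_prevalence` is hypothetical and is not part of RapidMiner; it only illustrates the arithmetic.

```python
from collections import Counter


def largest_class_prevalence(labels):
    """Return (modal_class, prevalence) for a sequence of class labels.

    Hypothetical helper for illustration; ties are resolved in favour of
    the class that appears first in the data.
    """
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    modal_class, modal_count = counts.most_common(1)[0]
    return modal_class, modal_count / len(labels)


# Binary example from the post: 9% "yes", 91% "no".
binary_labels = ["yes"] * 9 + ["no"] * 91
binary_baseline = largest_class_prevalence(binary_labels)

# Three-class example from the post: A, B, C at 30%, 40%, 30%.
multi_labels = ["A"] * 30 + ["B"] * 40 + ["C"] * 30
multi_baseline = largest_class_prevalence(multi_labels)

print(binary_baseline)
print(multi_baseline)
```

The returned prevalence is the accuracy that a model would reach by always predicting the modal class, so a model's accuracy can be compared against it directly.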




Answers

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,355, RM Data Scientist
    Hi,
    I usually use Cohen's Kappa for this: https://en.wikipedia.org/wiki/Cohen's_kappa
    This is basically 'How much am I better than the default classification'.

    The other thing I frequently do is calculate the accuracy/ROI of a default model. The default model may be the 'naive' prediction of always predicting the majority class. Have a look at the Default Model operator for it.

    Best,
    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
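
To make the kappa suggestion above concrete, here is a small Python sketch of the standard Cohen's kappa formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the agreement expected by chance from the label and prediction frequencies. The function name `cohens_kappa` and the example predictions are made up for illustration; this is not RapidMiner's implementation. With the 9%/91% example from the first post, a model that always predicts the majority class scores 91% accuracy but a kappa of 0.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa between true labels and predictions (illustrative sketch)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: plain accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (one class only); returning 0.0 is an assumption.
        return 0.0
    return (observed - expected) / (1.0 - expected)


y_true = ["yes"] * 9 + ["no"] * 91

# Majority-class ("default") predictions: 91% accurate, no better than chance.
y_majority = ["no"] * 100
kappa_majority = cohens_kappa(y_true, y_majority)

# A made-up model: 5 of 9 "yes" and 89 of 91 "no" correct (94% accuracy).
y_model = ["yes"] * 5 + ["no"] * 4 + ["no"] * 89 + ["yes"] * 2
kappa_model = cohens_kappa(y_true, y_model)

print(kappa_majority, kappa_model)
```

In this sketch the made-up model comes out at a kappa of roughly 0.59, while the majority-class predictor stays at 0 despite its high accuracy.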
  • Tripartio, Member, Posts: 37, Maven
    Hi Martin,
    The accuracy of the Default Model set to "mode" indeed gives me exactly what I am looking for. But this means a completely different operator and half a process just to get one number that I need every time I run a classification. So, as I said, the calculation is very simple; my request is to make it readily accessible right where I need it.
    I tried the kappa (Cohen's kappa), but I have no idea what it is supposed to tell me in my case. Could you please clarify how its interpretation would answer my question of whether a given model is better than the accuracy of the default (mode) model?
    Regards,
    Chitu

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,355, RM Data Scientist
    I totally understand your point and what you are looking for. Have a look at https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english. Maybe this also satisfies your needs.

    Best,
    Martin

  • Tripartio, Member, Posts: 37, Maven
    Thanks for the Stack Exchange link that explains how to use and interpret Cohen's Kappa. That is helpful. However, while it does introduce me to a useful alternative to accuracy as a performance measure, it does not replace my request for a benchmark against which to judge whether a model is useful. (The best answer provided admits that there is no objective standard of what is a good kappa; it can only be used to compare two or more models, not to compare a model with itself.)
    So, my request for adding prevalence of largest class as an option in Performance (Classification) remains. Would that be possible?
    Regards,
    Chitu

  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,355, RM Data Scientist
    edited November 2020
    Here is a process which takes a data set and calculates your prevalence. I am cheating a bit at the end to also name it correctly*. We can turn this into a Custom operator in about 5 clicks.

    Best,
    Martin

    *: Probably you don't want to do this...















































































    <parameter key="momentum_stable" value="0.0"/>
































  • Tripartio, Member, Posts: 37, Maven
    Thanks for the process; that is very interesting. Could you please point me to the documentation for how I would turn it into a custom operator? Actually, I did create the custom operator (I think), but I cannot figure out how to add it to a new process.
    In any case, thank you for submitting a feature request ticket for my original request. That is really what I would like--to have prevalence added to the list of options for display, rather than having a dedicated operator just for that.
    Regards,
    Chitu
  • Tripartio, Member, Posts: 37, Maven
    To broaden this request, perhaps RapidMiner could consider a way to let users add custom measures to the Performance operators. I'm thinking of something like the functionality in the Generate Attributes operator, which would let the user write the expression formula for the operator they want. Without something like that, people like me will probably keep on asking for our own preferred measures to be added to the list.
    Just an idea.
    Regards,
    Chitu
  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,355, RM Data Scientist
    edited November 2020
    Hi,

    There is also a way to build custom measures. The Extract Performance operator allows you to define any performance measure you want.

    Best,
    Martin
