non-binomial target label column in decision tree to measure accuracy

koknayaya Member Posts: 20 Contributor I
edited June 2020 in Help
How can I measure the accuracy of my model if my target label column is not a binomial attribute? It is not a yes/no type; it consists of crime types such as burglary, robbery, fraud, assault, etc.

Best Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted
    Hi,
    This cannot be answered in general; it depends on your business problem and how you solve things today. Think about predicting the outcome of a coin flip. If you make random guesses, your accuracy would be 50%. If by using machine learning you can predict the outcome with 51% accuracy, this would be sufficient. Why? Doesn't a model with only 51% accuracy sound like a bad model? Wrong! Because you can now start betting against people without this model (who only have 50% accuracy) and will become rich over time :)
    It looks like you have multiple classes, let's say 5 for the sake of the argument. If the classes were equally distributed, a random guess would lead to 20% accuracy or an 80% error rate. Getting 62% accuracy (or a 38% error rate) might be a fantastic result already - you would have cut your error rate roughly in half! Or not. Again, without understanding the business problem you want to solve, this is impossible to say.
    If, however, you have your 5 classes and one of the classes is the correct class in 62% of all cases, then a model with 62% accuracy is not very impressive in any case, since always predicting that class (and never anything else) would lead to 62% accuracy already.
    You see there is no easy answer to this, and only you or the owner of the business problem can decide if that is good or not. But comparing the value to the distribution of the classes is at least a first step to determine whether the model learned anything at all.
    Hope this helps,
    Ingo
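The two baselines described in this answer can be sketched in a few lines of plain Python. This is only an illustration: the function names and the toy label lists are made up for the example and are not data from this thread.

```python
from collections import Counter

def uniform_guess_accuracy(labels):
    """Expected accuracy of guessing uniformly at random among the classes."""
    return 1 / len(set(labels))

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def relative_error_reduction(model_accuracy, baseline_accuracy):
    """Fraction of the baseline's error rate that the model removes."""
    baseline_error = 1 - baseline_accuracy
    model_error = 1 - model_accuracy
    return (baseline_error - model_error) / baseline_error

# Hypothetical toy labels: five equally frequent classes.
balanced = ["burglary", "robbery", "fraud", "assault", "theft"] * 20
# Hypothetical toy labels: one class makes up 62% of the examples.
skewed = ["dangerous drugs"] * 62 + ["burglary"] * 20 + ["robbery"] * 18

print(uniform_guess_accuracy(balanced))               # 0.2
print(round(relative_error_reduction(0.62, 0.20), 3)) # 0.525
print(majority_baseline_accuracy(skewed))             # 0.62
print(relative_error_reduction(0.62, majority_baseline_accuracy(skewed)))  # 0.0
```

With balanced classes, a 62% model removes about half of the baseline error; with the skewed labels, the same 62% is no better than always predicting the largest class.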
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted
    Again, nobody can tell you if this is good or not. Only you can decide (or whoever the business owner is).
    A couple of observations though: the classes RAPE and KIDNAPPING are (fortunately!) very rare events. There is only one "kidnapping" case in the whole test set and I would assume it is extremely rare in the training set as well. It is very unlikely that any model will ever be able to pick up this pattern if it is that rare. I would consider removing the class altogether.
    Although the class "rape" is more frequent, the problem here is similar and you again may decide to remove the class from the predictions altogether. If you do that, you would end up with only four classes ROBBERY, VEHICLE (something), BURGLARY, and DANGEROUS DRUGS. There is less chance that models are confused if the tiny classes are removed although it will likely not move the needle a lot. Anyway, every little bit may help.
    Now I would try a couple of different model types (starting with Auto Model first) and see where this gets you. You can then try to improve the performance of the best model(s) further with additional parameter optimization, feature engineering, or ensemble learning by opening the processes generated by Auto Model as a starting point for those optimizations.
    Finally, out of the roughly 9,000 examples in your test set above, about 4,000 have the true class DANGEROUS DRUGS. So always predicting this class already delivers roughly 44% accuracy. A model with 62% is obviously much better than that, but, again, whether it is good enough depends on the underlying problem and its owners and is not a data science question per se. Also keep in mind that some prediction errors may be more costly than others, so accuracy is not the only thing that may be of importance here.
    Welcome to the world of data science - this is where the fun begins now :)
    Best,
    Ingo
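As a rough illustration of the last two points, the sketch below recomputes the majority baseline from the approximate counts quoted above, then looks beyond plain accuracy with per-class recall and a cost-weighted error. The toy predictions, the cost values, and the helper names are assumptions made for the example; they are not taken from the poster's results.

```python
from collections import Counter

# Approximate counts quoted in the answer: about 4,000 of roughly 9,000 test examples.
print(round(4000 / 9000, 2))  # 0.44

def per_class_recall(y_true, y_pred):
    """Share of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def average_cost(y_true, y_pred, cost, default_cost=1.0):
    """Mean misclassification cost; `cost` maps (true, predicted) pairs to a penalty."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t != p:
            total += cost.get((t, p), default_cost)
    return total / len(y_true)

# Hypothetical toy data, for illustration only.
y_true = ["drugs"] * 4 + ["burglary"] * 3 + ["robbery"] * 2 + ["kidnapping"]
y_pred = ["drugs"] * 4 + ["burglary", "burglary", "drugs"] + ["drugs", "robbery"] + ["drugs"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.7
print(per_class_recall(y_true, y_pred))  # rare classes score far below the overall accuracy

# Assumed cost: missing a kidnapping is ten times worse than any other error.
print(average_cost(y_true, y_pred, {("kidnapping", "drugs"): 10.0}))  # 1.2
```

The overall accuracy hides that the rare class is never predicted, and a single costly mistake can dominate the cost-weighted error.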

Answers

  • koknayaya Member Posts: 20 Contributor I
    Thank you for answering my question! One more thing: is it acceptable if the accuracy of my prediction model is 62%?
  • koknayaya Member Posts: 20 Contributor I
    Thank you for your great answer! However, this is my prediction :/ The high percentages are only on burglary and dangerous drugs. I want to predict the crime based on place.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    One problem is that these classes are not well balanced. For instance, the kidnapping category is almost completely useless: with only 1 example, it is very unlikely to be picked up by any algorithm. You might want to consider weighting by class to help the learner. Another option worth exploring would be combining and consolidating down to fewer categories to start, such as robbery, burglary, drugs, and all other.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
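Both suggestions above, weighting by class and consolidating rare categories, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the inverse-frequency formula is one common heuristic (the same one scikit-learn uses for its "balanced" class weights), not necessarily what any RapidMiner operator does, and the helper names and toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_examples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

def consolidate(labels, keep, other="other"):
    """Map every label not listed in `keep` to a single catch-all category."""
    return [label if label in keep else other for label in labels]

# Hypothetical toy labels, for illustration only.
labels = ["drugs"] * 6 + ["burglary"] * 3 + ["kidnapping"]

weights = balanced_class_weights(labels)
print({label: round(w, 2) for label, w in weights.items()})
# {'drugs': 0.56, 'burglary': 1.11, 'kidnapping': 3.33}

merged = consolidate(labels, keep={"drugs", "burglary"})
print(Counter(merged))
# Counter({'drugs': 6, 'burglary': 3, 'other': 1})
```

Rare classes receive larger weights so that their few examples count for more during training, while consolidation trades label detail for classes with enough examples to learn from.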
  • koknayaya Member Posts: 20 Contributor I
    Thank you for the answer! I'm so excited to learn more!!
  • koknayaya Member Posts: 20 Contributor I
    @IngoRM @lionelderkrikor @Telcontar120 Thank you so much for the amazing answers! They really help! o:) <3