Please help with document classification (nearest neighbor)! Total rapidminer novice here!

nathan Member Posts: 7 Contributor I
edited December 2018 in Help

Hi everybody,

I just found out about RapidMiner a few days ago. I work for a nonprofit and have been interested in using data science to help sort grant applications. I think it could really help. I tried following a few tutorials and guides and eventually I got here (the attached process). But when I run the process, I get an accuracy of zero. I think something is going wrong with the categories. But yeah, I'm lost, because I don't really understand the application and have just been following instructions. I'm hesitant to really get to know RapidMiner until I can see the results and know that investing time will get me where my organization needs to be.

I would really appreciate your help.

Thanks everyone

Help.rmp 11.1K

Answers

  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn

    Can you provide your source Excel data file?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    From inspecting the Read Excel operator, it looks like there is no metadata assigned. Did you go through the Import Wizard part of the operator? I would also select the attribute column that has the proposed text field and not 'all.' Also, I would try a Naive Bayes first. k-NN with k=1 is just asking for trouble.

    Yes, what @M_Martin said, a sample data file would be helpful.

  • nathan Member Posts: 7 Contributor I

    This is the Excel doc I've been using as my source data.

    I don't know what an Import Wizard is and I'll try a Naive Bayes analysis, but I'm very not confident in my abilities at this moment.

    Thanks for the responses!

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    What I want to understand is what is your ultimate goal with classifying this data?

  • nathan Member Posts: 7 Contributor I

    I'm hoping to eventually be able to cluster / automatically classify documents so we can better identify patterns in grant applications. For example, if we find a large number of grant applications that mention "gardens", "mulch", "plants", etc. then we can identify which grantees are focused on urban farming and provide specialized assistance.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Without doing any classification work, I cleaned up the process a bit. The question remains: what are you trying to classify? Year? Organization?

    <parameter key="Organization Name" value="id"/>

  • nathan Member Posts: 7 Contributor I

    Oh, I'm trying to classify the text. Text analysis is what I am attempting.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I get that, but what is the label? Meaning, you process all the text into TF-IDF vectors, and then you want to use that information to later classify what? Organizations?

    Your data set has three attribute columns: year, organization, and text. Do you want to learn a model from the text attribute to help you identify which organization it's coming from? Or which year?
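
To make the "text into TF-IDF vectors, plus a label" idea concrete, here is a minimal pure-Python sketch. It is not how RapidMiner computes TF-IDF internally (its operators apply their own tokenization and normalization); the rows, the `tokenize` helper, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for real tokenization."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical rows: (label, text). The label is the column the model must predict.
rows = [
    ("community gardens", "We will plant a garden and buy mulch"),
    ("community gardens", "Volunteers plant vegetables in the garden"),
    ("youth education", "After school tutoring for youth"),
]
labels = [label for label, _ in rows]
vectors = tfidf_vectors([text for _, text in rows])

for label, vector in zip(labels, vectors):
    top_terms = sorted(vector, key=vector.get, reverse=True)[:3]
    print(label, "->", top_terms)
```

The vectors are only the inputs. A classifier still needs the `labels` list to know what it is supposed to predict, which is the question being asked here.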

  • nathan Member Posts: 7 Contributor I

    Gotcha - I see what you're asking. I'm hoping to use the text to classify organizations.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Then you're going to have a problem. You have 60+ organizations and your data is too 'thin' to give you an accurate classification. You don't have enough example rows to train on. Hence the 0% accuracy. Try for a more reasonable number of classes, somewhere between 2 and 5 if possible, with more examples for each class.
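
A quick way to see the "thin data" problem is to count rows per label before training anything. With the figures from this thread (86 rows, 60+ organizations) that is fewer than 1.5 rows per class on average, and any class with a single row cannot appear in both the training and the test split. The sketch below uses made-up organization names and a hypothetical `label_summary` helper.

```python
from collections import Counter

def label_summary(labels):
    """Summarize how many rows each class has and which classes have only one."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {
        "rows": len(labels),
        "classes": len(counts),
        "rows_per_class": len(labels) / len(counts),
        # A class seen once is either trained on or tested on, never both.
        "singleton_classes": sorted(l for l, n in counts.items() if n == 1),
    }

# Hypothetical label column; the real one would be the organization attribute.
labels = ["Org A", "Org B", "Org C", "Org C", "Org D"]
summary = label_summary(labels)
print(summary)
```

If most classes land in `singleton_classes`, an accuracy near zero under cross-validation is the expected outcome rather than a bug in the process.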

  • nathan Member Posts: 7 Contributor I

    Would it help if each row had more text in it, rather than increasing the number of rows?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I'm afraid that won't make any difference. In almost all cases you have one piece of text for each organization. Even when you process out the stop words and prune things, the model can't properly classify it at all. Just guessing here, but you'll probably need 25 text entries per organization, which will increase your rows to over 1,700 examples. Even then you might get bad misclassification because a lot of what each company writes appears to be similar to the others. The model gets confused.

    Here's an idea. I'm guessing you're looking at which grants to go after for your organization. So look at your historical data and create two classes like "Go for this grant" or "skip this grant." You can then heavily weight the organizations you got the grants from and then feed it into a classifier. Hopefully then the model will be able to learn the patterns for a grant that you want to go after.

  • nathan Member Posts: 7 Contributor I

    What do you mean when you say I have one piece of text for each organization? I'm really confused how adding more text to each organization wouldn't provide more data for the model to find patterns.

    I'm not looking to categorize by 'Go for this grant' or 'skip this grant'. I'm trying to categorize the grants we've already approved into groups like: community gardens, neighborhood cleanup, youth education, etc. Is this just outside of RapidMiner's capability with only 86 instances?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You don't have enough training data to do this. 86 training rows for 60+ different classes (organizations) will not work, regardless of whether you use RapidMiner or something else.

    Why don't you add those categories like "community garden", "youth education", etc. into the data set and use that as your label instead? If that's fewer than 5 or so categories, then you might be able to get some better results. Still, 86 training rows is pretty low.
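
As a rough illustration of using the category as the label, here is a small multinomial Naive Bayes written in plain Python. It is a sketch under stated assumptions, not RapidMiner's Naive Bayes operator: the four training rows are invented, tokenization is a bare whitespace split, and add-one smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """rows: list of (label, text). Collect the counts multinomial NB needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in rows:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocabulary = model
    total_rows = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_rows)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled rows: the category, not the organization, is the label.
rows = [
    ("community gardens", "garden mulch plants"),
    ("community gardens", "plants soil garden beds"),
    ("youth education", "youth tutoring school"),
    ("youth education", "school reading youth"),
]
model = train_naive_bayes(rows)
print(predict(model, "new garden beds and mulch"))
```

With a handful of categories and several rows per category, each class has enough word counts to be distinguishable, which is the point of switching labels.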

  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn

    Hi Nathan:

    Thanks for making your source data available. I have a few suggestions that I hope will be helpful.

    Given that the dataset is fairly small, and that there are only three data fields, I think you could get to where you want to be if you were to further classify and segment your data with additional relevant "data adjectives". Essentially, this will greatly help you classify the requests you receive moving forward from organizations you are hearing from for the first time. I think the classifications you make yourself will in the long run be more relevant than what RapidMiner (or any other Data Mining Tool) would provide.

    Yes, this will be time consuming and sometimes frustrating, but the segmentation choices you make yourself (and in collaboration with your fellow stakeholders) will, in my opinion, be very valuable to your organization in the long run, and will be very informative to you in the shorter term about the organizations you come in contact with.

    There are many BI and data visualization tools (like Tableau, for example) that could produce great reporting and visually appealing analytical deliverables once you have a rich and descriptive data model in place.

    As I said before, once you have defined a clear segmentation map of the requests you receive, and communicated this "segmentation map" to the people you work with, it will be much easier to classify organizations you come in contact with in the future, and these future classifications will be aligned with the segmentation policies you have developed with your fellow stakeholders.

    This is similar to how manufacturers and retailers segment their products into categories and sub-categories as part of analyzing sales of products. Data Mining / Data Science can be very helpful, but it is not a replacement for a rich and descriptive data model, and you have an opportunity to do your organization a real service by taking on the challenge of building a "segmentation map" for classifying the organizations you work with. Once the data is rich and descriptive, Data Mining / Data Science can often help take your understanding of your data to the next level.

    I spent a few minutes on a very rough draft attempt to classify some of the organizations in your data, which I think you (and your fellow stakeholders) could greatly improve upon. I've attached my attempt to this post as a .csv file, as the RapidMiner Studio forum does not allow posts in Excel format. You should be able to open this file in Excel and then save it as an Excel file.

    Good luck in your worthwhile work, and best wishes.

    Michael Martin

  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn

    Hi Nathan:

    One last thought related to my note of yesterday: in addition to adding "Project Classification" fields to your data model, you could also add one or more "Keyword Search" columns to your data model. I have attached a .csv file with a "Keyword Search" column to your original data.
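
A "Keyword Search" column of this kind can be prototyped with a few lines of Python before committing to a full data model. The keyword lists below are hypothetical (only "garden", "mulch", and "plant" echo examples mentioned earlier in the thread), and plain substring matching is a deliberate simplification that can over-match.

```python
# Hypothetical keyword lists; real ones would come from the people who know the grants.
CATEGORY_KEYWORDS = {
    "community gardens": ["garden", "mulch", "plant"],
    "neighborhood cleanup": ["cleanup", "litter", "trash"],
    "youth education": ["youth", "tutoring", "school"],
}

def keyword_matches(text, category_keywords=CATEGORY_KEYWORDS):
    """Return the sorted categories whose keywords occur in the text.

    Matching is by substring, so "plant" also matches "plants" or "planting",
    but it would match unrelated words such as "implant" as well.
    """
    lowered = text.lower()
    return sorted(
        category
        for category, keywords in category_keywords.items()
        if any(keyword in lowered for keyword in keywords)
    )

# Hypothetical application texts standing in for the text column.
applications = [
    "We need mulch and plants for raised beds",
    "Youth volunteers pick up litter",
    "General operating support",
]
for text in applications:
    print(keyword_matches(text), "<-", text)
```

Rows that match no category, or several, are the ones worth reviewing by hand when refining the segmentation.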

    At the risk of being redundant, I believe that subject matter classification drives understanding and collaboration. Think of all the classification models related to biology, chemistry, psychology, and medicine - and they can be changed in light of new knowledge. A good classification model + good data science can sometimes greatly enhance the meaning and utility of the data.

    Best wishes,

    Michael Martin

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Another approach here would be to ignore the organization entirely and just do unsupervised clustering. This can definitely be done using only the examples that you have, and you can specify the number of clusters you want. However, you will have little to no control over what clusters get generated as a result and they may not group things in the ways that you would like to think about them.
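
For a feel of what unsupervised clustering does with text, here is a tiny k-means over word-count vectors in plain Python. It is a sketch only: the four texts are invented, the `kmeans` function is a bare-bones version written for this example rather than RapidMiner's clustering operator, and real data would need proper tokenization and weighting first.

```python
import random
from collections import Counter

def bag_of_words(text, vocabulary):
    """Turn a text into a list of word counts ordered by the vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Assign each point to one of k clusters; returns a list of cluster ids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Hypothetical texts; no organization or category label is used anywhere.
texts = [
    "garden mulch plants",
    "plants garden soil",
    "youth tutoring school",
    "school youth reading",
]
vocabulary = sorted({w for t in texts for w in t.split()})
points = [bag_of_words(t, vocabulary) for t in texts]
clusters = kmeans(points, k=2)
print(list(zip(clusters, texts)))
```

The cluster ids carry no meaning of their own. Someone still has to read the grouped texts and decide whether a cluster corresponds to something like "community gardens", which is the lack of control described above.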

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts