Text mining and word counting problem

PatrickHou (Member, Posts: 6, Contributor I)
edited December 2018 in Help

Hi

I'm new to RapidMiner and am analyzing several .txt documents. Suppose I have found the 20 most frequent words and want to know (and only know) how many times each of them appears in each document. Can someone give me some ideas?

I also have a problem: "united", "states", and "united_states" all appear in my result, but I can't simply replace them, because not every "united" is part of "united states". How can I pull out the "united_states" occurrences without also counting them under "united" and "states"?

Thanks

Patrick

Answers

  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn)

    For your first question, use Process Documents, supply a specific wordlist (your 20 words), and compute the word vector using Term Occurrences.
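    Outside RapidMiner, the same idea (counting only a fixed word list, separately for each document) can be sketched in plain Python. This is an illustration under stated assumptions, not what Process Documents does internally: the function name `term_occurrences`, the simple letters-only tokenizer, and the two sample documents are all hypothetical.

```python
import re
from collections import Counter

def term_occurrences(documents, wordlist):
    """Return {document name: {word: count}} for the words in wordlist only."""
    wanted = {w.lower() for w in wordlist}
    result = {}
    for name, text in documents.items():
        # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(t for t in tokens if t in wanted)
        # Report every wanted word, including those with zero occurrences.
        result[name] = {w: counts[w] for w in sorted(wanted)}
    return result

# Hypothetical sample data standing in for the real text files.
docs = {
    "doc1.txt": "The United States and the united front.",
    "doc2.txt": "States of matter; united we stand.",
}
table = term_occurrences(docs, ["united", "states"])
for name, row in table.items():
    print(name, row)
```

    Each row corresponds to one document, which is the per-document breakdown asked about, as opposed to a single total across all files.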

    For your second question, you can use Generate N-Grams after you Tokenize (and do any other text preprocessing). That gives you a token for "united_states" that is separate from both "united" and "states".
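    As a rough picture of what bigram generation produces, here is a minimal Python sketch. It assumes, based on the tokens reported in this thread, that adjacent tokens are joined with an underscore and that the single-word tokens are kept alongside the bigrams. The function name `tokens_with_bigrams` and the tokenizer are hypothetical.

```python
import re

def tokens_with_bigrams(text):
    """Return the single-word tokens followed by underscore-joined bigrams."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair every token with its right-hand neighbour.
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

out = tokens_with_bigrams("United States trade")
print(out)
# ['united', 'states', 'trade', 'united_states', 'states_trade']
```

    Note that every adjacent pair becomes a bigram, not only meaningful phrases, so some filtering afterwards is usually needed.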

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • PatrickHou (Member, Posts: 6, Contributor I)

    Thanks for the reply!

    I have already used Term Occurrences, but that gave me the overall occurrences of each word, and I want the occurrences of each word in each document (I have about 50 files).

    For the second question, does that mean the "united" and "states" counts are unrelated to "united_states"?

    Patrick

  • PatrickHou (Member, Posts: 6, Contributor I)

    I looked into the documentation, and it seems that when I use the n-gram operator it combines all adjacent words, whether or not they are related. I think that means I need to filter or prune those words, but how?

  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn)

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data. Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction. So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".
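    The subtraction can be written out as a small Python sketch. The counts for "united" and "united_states" are the ones from the example above; the count for "states" and the function name `exclusive_counts` are hypothetical, added only for illustration.

```python
def exclusive_counts(counts, bigram):
    """Subtract a bigram's count from the counts of its two component words."""
    first, second = bigram.split("_")
    n = counts.get(bigram, 0)
    return {
        first: counts.get(first, 0) - n,
        second: counts.get(second, 0) - n,
    }

# "united" and "united_states" come from the example above;
# the value for "states" is a made-up number.
counts = {"united": 10, "states": 7, "united_states": 6}
standalone = exclusive_counts(counts, "united_states")
print(standalone)
# {'united': 4, 'states': 1}
```

    The same function can be applied to each document's row of counts to get per-document standalone figures.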

    Brian T.
  • PatrickHou (Member, Posts: 6, Contributor I)

    I found that the stopwords (dictionary) filter does the trick: inside Process Documents, I manually add the words I don't need. That is enough for the small case I'm working on, but I'll keep looking for operators that can deal with this problem.

    Thank you.

