Self-modifying stop-list/n-gram filter in text mining?

batstache611 Member Posts: 45 Guru
edited December 2018 in Help

I'm building a process that web-scrapes the websites of a selected industry and hunts for industry-specific keywords in the collected text. The difficulty I'm facing is that when I look for phrases or n-grams, a lot of them are just rubbish. In the final output, I would like to see only n-grams that contain those specific keywords, followed or preceded (in certain cases) by words that would otherwise have been filtered out or that provide no valuable insight on their own.

E.g. ship-building industries -> sonar_systems. Normally I would not be interested in the word "systems", as it does very little to give me something meaningful with respect to the industry, but the word sonar, and the n-gram sonar_systems, is decently valuable to me from an analysis point of view.

So basically I could either have a stoplist that somehow populates itself by looking at intermediate results from Process Documents from Data/Files and then passes only the relevant n-grams on to further analysis (text clustering, association rules, etc.), OR I could find some clever way of filtering certain n-grams before they are passed to other operators.
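Outside of RapidMiner, the filtering rule I have in mind is simple to state: keep an n-gram only if at least one of its tokens is a domain keyword. A minimal Python sketch of that rule — the keyword set and sample n-grams below are made up for illustration, not taken from my actual process:

```python
# Hypothetical industry keyword set (illustrative only).
KEYWORDS = {"sonar", "radar", "hull"}

def keep_ngram(ngram: str) -> bool:
    """Keep an n-gram only if at least one of its tokens is a domain keyword.

    Tokens are assumed to be joined with "_", as in RapidMiner's
    Generate n-Grams (Terms) output, e.g. "sonar_systems".
    """
    return any(token in KEYWORDS for token in ngram.split("_"))

ngrams = ["sonar_systems", "click_here", "hull_design", "privacy_policy"]
filtered = [g for g in ngrams if keep_ngram(g)]
# filtered == ["sonar_systems", "hull_design"]
```

Note that "systems" on its own would still be dropped; it survives only as part of the keyword-bearing n-gram, which is exactly the behaviour described above.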

Any ideas on how to do this? Thank you very much!

P.S. I do not have an end-to-end process yet, just the text-mining part of it. Configuring a crawler with exception handling isn't much of a problem, and if I put this on a server, I can build some kind of app around it.







<运营商激活= " true " class = "process" compatibility="7.5.001" expanded="true" name="Process">

<运营商激活= " true " class = "text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="391">









<运营商激活= " true " class = "text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="85"/>
<运营商激活= " true " class = "text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="85"/>
<运营商激活= " true " class = "text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="85"/>
<运营商激活= " true " class = "text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="514" y="85">



<运营商激活= " true " class = "text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="648" y="85"/>
<运营商激活= " true " class = "text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="782" y="85">
<参数键= "文件" value = " C: \ \帕里\ Documen用户ts\Odin\Aero Stoplist.txt"/>


<运营商激活= " true " class = "text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (2)" width="90" x="916" y="85">
<参数键= "文件" value = " C: \ \帕里\ Documen用户ts\Odin\Aero n-gram Stoplist.txt"/>














<运营商激活= " true " class = "numerical_to_binominal" compatibility="7.5.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="340"/>
<运营商激活= " true " class = "fp_growth" compatibility="7.5.001" expanded="true" height="82" name="FP-Growth" width="90" x="514" y="238">




<运营商激活= " true " class = "create_association_rules" compatibility="7.5.001" expanded="true" height="82" name="Create Association Rules" width="90" x="648" y="238">



<运营商激活= " true " class = "converters:rules_2_example_set" compatibility="0.3.000" expanded="true" height="82" name="Association Rules to ExampleSet" width="90" x="782" y="34"/>

















Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I did a self-populating stopword list for a customer two years ago, but I can't find it. If memory serves, we text-mined a corpus of insurance documents with pruning and used n-grams. From there we exported the Wordlist and saved it to a repository.

    That repository would be looped back in and appended to the stopword list used for the next iteration. It was quite complex, but I do remember using Loops for this, and possibly the Remember and Recall operators too.

  • batstache611 Member Posts: 45 Guru

    Thank you @Thomas_Ott, I will try to build something along the lines of what you said. Have a great day!
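The iterative scheme described in the first answer (mine the corpus, export the wordlist, append its weak terms to the stop list, re-run) can be sketched outside RapidMiner in a few lines of Python. The document-frequency threshold, the whitespace tokenizer, and the function names here are illustrative assumptions, not part of the original process:

```python
from collections import Counter

def tokenize(doc, stoplist):
    """Crude whitespace tokenizer that drops current stopwords (assumption:
    the real process used RapidMiner's Tokenize + Filter Stopwords)."""
    return [t for t in doc.lower().split() if t not in stoplist]

def grow_stoplist(corpus, stoplist, min_df=2, iterations=3):
    """Each pass mines the corpus, then appends every term whose document
    frequency falls below min_df to the stop list used by the next pass --
    the 'loop back and append' step described above."""
    stoplist = set(stoplist)
    for _ in range(iterations):
        df = Counter()
        for doc in corpus:
            df.update(set(tokenize(doc, stoplist)))   # document frequency
        rare = {term for term, n in df.items() if n < min_df}
        if not rare:
            break                     # stop list has stabilized
        stoplist |= rare              # "remember" new stopwords for next pass
    return stoplist

corpus = ["sonar systems are good", "sonar arrays are good"]
stops = grow_stoplist(corpus, {"are"}, min_df=2)
# stops now also contains the one-off terms "systems" and "arrays"
```

In RapidMiner terms, the `stoplist |= rare` line plays the role of the Remember/Recall round trip through the repository: the output of one pass becomes the Filter Stopwords (Dictionary) input of the next.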
