"Webmining: keep only documents that contain certain keywords"
Hello,
How can I filter documents from an input stream so that only those matching one or more keywords are kept, where the keywords are stored in a word list or similar? Filter Documents (by Content) is not the right solution because the filter keywords have to be hardcoded into the operator (see example below). I would rather have the filter use a second input stream that can be easily manipulated.
Regards
mrpopper
http://www.sueddeutsche.de/"/>
<parameter key="domain" value="server"/>
<parameter key="max_threads" value="75"/>
Answers
If you store the keywords in a file (e.g. CSV or Excel), you can read them into your process and then either filter the documents once per keyword (Loop Examples or Loop Attributes, depending on how your file is laid out, combined with Filter Documents) or build a single regular expression from the keywords, with the words separated by vertical bars, as in (word1|word2|word3), and use it in the Filter Documents operator (with the parameter "condition" set to "matches").
If you already have your keywords as a word list, you can convert it into an example set and use a similar approach.
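To make the regular-expression idea concrete outside of RapidMiner, here is a minimal Python sketch. It is not the operator's implementation: the function names, the sample keywords, and the sample documents are all made up for illustration, and it assumes "contains" semantics (the pattern may match anywhere in the text) with case-insensitive whole-word matching.

```python
import re

def build_keyword_pattern(keywords):
    """Join keywords into one alternation, like (word1|word2|word3).

    Hypothetical helper: keywords are escaped so they are treated as
    literal text, and wrapped in word boundaries (assumes plain words).
    """
    escaped = [re.escape(k.strip()) for k in keywords if k.strip()]
    if not escaped:
        raise ValueError("keyword list is empty")
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def filter_documents(documents, keywords):
    """Keep only the documents that contain at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [doc for doc in documents if pattern.search(doc)]

# Illustrative data; in practice the keywords would be read from a file.
keywords = ["election", "budget", "climate"]
documents = [
    "The city council approved the budget on Monday.",
    "Local team wins the cup final.",
    "Climate talks resume next week.",
]

kept = filter_documents(documents, keywords)
for doc in kept:
    print(doc)
```

Because the keywords live in an ordinary list, changing the filter means changing the keyword file, not the filter step itself.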
Regards,
Matthias
You could also use a word list of your target keywords as the input word list for the "Process Documents from Files" or "Process Documents from Data" operator. A simple aggregation of a document's vector values then indicates whether at least one of the keywords was contained in the document, which makes filtering the documents very easy.
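As a rough illustration of the idea, and not of how the RapidMiner operators work internally, the following Python sketch builds a term-count vector restricted to the keyword list and sums it per document. The tokenizer, the function names, and the sample data are assumptions made for this example.

```python
import re
from collections import Counter

def keyword_vector(text, word_list):
    """Count how often each word of word_list occurs in text.

    Assumes a simple lowercase, word-character tokenizer; the result has
    one entry per keyword, in the order of word_list.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in word_list]

def keep_matching(documents, word_list):
    """Keep documents whose aggregated (summed) keyword vector is > 0."""
    word_list = [w.lower() for w in word_list]
    kept = []
    for doc in documents:
        vector = keyword_vector(doc, word_list)
        if sum(vector) > 0:  # at least one keyword occurred
            kept.append(doc)
    return kept

# Illustrative data only.
word_list = ["budget", "climate"]
docs = [
    "Budget cuts hit the budget office.",
    "Local team wins the cup final.",
]

print(keyword_vector(docs[0], word_list))
print(keep_matching(docs, word_list))
```

The sum is taken per document over its own vector, so no join against a second data set is needed: a document is kept when its sum is greater than zero.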
Best regards,
Ralf
Thank you for your input, but I cannot follow you. What do you mean by aggregation of the word vector? Which operator should be used? And how do I join the aggregated value against the word vector of my "to be searched" documents?
Regards,
Heiko