Process Documents operator ends up with more documents than example set

暴徒 · February 2016

I have an example set with 733 examples and a text attribute. I convert it to TF-IDF using the Data to Documents and Process Documents operators. Inside Process Documents I clean up and tokenize the text, then use Write Document, and I end up with 1466 files.

Why do I end up with twice as many files as examples?
How do I ensure that 1 document in = 1 document out?

I have extract content set to negate every tag possible, but I still end up with 2 outputs for every 1 input. From a brief look, file 734 seems similar to file 1, so it's as if the whole thing loops twice for some reason.
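One way to confirm the "loops twice" suspicion is to check the written files for exact duplicates. The sketch below is only an illustration: the function name `find_duplicate_files` and the folder path in the usage note are hypothetical, and it only catches byte-identical files, so files that are merely similar would not be grouped.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(directory):
    """Group files in `directory` whose contents are byte-identical.

    Returns a list of lists of file names; each inner list holds
    two or more files that share the same SHA-256 digest.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_duplicate_files("written_documents")` (a hypothetical output folder) on the 1466 files would show whether each file has exactly one twin, which would point to the documents being written twice rather than being split.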

JEdward · February 2016

Can you post up the XML of your process (or a simplified version) showing it?

That way I can properly see what's happening.

Thanks

暴徒 · February 2016

I can't share the data, but here is the process. It takes a text field from an example set, pre-processes it, and writes the documents to disk before outputting a TF-IDF example set.






<macros/>





Need to select the attributes in the pro settings of the Data to Documents operator






Process the entire dataset as if it were DIT data and use the merged stop phrase list to remove the boilerplate




Uses the regex from the RapidMiner forum to split where capitalised letters appear in the middle of words, because punctuation is missing from the original text. It finds capitalised words and replaces them with a space and the captured text
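For readers without the process XML, the splitting step described above can be sketched outside RapidMiner. The actual regex from the forum is not shown in this post, so the pattern below is an assumption, and `split_on_inner_capitals` is a hypothetical helper name. It inserts a space wherever a lowercase letter is directly followed by an uppercase one.

```python
import re

# Assumed pattern: a lowercase letter immediately followed by an
# uppercase letter marks a missing sentence or word boundary.
_INNER_CAPITAL = re.compile(r"([a-z])([A-Z])")


def split_on_inner_capitals(text):
    """Insert a space before a capital letter that follows a lowercase letter."""
    return _INNER_CAPITAL.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_on_inner_capitals("the report endedThe next section"))
```

This only handles ASCII letters and would also split legitimate mixed-case names, so it is a rough approximation of the cleanup rather than a drop-in replacement for the operator.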





<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.002" expanded="true" height="76" name="Filter Stop Phrases" width="90" x="450" y="30">




Remove �� from the start of some words and sometimes ���� and replace it with just the word found after it
Regular Expression Replacement
��{1,2}() $1
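The replacement above did not survive the paste intact, so the following is only a guess at its intent: strip a run of undecodable characters from the front of a word and keep the word. It assumes the debris shows up as the Unicode replacement character U+FFFD, and `strip_leading_debris` is a hypothetical name.

```python
import re

# Assumption: the unreadable characters are U+FFFD (the Unicode
# replacement character), appearing in runs directly before a word.
_DEBRIS_BEFORE_WORD = re.compile("\ufffd+(\\w+)")


def strip_leading_debris(text):
    """Drop runs of U+FFFD that prefix a word, keeping the word itself."""
    return _DEBRIS_BEFORE_WORD.sub(r"\1", text)
```

If the original text used a different byte sequence for the debris, the character class would need to be adjusted to match it.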




<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Stop Eng (2)" width="90" x="313" y="187"/>












<connect from_op="Replace Regex" from_port="document" to_op="Stop Eng (2)" to_port="document"/>








<connect from_op="Retrieve" from_port="output" to_op="Data to Documents" to_port="example set"/>


Altair RapidMiner Community
