Process Documents operator ends up with more documents than example set

暴徒 Member Posts: 37 Contributor I
edited November 2018 in Help
I have an example set with 733 examples and a text attribute. I convert it to TF-IDF using the Data to Documents and Process Documents operators. Inside Process Documents I clean up and tokenize the text, then use Write Document, and I end up with 1466 files.

Why do I end up with twice as many files as examples?
How do I ensure that 1 document in = 1 document out?

I have extract content set to negate every tag possible, but I end up with 2 outputs for every 1 input. From a brief look, file 734 seems similar to file 1, so it's as if the whole thing loops twice for some reason.
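One way to test the suspicion that the whole set is written twice is to compare each output file with the one 733 positions later. The following is a minimal Python sketch, not a RapidMiner feature. It assumes the written files sit in one directory and that sorting their names (with numbers compared numerically) reproduces the order in which they were written. The helper names `natural_key`, `file_digest` and `find_repeated_halves` are hypothetical.

```python
import hashlib
import re
from pathlib import Path


def natural_key(path):
    """Sort key that compares digit runs in the file name as integers."""
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", path.name)]


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_repeated_halves(directory, n_examples):
    """Pair file i with file i + n_examples and flag byte-identical pairs.

    Assumption: name order matches write order, so the second half of the
    listing corresponds to the suspected second pass.
    """
    files = sorted((p for p in Path(directory).iterdir() if p.is_file()), key=natural_key)
    first_half, second_half = files[:n_examples], files[n_examples:]
    return [
        (a.name, b.name, file_digest(a) == file_digest(b))
        for a, b in zip(first_half, second_half)
    ]
```

Calling `find_repeated_halves(output_dir, 733)` on the real output directory would show whether each pair is an exact copy. Pairs that come back as different would suggest the two files were written at different stages of pre-processing rather than by a plain repeat of the same step.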

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Can you post up the XML of your process (or a simplified version) showing it?

    That way I can properly see what's happening.

    Thanks
  • 暴徒 Member Posts: 37 Contributor I
    I can't share the data, but here is the process. It takes a text field from an example set, pre-processes it, and writes the documents to disk before outputting a TF-IDF example set.





    <macros/>





    Need to select the attributes in the pro settings of the Data to Documents operator






    Process the entire dataset as if it were DIT data and use the merged stop phrase list to remove the boilerplate




    Uses the regex from the RapidMiner forum to split where capitalised letters appear in the middle of words, because punctuation is missing from the original text. It finds capitalised words and replaces them with a space and the captured text
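The forum regex itself is not reproduced in this post, so the following is only an illustrative Python sketch of the same idea: find a capital letter that directly follows a lowercase letter and replace it with a space plus the captured letter. The pattern and the name `split_on_mid_word_capitals` are assumptions, not the expression used in the process.

```python
import re

# Matches an uppercase letter that directly follows a lowercase letter,
# i.e. a capital in the middle of a run of letters (ASCII letters only).
MID_WORD_CAPITAL = re.compile(r"(?<=[a-z])([A-Z])")


def split_on_mid_word_capitals(text):
    """Insert a space before each capital letter that follows a lowercase letter."""
    return MID_WORD_CAPITAL.sub(r" \1", text)
```

For example, `split_on_mid_word_capitals("the report endedThe next one")` yields `"the report ended The next one"`. Acronyms such as `USA` are left alone, but legitimately mixed-case words such as `iPhone` would also be split, which is a trade-off of this simple rule.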





    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.002" expanded="true" height="76" name="Filter Stop Phrases" width="90" x="450" y="30">




    Remove �� from the start of some words (sometimes ����) and replace it with just the word found after it
    Regular Expression Replacement
    ��{1,2}() $1
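The original junk characters did not survive in this post, so the expression above cannot be reproduced exactly. As a rough Python sketch of the same kind of replacement, assume the junk is the Unicode replacement character U+FFFD. Both that assumption and the name `strip_junk_prefix` are hypothetical.

```python
import re

# Assumption: the junk characters are U+FFFD (the Unicode replacement character).
# One to four of them followed by a word; the word is captured in group 1.
JUNK_PREFIX = re.compile(r"\uFFFD{1,4}(\w+)")


def strip_junk_prefix(text):
    """Drop the junk characters in front of a word and keep only the word."""
    return JUNK_PREFIX.sub(r"\1", text)
```

If the real characters are something else, only the `\uFFFD` part of the pattern would need to change. The capture group and the `\1` replacement correspond to the `$1` in the description above.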





    <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Stop Eng (2)" width="90" x="313" y="187"/>












    <connect from_op="Replace Regex" from_port="document" to_op="Stop Eng (2)" to_port="document"/>








    <connect from_op="Retrieve" from_port="output" to_op="Data to Documents" to_port="example set"/>







