"text processing results into decison tree?"

margkwmargkw MemberPosts:14Contributor II
edited June 2019 inHelp
Hey guys. After tokenizing some PDF documents, I now want to use the results to induce a decision tree. Any ideas how this can be done? As far as I can see, the decision tree operator needs an example set as input. How do I generate this from my results?
Thanks in advance

Answers

  • kasper2304kasper2304 MemberPosts:28Contributor II
    Can you give a short description of which nodes you used?
  • margkwmargkw MemberPosts:14Contributor II
    Hi!
    Thanks for the reply.

    I tried to use the "decision tree" operator which is contained in the decision tree induction, under the category modeling. Actually I have no idea on how to do that. I am new.

    To tokenize the PDFs I used the operator "process documents from files", and inside it I used the "tokenize" operator.
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Hi,

    you are probably using one of the Process Documents operators. Those operators output an example set, which you can use to induce a decision tree. However, in text classification you usually have a huge number of attributes (one attribute for each word in your corpus), and decision trees tend to perform quite badly on data with many attributes. You should consider a linear SVM instead.

    If you have problems setting up the process, please post the xml of what you have so far as described in my signature.

    Best regards,
    Marius
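
To make the "one attribute for each word" point concrete, here is a minimal Python sketch of how tokenized documents can be turned into a table with one column per word. It is only an illustration outside RapidMiner: the documents, the folder labels, and the function name `build_example_set` are made up, and it uses plain term counts, which may differ from the vector creation setting in your process.

```python
from collections import Counter

# Hypothetical tokenized documents: (folder label, list of tokens).
documents = [
    ("folder_1", ["invoice", "payment", "invoice"]),
    ("folder_6", ["process", "network", "process"]),
]

def build_example_set(docs):
    """Return (attribute names, rows): one count column per distinct word plus a label."""
    # The attribute list is only known after every document has been read.
    vocabulary = sorted({token for _, tokens in docs for token in tokens})
    rows = []
    for label, tokens in docs:
        counts = Counter(tokens)
        row = {word: counts[word] for word in vocabulary}  # absent words count as 0
        row["label"] = label
        rows.append(row)
    return vocabulary, rows

vocabulary, rows = build_example_set(documents)
print(vocabulary)
print(rows[1])
```

With a real corpus the vocabulary quickly grows to thousands of columns, which is the situation described above.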
  • margkwmargkw MemberPosts:14Contributor II
    Thank you, Marius. I will try that.
  • margkwmargkw MemberPosts:14Contributor II
    Hi marius!!!

    Here is the XML of the process:

    <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
    <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="86" y="177"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="165"/>

    Now I want to insert a decision tree operator. I have saved the example set that the previous process created, and in a different process I did the following, which is not working:

    <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="71" y="281">
    <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="313" y="165"/>

    Any ideas?
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    As stated above, decision trees are far from the optimal choice for text classification. However, please describe in a bit more detail what exactly is not working, what you are expecting and what actually happens.
    Without knowing your expectations and your data it's hard to see where your problems occur.

    Best regards,
    Marius
  • margkwmargkw MemberPosts:14Contributor II
    My data are ten folders of PDF files. With the first process I tokenize them and also apply a stopword filter. After that, I saved the example set that had been created, and I tried to build a decision tree (in the second process) that would help me see some kind of pattern in those documents. For example, if we see the word "process", the word "network" and the word "on line", it will lead us to the 6th folder. I was asked to do that with a decision tree and with association rules.
    Two separate ways.

    I know I have made two separate processes (one for the tokenization and one for the tree). Maybe this could be done with a single one.
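
The kind of pattern described here ("these words lead to that folder") can be read as a rule of the form words -> folder. The following Python sketch is only an illustration of how support and confidence of such a rule could be computed by hand; the documents, folder names, and the helper `rule_stats` are invented for the example, and it is not how RapidMiner's association rule operators are implemented.

```python
# Hypothetical data: the set of words found in each document and its folder.
documents = [
    ({"process", "network", "online"}, "folder_6"),
    ({"process", "network", "server"}, "folder_6"),
    ({"process", "invoice"}, "folder_1"),
    ({"invoice", "payment"}, "folder_1"),
]

def rule_stats(docs, words, folder):
    """Support and confidence of the rule: words -> folder."""
    # Folders of all documents that contain every word of the rule.
    matching = [label for tokens, label in docs if words <= tokens]
    hits = sum(1 for label in matching if label == folder)
    support = hits / len(docs)
    confidence = hits / len(matching) if matching else 0.0
    return support, confidence

print(rule_stats(documents, {"process", "network"}, "folder_6"))
print(rule_stats(documents, {"process"}, "folder_6"))
```

In this toy data the two-word rule always points to folder_6, while "process" alone is less reliable because it also occurs in a folder_1 document.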
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Whether you do it with one or with two processes does not matter. But it is generally a good idea to separate preprocessing and model creation, so you are fine with using two processes.

    To get these indicator attributes/patterns, the Decision Tree is usually a good choice. With so many attributes, however, it may be of limited use. Anyway, it should work. Which error do you get when running the process that creates the tree?

    Instead of using the tree, you could also create a linear SVM model for each of your 10 classes that separates that class from all other classes (keyword "1 vs. all classification"). When inspecting the model you will see a weight associated with each attribute/word. Large absolute values indicate a strong influence of that word: a negative weight points to one class, a positive weight to the other.

    Best regards,
    Marius
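
For readers who want to see the "1 vs. all" idea and the word weights outside RapidMiner, here is a toy Python sketch. It trains one linear model per folder with plain subgradient descent on the regularized hinge loss (the linear SVM objective). Everything in it is an assumption made for illustration: the data, the function name `train_one_vs_all`, and the learning rate, regularization and epoch values. It is not what RapidMiner's SVM operators do internally.

```python
# Hypothetical training data: words per document and the folder it came from.
documents = [
    ({"process", "network", "online"}, "folder_6"),
    ({"network", "process", "server"}, "folder_6"),
    ({"invoice", "payment", "tax"}, "folder_1"),
    ({"payment", "invoice", "bank"}, "folder_1"),
]

def train_one_vs_all(docs, target, epochs=20, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss: `target` vs. the rest."""
    weights = {word: 0.0 for tokens, _ in docs for word in tokens}
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in docs:
            y = 1.0 if label == target else -1.0
            margin = y * (sum(weights[w] for w in tokens) + bias)
            for word in weights:      # regularization shrinks every weight a little
                weights[word] *= 1.0 - lr * lam
            if margin < 1.0:          # hinge loss is active: push toward the label
                for word in tokens:
                    weights[word] += lr * y
                bias += lr * y
    return weights

# One model per folder, each separating that folder from all others.
models = {folder: train_one_vs_all(documents, folder)
          for folder in sorted({label for _, label in documents})}

for folder, weights in models.items():
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    print(folder, ranked[:3])
```

Sorting by absolute weight surfaces the most influential words per folder; the sign tells you whether a word speaks for the folder (positive) or for "all the others" (negative).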
  • margkwmargkw MemberPosts:14Contributor II
    First of all thank you Marius for all the great help.

    When trying to use the decision tree with my example set, I get an error that says that the metadata is underspecified. No idea why this happens.

    I will also try what you indicated again tomorrow. I hope it will work, so I can offer it as an alternative solution.
    Does it work the same way for the association rules?

    Thank you again!
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Just hit the Run button, and your process will run. The Problems view at the bottom only lists *possible* problems; sometimes it is too pessimistic and the process runs fine nevertheless.

    Happy Mining!

    ~Marius
  • margkwmargkw MemberPosts:14Contributor II
    margkw wrote:

    First of all thank you Marius for all the great help.

    When trying to use the decision tree with my example set, I get an error that says that the metadata is underspecified. No idea why this happens.

    Thank you again!
    It also says "cannot check precondition" (????)
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    That only means that the decision tree does not know what kind of data it will receive until you actually execute the process. That is because the text processing has to read all the documents to know which words will be part of the text body, and that is done only when executing the process. Until then, the so-called meta-data is unknown, resulting in the quoted error.

    Just ignore it and try to hit the big blue Run button.
    If an error occurs during actual execution, please let us know and we'll try to give you further assistance.

    Best regards,
    Marius
  • margkwmargkw MemberPosts:14Contributor II
    You were totally right about the decision tree! It worked, thank you. Another text mining question now: while I am tokenizing a file, which is the best filter to use to remove certain words that occur too often? I am already using the "filter stopwords" operator, but I need to remove more. If I use the filter by content operator, can I remove multiple words?

    Edit: I solved this problem by using the operator multiple times. If there is a more efficient way, please let me know.
    Another question: I want to export the results (the wordlist, actually) to xls format. Is that possible? I am searching for such an option but I cannot find it.
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Hi,

    you can experiment with the prune parameters of the Process Documents operator to remove words that appear too often/too seldom.

    Best regards,
    Marius
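
As a rough picture of what pruning by frequency does, here is a small Python sketch that keeps only words appearing in a chosen share of the documents. The data and the names `prune_vocabulary`, `min_fraction` and `max_fraction` are hypothetical and do not correspond to RapidMiner parameter names; the actual prune settings of the Process Documents operator may count things differently.

```python
# Hypothetical tokenized documents.
documents = [
    ["process", "network", "report"],
    ["process", "invoice", "report"],
    ["process", "network", "server"],
    ["process", "payment", "report"],
]

def prune_vocabulary(docs, min_fraction, max_fraction):
    """Keep words whose document frequency lies within [min_fraction, max_fraction]."""
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    n = len(docs)
    return sorted(word for word, df in doc_freq.items()
                  if min_fraction <= df / n <= max_fraction)

kept = prune_vocabulary(documents, min_fraction=0.5, max_fraction=0.8)
print(kept)
```

Here "process" is dropped because it occurs in every document, and the words that occur in only one document are dropped as too rare.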
  • margkwmargkw MemberPosts:14Contributor II
    Is there a way to export the results to Excel format?
  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    You can write an example set, i.e. a data table, to an Excel file with the Write Excel operator.

    Does that help you?

    Best regards,
    Marius
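
If the data ever needs to be exported outside RapidMiner, a plain CSV file is a simple alternative that Excel can open. The sketch below is not the Write Excel operator; it uses Python's standard `csv` module, and the word list contents, column names and the helper `write_word_list` are made up for illustration.

```python
import csv
import io

# Hypothetical word list: (word, total occurrences, number of documents containing it).
word_list = [("network", 12, 4), ("process", 30, 9)]

def write_word_list(rows, handle):
    """Write a header line and one line per word to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["word", "total_occurrences", "document_occurrences"])
    writer.writerows(rows)

# Written to an in-memory buffer here so the example has no side effects.
buffer = io.StringIO()
write_word_list(word_list, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass a handle opened with `open("wordlist.csv", "w", newline="")`; the file name is just an example.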