Cut Document

CaptainChaos Member Posts: 17 Maven
edited June 2019 in Help
Hi Guys,

I think I have a rather trivial problem, but I couldn't figure out how to solve it. :'(
I am working with the Reuters data set. I have a stemmed version consisting of one big document that contains all the other documents. So it is a big .txt file in which the beginning and end of each document is marked by the word "reuter". I tried to use the "Cut Document" operator to split them. As the query expression I used "reuters". The problem is that all documents now have the same name (label), which makes it hard to work with them.

Does anybody know how to give different names to all documents, like 1, 2, 3, 4, 5 for example, and then write/export them to Excel or a database?

Thanks in advance
Cheers

Answers

  • colocolo Member Posts: 236 Maven
    Hi,

    here is a little example of how you could write the single documents as files:

    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">

    <operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">

    <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">

    <operator activated="true" class="loop_collection" compatibility="5.1.011" expanded="true" height="76" name="Loop Collection" width="90" x="313" y="30">

    <operator activated="true" class="text:write_document" compatibility="5.1.001" expanded="true" height="60" name="Write Document" width="90" x="45" y="30">

    If you prefer a list-based output like Excel or a database, this is the way to go:

    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">

    <operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">

    <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">

    <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">

    <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
    <operator activated="true" class="generate_attributes" compatibility="5.1.011" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="30">

    <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="581" y="165">

    Hope these examples help you a little. Feel free to ask if you have further questions.

    Regards
    Matthias
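
For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner process above, only an analogue of the two variants: one file per document, or one table with an id column. It assumes the big text has already been split into a list of strings. The function names (`number_documents`, `write_as_files`, `write_as_csv`) and the file naming scheme are made up for illustration, and CSV stands in for the Excel export because Excel can open it directly.

```python
import csv
from pathlib import Path


def number_documents(texts):
    """Pair each document with a running id, starting at 1."""
    return [(doc_id, text) for doc_id, text in enumerate(texts, start=1)]


def write_as_files(numbered, out_dir):
    """Write one .txt file per document, named after its id (1.txt, 2.txt, ...)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id, text in numbered:
        (out / f"{doc_id}.txt").write_text(text, encoding="utf-8")


def write_as_csv(numbered, csv_path):
    """Write all documents into one table with the columns id and text."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text"])
        writer.writerows(numbered)
```

Calling `number_documents(["first doc", "second doc"])` gives each text its own id, which is the role Generate ID plays in the second process.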
  • CaptainChaos Member Posts: 17 Maven
    Hi Matthias,

    First of all, thank you very much for your help; my model now works a lot better than before.
    But I would like to ask you one more question. Below I copied your code and marked one line that differs from mine. Could you explain that line to me?

    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">

    <operator activated="true" class="text:create_document" compatibility="5.1.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">

    <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">

    \s*(.*?)\s* --> Plural

    <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">

    <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
    <operator activated="true" class="generate_attributes" compatibility="5.1.011" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="30">

    <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="581" y="165">

  • CaptainChaos Member Posts: 17 Maven
    One more question: how can I put my code in a separate box like you did? Thanks a lot.

    Kind regards, Roberto
  • colocolo Member Posts: 236 Maven
    Hi Roberto,

    you can use the code style by adding CODE-tags around it. It's the third symbol from the right just above the smileys.

    The highlighted line is my cut expression. You said there is a word marking the beginning and the end of each document. Since I manually typed some example document contents, I simply used "marker" for this. I think it should be "reuters" in your case. The regular expression used to cut the text collects anything between two marker words (the first capturing group) and also uses \s* to cut off whitespace surrounding the content (for example, a newline between the marker word and the beginning of the actual document content).

    Hope this clarifies things.

    Regards
    Matthias
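
To make the explanation concrete, here is a small Python sketch of that kind of cut expression. It is an illustration with Python's `re` module, not RapidMiner's Cut Document operator. The full pattern `marker\s*(.*?)\s*marker` is assembled from the description above and is an assumption, as is the sample text; the sketch also assumes that every document is wrapped in its own pair of marker words.

```python
import re

# Assumed pattern: marker word, optional whitespace, the document content
# (capturing group 1, as short as possible), optional whitespace, marker word.
# re.DOTALL lets "." match line breaks so multi-line documents are captured.
CUT_PATTERN = re.compile(r"marker\s*(.*?)\s*marker", re.DOTALL)


def cut_documents(text):
    """Return the content between each pair of marker words, whitespace trimmed."""
    return CUT_PATTERN.findall(text)


# Made-up sample: two documents, each wrapped in its own pair of markers.
sample = (
    "marker\nfirst document\nmarker\n"
    "marker\n  second document, line 1\nline 2\nmarker\n"
)
documents = cut_documents(sample)
# documents == ["first document", "second document, line 1\nline 2"]
```

One caveat of this approach: if the marker word also occurs inside a document's text, the document is cut at that point, so a marker that cannot appear in normal content is safer.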
  • CaptainChaos Member Posts: 17 Maven
    Yes, it did, thanks a lot. By the way, do you know how I could tell RapidMiner in the next step to treat each row in the Excel sheet as a separate document, so that I could do some Data to Similarity or clustering? Thanks in advance for your time :)
  • Flake Member Posts: 13 Contributor II
    @Roberto, having the text in each row is perfect for further processing, e.g. clustering. You may refer to this process. The Excel file I read contains two columns (id, text), and each row holds the text of one document.

    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">

    <operator activated="true" class="read_excel" compatibility="5.1.011" expanded="true" height="60" name="Read Excel" width="90" x="149" y="177">

    <operator activated="true" class="text:process_document_from_data" compatibility="5.1.002" expanded="true" height="76" name="ProcessDocs Train" width="90" x="514" y="75">

    <operator activated="true" class="text:transform_cases" compatibility="5.1.002" expanded="true" height="60" name="Transform Cases" width="90" x="204" y="275"/>
    <operator activated="true" class="text:replace_tokens" compatibility="5.1.002" expanded="true" height="60" name="Replace Tokens" width="90" x="259" y="152">

    <operator activated="true" class="text:tokenize" compatibility="5.1.002" expanded="true" height="60" name="Tokenize" width="90" x="315" y="30"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="450" y="30"/>
    <operator activated="true" class="text:stem_snowball" compatibility="5.1.002" expanded="true" height="60" name="Stem (Snowball)" width="90" x="585" y="30"/>
    <operator activated="true" class="text:filter_by_length" compatibility="5.1.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="720" y="30">

    <operator activated="true" class="k_means" compatibility="5.1.011" expanded="true" height="76" name="Clustering" width="90" x="849" y="75">
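
As a rough illustration of what "each row is one document" means for similarity, here is a minimal Python sketch. It is not the RapidMiner process above and leaves out stemming and the k-means step. It only mirrors the lower-casing, tokenizing, stop word and length filtering, and then compares two rows with cosine similarity. The stop word list, the minimum token length, and the sample rows are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text, min_length=3):
    """Lower-case, split on non-letters, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]


def term_vector(text):
    """Bag-of-words vector: token -> number of occurrences."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up rows in the (id, text) layout described above.
rows = [
    (1, "Oil prices rise in the market"),
    (2, "The market sees oil prices rise"),
    (3, "Wheat harvest is delayed"),
]
vectors = {doc_id: term_vector(text) for doc_id, text in rows}
```

With these rows, documents 1 and 2 share most of their terms and score close to 1, while documents 1 and 3 share none and score 0.0. A clustering step would group rows based on vectors like these.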













