"Split a document in its sentences"

mahdilashkari Member Posts: 3 Contributor I
edited June 2019 in Help
This has been bothering me because I think it should be simple, but I have not found a solution.
I have a document and I want to split it into its sentences. I used the tokenizer, but it only gives me a colored document. What I want is to extract the sentences and put each one into an example set as a separate example. My process is listed below.
Thanks a lot!

Answers

  • Skirzynski Member Posts: 164 Maven
    The tokens are colored to give you a preview of the tokenization. To get an example set, you have to create a word list (in your case, a sentence list) and pass it to the "Wordlist to Data" operator, which produces the example set. See the process below:

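For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "one sentence per example". It is not the process from the answer above. The boundary rule (a `.`, `!` or `?` followed by whitespace) and the names `split_sentences` and `sentences_to_rows` are assumptions made for illustration only.

```python
import re

# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
# This is a rough rule, not how RapidMiner detects sentences.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences of `text`, in their original order."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def sentences_to_rows(text):
    """Build one row (a dict) per sentence, similar to one example per sentence."""
    return [{"sentence": s} for s in split_sentences(text)]


if __name__ == "__main__":
    for row in sentences_to_rows("I have a document. I want to split it! Is that simple?"):
        print(row)
```

A rule this simple will wrongly split after abbreviations such as "Dr.", so treat it as a starting point rather than a finished sentence detector.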
  • mahdilashkari Member Posts: 3 Contributor I
    Thanks, Marcin.
    Your solution solved my problem, but I would also like to have the original text as an attribute for each split sentence. What should I do?
    Thanks a lot again.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    In this case you have to:
    - use an example set that contains all documents
    - add an ID to the data
    - split the data using Cut Document*
    - activate keep_text in Process Documents
    - Join the original document data to the split data

    *) In Cut Document, you have to manually specify a regular expression that detects linguistic sentences. As you can see, the one I invented is not yet perfect.

    <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <parameter key="query_type" value="Regular Expression"/>

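The steps listed in this answer (add an ID, split, keep the original text, join) can be sketched in plain Python as follows. This is only an illustration of the data flow, not the RapidMiner process itself. The function name `split_with_source`, the column names, and the sentence regular expression are assumptions, and the regular expression is as imperfect as the answer warns.

```python
import re

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_with_source(documents):
    """Return one row per sentence, each carrying its document ID and full text.

    `documents` is a list of strings, one string per document.
    """
    # Collect all documents in one table and give each an ID.
    docs = [{"id": i, "text": t} for i, t in enumerate(documents, start=1)]

    # Split every document into sentences, carrying the ID along.
    pieces = [
        {"id": d["id"], "sentence": s}
        for d in docs
        for s in SENTENCE_END.split(d["text"].strip())
        if s
    ]

    # Join the original document text back onto each sentence via the ID.
    text_by_id = {d["id"]: d["text"] for d in docs}
    return [dict(p, text=text_by_id[p["id"]]) for p in pieces]


if __name__ == "__main__":
    for row in split_with_source(["First one. Second one.", "Only one here?"]):
        print(row)
```

The ID is what makes the final join possible: once a document has been cut into pieces, the ID is the only reliable link between a sentence and the text it came from.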