"Split a document in its sentences"

mahdilashkari · May 2013

This has been bothering me because I think it should be simple, but I have not found a solution.
I have a document and I want to split it into its sentences. I used the Tokenize operator, but it only gives me a colored document. What I want is to extract the sentences and put each one into an example set as its own example. My process is listed below.
Thanks a lot!

Skirzynski · May 2013

The tokens are colored to give you a preview of the tokenization. If you want to create an example set, you have to create a word list (which in your case is a sentence list), and the "Wordlist to Data" operator will then turn it into an example set. See the process below:
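
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch (it is not the RapidMiner process referred to above). It assumes a deliberately naive sentence boundary, namely `.`, `!` or `?` followed by whitespace, and the function name `sentences_to_rows` is made up for illustration. Each sentence becomes one row, roughly what a sentence list converted to an example set would look like.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
# This is a simplification and will mis-split abbreviations.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_to_rows(text):
    """Return one row (a dict) per sentence, like a one-column example set."""
    parts = SENTENCE_END.split(text.strip())
    return [{"sentence": part} for part in parts if part]


rows = sentences_to_rows("I have a document. I want its sentences! Is that simple?")
for row in rows:
    print(row["sentence"])
```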

mahdilashkari · June 2013

Thanks, Marcin.
Your solution solved my problem, but I want to have the main text as an attribute for each split sentence. What should I do?
Thanks a lot again.

MariusHelf · June 2013

In this case you have to:
- use an example set that contains all documents
- add an ID to the data
- split the data using Cut Document*
- activate keep_text in Process Documents
- Join the original document data to the split data

*) In Cut Document, you have to manually specify a regular expression that detects linguistic sentences. As you can see, the one I invented is not yet perfect.
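
The steps above (add an ID, split, keep the original text, join) can be sketched in plain Python as follows. This is only an illustration under assumptions: the regular expression is a simple stand-in and not the one used in the process, and `split_with_source` is a hypothetical helper name.

```python
import re

# Assumption: same naive boundary rule, ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_with_source(documents):
    """Split each document into sentences and attach the source text."""
    # Add an ID to every document (IDs start at 1).
    by_id = {doc_id: text for doc_id, text in enumerate(documents, start=1)}

    # Split each document into sentences, keeping the document ID.
    split_rows = [
        {"id": doc_id, "sentence": sentence}
        for doc_id, text in by_id.items()
        for sentence in SENTENCE_END.split(text.strip())
        if sentence
    ]

    # Join the original text back onto each sentence row via the ID.
    return [dict(row, text=by_id[row["id"]]) for row in split_rows]


docs = ["First one. Second one.", "Only one here."]
for row in split_with_source(docs):
    print(row["id"], "|", row["sentence"], "|", row["text"])
```

Like the expression mentioned in the footnote, this pattern is not perfect: a text such as "Dr. Smith arrived." is split after "Dr." because the abbreviation looks like a sentence end.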

<connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>

<parameter key="query_type" value="Regular Expression"/>
