"Split a document into its sentences"
mahdilashkari
Member Posts: 3 Contributor I
This has been bothering me because I think it should be simple, but I have not found a solution for it.
I have a document and I want to split it into its sentences. I used the tokenizer component, but it only gives me a colored document. What I want is to extract the sentences and put each sentence into an example set. My process is listed below.
Thanks a lot!
Answers
Your solution has solved my problem, but I want to have the main text as an attribute for each split sentence. What should I do?
Thanks a lot again.
- use an example set that contains all documents
- add an ID to the data
- split the data using Cut Document*
- activate keep_text in Process Documents
- join the original document data to the split data on the ID
*) In Cut Document, you have to manually specify a regular expression that detects linguistic sentences. As you can see, the one I invented is not yet perfect.
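For readers who want to see the logic of the steps above outside of RapidMiner, here is a minimal Python sketch. It is not the process from this thread: the regular expression, the function names and the column names (`id`, `sentence`, `text`) are all assumptions made for illustration. Like the expression mentioned in the footnote, the pattern is imperfect; it will, for example, split after abbreviations such as "Dr." when a capitalized word follows.

```python
import re

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace
# and then an uppercase letter. This is a rough heuristic, not a
# complete definition of a linguistic sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split one document into a list of sentence strings."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def sentences_with_source(documents):
    """Return one row per sentence, keeping the ID and the full original text.

    This mirrors the steps: add an ID, split the text, then join the
    original text back to each split piece via that ID.
    """
    rows = []
    for doc_id, text in enumerate(documents, start=1):
        for sentence in split_sentences(text):
            rows.append({"id": doc_id, "sentence": sentence, "text": text})
    return rows


if __name__ == "__main__":
    docs = ["It rains. We stay home!", "Is it done? Yes."]
    for row in sentences_with_source(docs):
        print(row["id"], "|", row["sentence"], "|", row["text"])
```

Each resulting row corresponds to one example: the sentence itself plus the main text it came from as an extra attribute.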