Text Tokenization Using Regular Expression For Text Mining

onurer007, Member, Posts: 1, Contributor I
edited November 2019 in Help
Hello,
I have a problem and I need your help, please.
I want to tokenize an unstructured document using regular expressions. I have a text file in which each row contains a sentence such as:

1.String1 String2 String3 String4 String5
2.String6 - String7 - -
...
n.String8 - String9 String10 - (assume string2 and string5 don't exist.)

What I want is for the tokenization to extract each word and put the results in an Excel-style table such as:


    S1       S2       S3       S4        S5
1.  String1  String2  String3  String4   String5
2.  String6  -        String7  -         -
3.
..
n.  String8  -        String9  String10  -


Which operators and which regular expression structure can I use in RapidMiner?
Thank you in advance for your help.

Answers

  • MariusHelf, RapidMiner Certified Expert, Member, Posts: 1,869, Unicorn
    If your original document contains the dashes, you can simply read it with Read CSV and specify all blanks (space, tab, etc.) as the column separator.

    Best regards,
    Marius
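
For readers who want to check the idea outside RapidMiner, the same whitespace-splitting approach can be sketched in Python. This is only an illustration, not a RapidMiner process: the function names `tokenize_line` and `lines_to_csv`, the numeric row-label pattern, and the fixed count of five columns are all assumptions taken from the sample rows in the question.

```python
import csv
import io
import re

# Assumed row-label format: digits followed by a dot, e.g. "1." or "2."
ROW_LABEL = re.compile(r"^\s*(\d+)\.\s*")


def tokenize_line(line, n_columns=5):
    """Return [row label, token 1, ..., token n] for one input line."""
    match = ROW_LABEL.match(line)
    label = match.group(1) + "." if match else ""
    rest = (line[match.end():] if match else line).strip()
    # Any run of blanks (spaces, tabs) acts as the column separator.
    tokens = re.split(r"\s+", rest) if rest else []
    # Pad short rows with "-" so every row has the same number of columns.
    tokens += ["-"] * (n_columns - len(tokens))
    return [label] + tokens


def lines_to_csv(lines, n_columns=5):
    """Build CSV text (which Excel can open) with headers S1..Sn."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([""] + [f"S{i}" for i in range(1, n_columns + 1)])
    for line in lines:
        if line.strip():  # skip empty lines
            writer.writerow(tokenize_line(line, n_columns))
    return out.getvalue()


sample = [
    "1.String1 String2 String3 String4 String5",
    "2.String6 - String7 - -",
]
print(lines_to_csv(sample))
```

Because the dashes are already present as placeholders, every row splits into the same number of tokens, which is why a plain blank separator is enough here. If the dashes were missing, splitting on blanks alone could not tell which column a word belongs to.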