"tokenize and keep words with dash"

johannesweber Member Posts: 1 Contributor I
edited June 2019 in Help
Hello,

Is there any way to tokenize text into single words without splitting words that contain a dash?

For example, I want to keep the word "state-of-the-art" instead of having four words afterwards.

I saw the option to change the operator's mode to "specific characters", but I don't understand the required syntax.

I would much appreciate an answer.

Best regards

Johannes

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The "specific characters" mode is fine: just list the characters that mark word borders, e.g. dot, comma, space, question mark: "!? ,.". Leave the dash out of that list so that words like "state-of-the-art" stay in one piece. Think carefully and check the results so you don't forget any important delimiters :)

    Best regards,
    Marius
  • HelenZ Member Posts: 3 Contributor I
    This is a really good suggestion and very helpful. I tried using "." as a delimiter to tokenize my document. But now I face the problem that a sentence containing, e.g., the word "u.s." is split right in the middle, because "u.s." contains a dot. Or, to take another example, a sentence containing the number "1.3%" is split as well.

    So is there a way to include exceptions in the "specific characters" mode, and which regular expression would I use for that? Or do I have to add another operator?


    Thank you for your great help. This is very much appreciated.


    Helen
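
To make the two ideas in this thread concrete, here is a minimal sketch in plain Python, outside RapidMiner. It is only an illustration of the logic, not the Tokenize operator's implementation: the names `DELIMITERS`, `split_on_characters`, `TOKEN_PATTERN` and `tokenize_with_exceptions` are hypothetical, and whether the same pattern can be used inside RapidMiner would need to be checked against the operator's documentation. The first function splits on a list of specific characters that leaves out the dash; the second describes the tokens themselves with a regular expression, so that dotted abbreviations and decimal numbers survive.

```python
import re

# Characters treated as word borders (taken from the answer above).
# The dash is deliberately absent, so dashed words are not split.
DELIMITERS = "!? ,."


def split_on_characters(text, delimiters=DELIMITERS):
    """Split text on any of the given characters and drop empty tokens."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]


# Alternative: match tokens instead of listing delimiters.
# The alternatives are tried in order at each position.
TOKEN_PATTERN = re.compile(
    r"\d+(?:\.\d+)*%?"        # numbers such as 1.3 or 1.3%
    r"|(?:[A-Za-z]\.){2,}"    # dotted abbreviations such as u.s.
    r"|\w+(?:-\w+)*"          # plain words and dashed compounds
)


def tokenize_with_exceptions(text):
    """Return all tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(split_on_characters("A state-of-the-art method, really!"))
    print(tokenize_with_exceptions(
        "The u.s. economy grew 1.3% with state-of-the-art tools."
    ))
```

With a pure delimiter list, "u.s." and "1.3%" cannot be protected as long as the dot is a delimiter, which is why the second function switches to matching tokens. The pattern is intentionally small: it will still mishandle cases it does not name (for example, a token like "3rd" is split into "3" and "rd"), so the results should be checked on real documents.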