"Retaining selected word pairs when tokenizing"

carl · Member · Posts: 30 · Guru
edited June 2019 in Help

When tokenizing into single word tokens, is there a way to keep selected pairs of words together as a single token?

For example, in soccer the term "centre forward" makes more sense as a single token. I looked at n-grams, but this pairs words that I do not want to pair. I tried using the stem dictionary, but this seems not to work across multiple tokens, and if I put the stem before tokenize, e.g. to change centre forward to centre-forward, this doesn't appear to work.
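To make the problem concrete, here is a minimal Python sketch (not RapidMiner itself) of what a "non letters" tokenization does to the pair: the two words always end up as separate tokens, which is exactly what the question wants to avoid.

```python
import re

text = "The centre forward scored twice."

# Tokenizing on non-letter characters (like RapidMiner's default
# "non letters" mode) splits the pair into two separate tokens.
tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
print(tokens)  # ['The', 'centre', 'forward', 'scored', 'twice']
```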

Best Answer

  • IngoRM · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RM Researcher, Member, University Professor · Posts: 1,751 · RM Founder
    Solution Accepted

    Hi Carl,

    All observations are correct. Since there is no replace operator across multiple tokens, I think you have to apply the Replace operator on the data set in your case. The other options do not seem to be really feasible here.

    But don't worry, you can actually do this by first transforming your document into an example set, performing the replacement, and transforming it back into a document. The process below shows you how you can do this. Please note that you either need to change your tokenization to something other than "non letters" or you need to use letters as the delimiter in your replacement (or just no delimiter at all).

    This is probably not winning first prize for elegance, but it does the job ;-)

    Hope that helps,

    Ingo
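    Outside RapidMiner, the trick in this answer can be sketched in a few lines of plain Python: replace the multi-word phrase before tokenizing, joining the words with an all-letter marker (here "DELIM", as in the process below) so a "non letters" tokenizer cannot split it.

```python
import re

text = "Debates about political correctness continued."

# Step 1: replace the multi-word phrase before tokenizing, joining the
# words with a marker made only of letters ("DELIM"), so that a
# "non letters" tokenizer keeps the pair intact.
protected = text.replace("political correctness", "politicalDELIMcorrectness")

# Step 2: tokenize on non-letters; the protected pair survives as one token.
tokens = [t for t in re.split(r"[^A-Za-z]+", protected) if t]
print(tokens)  # ['Debates', 'about', 'politicalDELIMcorrectness', 'continued']
```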

    <参数键= " replace_what" value="political correctness"/>
    <参数键= " replace_by" value="politicalDELIMcorrectness"/>


    <列出柯y="specify_weights"/>



    <参数键= = "“字符”价值。: " / >













Answers

  • IngoRM · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RM Researcher, Member, University Professor · Posts: 1,751 · RM Founder

    Hi,

    You could apply a "Replace" operator before you tokenize. Let's assume your text documents are initially stored as values in a nominal / text column. Then you can use "Replace" to, well, replace "centre forward" with "centre_forward", which will be kept as-is in a later tokenization.

    Hope that helps,

    Ingo

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    You could do this a couple of different ways. First, you could use n-grams and then a custom stopword dictionary after that to remove the n-grams you are not interested in. The dictionary just requires a text file as input, so if you output the word list after the n-gram step using "Wordlist to Data", you should be able to copy/paste the relevant items into a text file fairly easily. This is probably the way I would do it if I had a large number of substitutions to make.

    Another approach would be to use a stem dictionary. It sounds like you tried a variation of this, but you would want to place it after the n-gram operator and after tokenize. I don't see why that approach wouldn't work, although I haven't tried it.

    A third option if you have only a few of these substitutions to make is simply to use the "replace token" operator, which allows you to use regular expressions for your substitution search.

    I hope this helps!

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • IngoRM · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RM Researcher, Member, University Professor · Posts: 1,751 · RM Founder

    Hi Brian,

    Just to add on the last one: the problem is that "Replace Token" only works on single tokens, so if you already have tokenized the text, the two words are now separated into two tokens and can no longer be replaced...

    Cheers,

    Ingo

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    Yep, sorry, I should have clarified that in this instance you can use "replace token" after you have generated n-grams, so you could turn "centre_.*" into "centre_forward" or similar.
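    As an illustration (in plain Python rather than RapidMiner's Replace Token operator), a regex substitution applied to the already n-grammed token list does exactly this kind of normalization; the token names here are just the thread's running example.

```python
import re

# Tokens after n-gram generation; a regex replacement analogous to
# "Replace Token" applied after n-grams normalizes the ones of interest,
# e.g. collapsing any "centre_..." n-gram into "centre_forward".
tokens = ["the", "centre_forward", "scored", "centre_back"]
normalized = [re.sub(r"centre_.*", "centre_forward", t) for t in tokens]
print(normalized)  # ['the', 'centre_forward', 'scored', 'centre_forward']
```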

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • carl · Member · Posts: 30 · Guru

    Thanks Brian / Ingo. Sorry, I should have attached my process with the question. Looking at the different options:

    1 - Replace before tokenization acts on an example set, and I'm initially processing a document.

    2 - N-grams would create too many pairings that I'm not interested in, and if I deleted these, my frequency count would understate certain words if they'd been part of the n-grams.

    3 - Stemming after tokenization would require a lot of patterns, and I'd need to re-aggregate the word frequencies after breaking up the n-grams I'm not interested in.

    4 - Replace after tokenization and n-gramming would have a similar effect to 3.

    For the most part, I'm interested in single words, with just a few exceptions where compound nouns (or concepts) make more sense than the individual words. And I wanted to see if I could distill a PDF in as few steps as possible. So a Replace acting early on a document, to hyphenate the concepts I want to retain as tokens, would be ideal if that were possible.

    Only a fragment of the attached process XML survives:

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.3.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="581" y="34"/>
  • carl · Member · Posts: 30 · Guru

    Thank you. That worked well. I used Replace (Dictionary) so I could maintain a small number of replacements (via an Excel file).
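    The same dictionary-driven approach can be sketched in plain Python: a small phrase-to-token table (here with made-up entries standing in for the spreadsheet rows) applied to the text before tokenization.

```python
# A small replacement table, analogous to Replace (Dictionary) fed from
# a spreadsheet: multi-word phrase -> single-token form. The entries
# here are illustrative, not from the original process.
replacements = {
    "centre forward": "centre_forward",
    "penalty area": "penalty_area",
}

def protect_phrases(text, table):
    # Apply each phrase replacement before tokenizing.
    for phrase, token in table.items():
        text = text.replace(phrase, token)
    return text

text = "The centre forward shot from the penalty area"
tokens = protect_phrases(text, replacements).split()
print(tokens)  # ['The', 'centre_forward', 'shot', 'from', 'the', 'penalty_area']
```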
