How to obtain a word list for entries containing the wildcards *, ?, #

EL75 Member Posts: 43 Contributor II
Hello rapidminer community,
Hope you're doing well in this crisis period...

Here's my topic, and thank you all for your help and advice.
As stated in the subject, I'd like to obtain a list of the words matched by entries containing the wildcards *, ?, #.
The reason is that I'm trying to migrate a dictionary from one platform to another.
In my original dictionary, I have words with the wildcards *, ?, #.
The new platform doesn't accept such characters and forces me to create a single line for each inflected form.
These wildcards can appear in parts of words or in word sequences.
Using a snowball « * » allows me, in my present dictionary, to capture all the parts of texts relating to these variations (plural, gender, grammatical declension, etc.).

For example, SUPPORT* will mean SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc.
The word pattern *SUPPORT*, in turn, matches all words containing the substring "SUPPORT", such as UNSUPPORTEDLY, UNSUPPORTED, etc.
An expression that includes several words can also be matched by joining the words with underscore characters. For example, the expression "going out" is written GO*_OUT.
But my needs go beyond the snowball wildcard:
- « ? » is used to replace any single character in a word,
- « # » is used to replace any digit, « ## » any two digits, etc.
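
For readers who want to experiment outside the dictionary platform, the wildcard rules above can be approximated with regular expressions. The following is a minimal Python sketch, not the platform's actual matching logic: the function name `wildcard_to_regex` is hypothetical, and treating `*` as "any run of word characters", `_` as whitespace, and matching case-insensitively are assumptions.

```python
import re

def wildcard_to_regex(pattern):
    """Translate one dictionary entry into a compiled regular expression.

    Assumed semantics (inferred from the description above):
      *  any run of word characters, possibly empty
      ?  exactly one character
      #  exactly one digit
      _  separator between the words of an expression
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r".")
        elif ch == "#":
            parts.append(r"\d")
        elif ch == "_":
            parts.append(r"\s+")
        else:
            # Escape everything else so it is matched literally.
            parts.append(re.escape(ch))
    # Case-insensitive matching is an assumption, not a documented rule.
    return re.compile("".join(parts), re.IGNORECASE)

# Usage: fullmatch() requires the whole word to fit the pattern.
if wildcard_to_regex("SUPPORT*").fullmatch("supporting"):
    print("SUPPORT* matches 'supporting'")
```

Using `fullmatch` rather than `search` keeps SUPPORT* from matching UNSUPPORTED, which is what the *SUPPORT* form is for.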

Therefore, I need to migrate my current dictionary (French words), which contains thousands of rows with ITEMS containing wildcards. Is there a solution that would give me, for each such entry, all the corresponding words?

I believe this is a tricky thing to solve... but I'm stuck in the process and can't move forward.
I would be so happy to find a solution :)

Thank you so much in advance for any help.

Answers

  • kayman Member Posts: 662 Unicorn
    Have you tried regex?
    You will have some translation to do from the old format, but in essence the logic remains (more or less) the same...

    The RapidMiner stem operators (part of the text processing add-ons) also allow *. I'm not sure about the other wildcards as I've never used them, but did you give these a try as well?

  • EL75 Member Posts: 43 Contributor II
    edited November 2020
    Hi Kayman,
    thank you so much for your help!
    Could you explain how a regex could find all the declensions of a lemma?

    For example, a REGEX should:
    - start reading the first row and find, in the column named 'ITEM' of my dictionary, a word containing an asterisk (*)
    - after finding SUPPORT*, find all its declensions in a French word list (I have several), e.g. SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc.
    - then create lines for each new word in my dictionary
    - add the words found in the column ITEM
    and continue the process until the last row…
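
The steps above could be sketched in Python roughly as follows. This is an illustration outside RapidMiner, under several assumptions: the dictionary is available as a list of rows with an 'ITEM' column, the `CATEGORY` column and the function names are hypothetical, and the wildcard semantics (`*` = any run of word characters, `?` = one character, `#` = one digit) are inferred from the description in this thread.

```python
import re

# Assumed wildcard semantics: * = any run of word characters,
# ? = exactly one character, # = exactly one digit.
WILDCARDS = {"*": r"\w*", "?": r".", "#": r"\d"}

def to_regex(item):
    """Build a case-insensitive regex from one ITEM value."""
    body = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in item)
    return re.compile(body, re.IGNORECASE)

def expand_dictionary(rows, word_list):
    """Return new rows in which each wildcard ITEM is replaced by
    one row per matching word from word_list.

    rows: list of dicts, each with an 'ITEM' key plus any other columns.
    word_list: reference list of real words to search in.
    """
    expanded = []
    for row in rows:
        item = row["ITEM"]
        if not any(ch in item for ch in WILDCARDS):
            # No wildcard: keep the row unchanged.
            expanded.append(row)
            continue
        rx = to_regex(item)
        for word in word_list:
            if rx.fullmatch(word):
                # Copy the other columns, replace only ITEM.
                expanded.append({**row, "ITEM": word})
    return expanded

# Usage with a tiny, made-up dictionary and word list.
rows = [
    {"ITEM": "SUPPORT*", "CATEGORY": "help"},   # CATEGORY is hypothetical
    {"ITEM": "ECRAN", "CATEGORY": "screen"},
]
words = ["support", "supports", "supporter", "table", "insupportable"]
for new_row in expand_dictionary(rows, words):
    print(new_row["ITEM"], new_row["CATEGORY"])
```

Note that in this sketch a wildcard entry with no match in the word list produces no rows at all, so the quality of the result depends entirely on how complete the reference word list is.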

    Another point is that an entry in the dictionary can contain multiple words, some of which contain a « * » and/or a « ? », e.g. the French expression «temp* d?ecran*». This returns all verbatims dealing with time spent on screens, including misspellings and the presence or absence of accented characters, which is frequent in French (temps d’écran, temp d’ecran, temps d’écrans, etc.). Such cases are so frequent in French that, when carrying out semantic analysis of verbatims, it is really useful to capture all those expressions in the same « folder ».

    For words containing a « ? », would the process be identical, considering that the « ? » can be anywhere in a word? For instance, as I’m working on a dataset of French verbatims coming from social networks, I have entries in my dictionary that allow me to capture the different ways people write about the same theme, including misspellings. The entry «sans_que_?es_parent*_le_voi*» is a good example of the overall issue in my dictionary’s migration:
    - « ?es » should return « tes, des, les, mes »
    - « parent* » should return « parent, parents »
    - « voi* » should return « voie, voient, vois, voit », etc.
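
One way to handle such multi-word entries is to expand each word separately against a word list and then combine the alternatives. The Python sketch below illustrates this idea only; the function names are hypothetical, the small word list is made up for the example, and the wildcard semantics are the same assumptions as before (`*` = any run of word characters, `?` = one character, `#` = one digit, `_` = word separator).

```python
import re
from itertools import product

# Assumed wildcard semantics, inferred from the examples above.
WILDCARDS = {"*": r"\w*", "?": r".", "#": r"\d"}

def token_matches(token, word_list):
    """Return the words from word_list that fit one token of an entry."""
    if not any(ch in token for ch in WILDCARDS):
        return [token]  # plain word: keep it as-is
    body = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in token)
    rx = re.compile(body, re.IGNORECASE)
    return [w for w in word_list if rx.fullmatch(w)]

def expand_expression(entry, word_list):
    """Expand a multi-word entry such as 'sans_que_?es_parent*_le_voi*'
    into every combination of the words matched by each token."""
    tokens = entry.split("_")
    options = [token_matches(t, word_list) for t in tokens]
    return [" ".join(combo) for combo in product(*options)]

# Usage with a made-up word list.
word_list = ["tes", "des", "les", "mes", "parent", "parents",
             "voie", "voient", "vois", "voit", "table"]
expanded = expand_expression("sans_que_?es_parent*_le_voi*", word_list)
print(len(expanded))
print(expanded[0])
```

The number of generated lines is the product of the alternatives per token (here 4 × 2 × 4), so it grows quickly, and many combinations will be grammatically wrong. Because the original entries were also meant to catch misspellings, that may be acceptable, but the output is worth filtering before loading it into the new platform.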

    I thought perhaps a RapidMiner process, combined with regex, could allow me to do that?

    Thank you for your help. I’m quite a beginner in coding :( and really stuck in this migration process...
    best regards
  • EL75 Member Posts: 43 Contributor II
    Hi Kayman, I'm so glad to read your answer, which is very precise and instructive.
    I’ve read it in detail and, with your help, I feel comfortable starting to play with regex!
    Could you please tell me how (within RapidMiner) I can set the target of the regex search? I mean that the regex has to take the patterns from the original dictionary and, on the other side, find the matching words in another file.
    best regards,
  • kayman Member Posts: 662 Unicorn
    I lost you a bit :-)
    Could you share some examples, such as reduced source files, and elaborate on what you would like to get as an outcome? That makes it easier to get an overall view of the problem and a potential solution.
  • EL75 Member Posts: 43 Contributor II
    Hi Kayman,
    thank you very much for your help.
    It took me some time to find the right way, and you're right, regexes are powerful.
    best regards,
    PS: I've posted a new question about encoding apostrophes when exporting a CSV file, in case you know how to do it.
