How to obtain the list of words matched by dictionary entries containing the wildcards *, ?, #
Hello RapidMiner community,
Hope you're doing well in this crisis period...
Here's my topic, and thank you all for your help and advice.
As stated in the subject, I'd like to obtain the list of words that correspond to entries containing the wildcards *, ?, and #.
The reason is that I'm trying to migrate a dictionary from one platform to another.
In my original dictionary, I have words with the wildcards *, ?, and #.
The new platform doesn't accept such characters and forces me to create a separate line for each inflected form.
These wildcards can be attached to parts of words or used within word sequences.
Using a snowball « * » allows me, in my current dictionary, to capture every piece of text matching these variations (plural, gender, grammatical inflection, etc.).
For example, SUPPORT* will mean SUPPORT, SUPPORTS, SUPPORTING, SUPPORTIVE, SUPPORTER, etc.
The pattern *SUPPORT*, by contrast, also covers every word containing the substring "SUPPORT", such as UNSUPPORTEDLY, UNSUPPORTED, etc.
An expression made up of several words can also be covered by joining the words with underscore characters. For example, the expression "going out" is written GO*_OUT.
But my needs go beyond the snowball, as there are other wildcards:
- « ? » replaces any single character in a word,
- « # » replaces any single digit, « ## » two digits, etc.
Therefore, I need to migrate my current dictionary (French words), which contains thousands of rows with items containing wildcards. Is there a solution that would give me, for each such item, all the corresponding words?
I believe this is a tricky thing to solve... but I'm stuck in the process and can't move forward.
I would be so happy to find a solution.
Thank you so much in advance for any help.
Best Answer
kayman
Regex is fairly similar, but the syntax can be a bit scary at first...
In order to mimic SUPPORT* you'd need to use SUPPORT.*
The dot (.) basically means 'any character allowed', and the star (*) means 'as many as you can find', including none.
This also means that if you want to match a single character, the dot alone is sufficient.
So .ES returns TES, YES, NES, etc.
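To make the dot and the star concrete, here is a minimal Python sketch (RapidMiner itself uses Java regex, but these basics behave the same way). The word lists are made up for illustration, and re.fullmatch is used on the assumption that the whole word has to match the pattern.

```python
import re

# Hypothetical sample words, only for illustration.
words = ["SUPPORT", "SUPPORTS", "SUPPORTING", "UNSUPPORTED", "SUPPER"]

# SUPPORT.* : the literal SUPPORT followed by zero or more characters.
# re.fullmatch requires the entire word to match the pattern.
support_like = [w for w in words if re.fullmatch(r"SUPPORT.*", w)]

# .ES : exactly one arbitrary character followed by ES.
short_words = ["TES", "YES", "ES", "SNES"]
one_char_then_es = [w for w in short_words if re.fullmatch(r".ES", w)]

print(support_like)       # ['SUPPORT', 'SUPPORTS', 'SUPPORTING']
print(one_char_then_es)   # ['TES', 'YES']
```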
This is where the problems start: you want your search to be case-insensitive, so you need to explicitly tell the regex compiler to ignore case. You do this by starting your query with the (?i) syntax.
So (?i).ES returns TES, but also Tes, TeS, etc.
But it would also return honestly, as it's just looking for es with a character in front...
This is where boundaries come into play: you tell the regex compiler where to start and/or end.
If you only have single words, your pattern needs to start at the beginning of a line, using the caret symbol (^).
So (?i)^.ES will now match only words that start with a single character followed by ES, regardless of upper or lower case...
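A small Python sketch of the difference the caret makes, assuming the pattern is searched for anywhere in the word (re.search) rather than matched against the whole word. The candidate words are invented for the example.

```python
import re

# Hypothetical candidates, only for illustration.
candidates = ["TES", "Tes", "TeS", "honestly"]

# Without an anchor, "nes" inside "honestly" is enough for a hit.
unanchored = [w for w in candidates if re.search(r"(?i).ES", w)]

# With ^ the single character + ES must sit at the very start of the word.
anchored = [w for w in candidates if re.search(r"(?i)^.ES", w)]

print(unanchored)  # ['TES', 'Tes', 'TeS', 'honestly']
print(anchored)    # ['TES', 'Tes', 'TeS']
```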
Fine, that's with 1 character up front. What if we need more, say 2 or 3 before ES?
One way is to go wild with dots, so (?i)^...ES will match everything that has 3 characters before ES (dot dot dot), but that's not very flexible, so we can also use ranges. With curly brackets you can define a range, so this:
(?i)^.{1,5}ES will match everything that starts with 1 to 5 characters followed by ES.
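As a quick check of the range syntax, here is a Python sketch with made-up words. It assumes the whole word has to match (re.fullmatch), so the word must consist of 1 to 5 characters followed by ES and nothing else.

```python
import re

# Hypothetical sample words, only for illustration.
words = ["ES", "yes", "Tables", "enterprises"]

# .{1,5} : between 1 and 5 arbitrary characters, then ES (case-insensitive).
pattern = re.compile(r"(?i)^.{1,5}ES")
ranged = [w for w in words if pattern.fullmatch(w)]

print(ranged)  # ['yes', 'Tables']
```

"ES" fails because at least one character is required in front, and "enterprises" fails because nine characters come before the final ES.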
Let's try the above with some wildcards as well. Say you need any word that contains ES, and it doesn't matter whether it's at the beginning or somewhere in the middle.
The star basically means 0 or more, so
(?i)^.*ES means match everything that has ES, even if it starts with it
If you need at least one character, you use the +, so
(?i)^.+ES means match everything that has ES, but there needs to be at least one other character in front of it
Or if you want at most one character in front, you can use the question mark, which means optional. Note that in regex a few special characters (. * ? | etc.) can have multiple uses; this is the simple version...
(?i)^.?ES means match everything that has ES, with either one character or none in front of it.
So both TES and ES would match, but not SNES.
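The three quantifiers can be compared side by side in a short Python sketch. The three test words are the ones from the text; re.fullmatch is an assumption that mirrors matching a whole dictionary entry.

```python
import re

words = ["ES", "TES", "SNES"]

def matching(pattern: str) -> list[str]:
    """Return the words that match the pattern as a whole."""
    return [w for w in words if re.fullmatch(pattern, w)]

zero_or_more = matching(r"(?i)^.*ES")  # * : any number of characters, even none
one_or_more = matching(r"(?i)^.+ES")   # + : at least one character in front
optional = matching(r"(?i)^.?ES")      # ? : one character or none in front

print(zero_or_more)  # ['ES', 'TES', 'SNES']
print(one_or_more)   # ['TES', 'SNES']
print(optional)      # ['ES', 'TES']
```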
The same goes for the suffix. With the above, it basically only matches words that end with ES (as long as the whole word has to match); if you don't care about what comes after, you need to use something like
(?i)^.*ES.* This basically matches everything that contains ES, anywhere in a word.
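A minimal Python illustration of the "contains" pattern, again with invented words and assuming whole-word matching via re.fullmatch:

```python
import re

# Hypothetical sample words, only for illustration.
words = ["ES", "Test", "honestly", "cat"]

# .*ES.* : anything (or nothing) before ES and anything (or nothing) after it.
contains_es = [w for w in words if re.fullmatch(r"(?i)^.*ES.*", w)]

print(contains_es)  # ['ES', 'Test', 'honestly']
```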
Hope this gets you started. It's a very powerful language, and once you've used it a few times also surprisingly easy, even if it looks strange at this stage...
Answers
You will have some translation to do from the old format, but in essence the logic remains (more or less) the same...
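One possible way to do that translation, sketched in Python rather than inside RapidMiner. Everything here is an assumption for illustration: the function names wildcard_to_regex and expand are hypothetical, the vocabulary is made up, * and ? are mapped to word characters so they stay inside a word, # is mapped to a single digit, and the underscore joiner is mapped to whitespace. Listing "all corresponding words" only works against a word list you already have, since a regex cannot invent valid words on its own.

```python
import re

def wildcard_to_regex(item: str) -> str:
    """Translate an old-style dictionary item into a regex (assumed mapping)."""
    parts = []
    for ch in item:
        if ch == "*":
            parts.append(r"\w*")   # zero or more word characters
        elif ch == "?":
            parts.append(r"\w")    # exactly one word character
        elif ch == "#":
            parts.append(r"\d")    # exactly one digit
        elif ch == "_":
            parts.append(r"\s+")   # word joiner: one or more whitespace characters
        else:
            parts.append(re.escape(ch))  # literal character
    return "(?i)" + "".join(parts)

def expand(item: str, vocabulary: list[str]) -> list[str]:
    """Return every vocabulary entry that the wildcard item covers."""
    pattern = re.compile(wildcard_to_regex(item))
    return [w for w in vocabulary if pattern.fullmatch(w)]

# Hypothetical vocabulary; in practice this would come from your corpus.
vocabulary = ["support", "supports", "supporting", "unsupported",
              "going out", "goes out", "yes", "42"]

print(expand("SUPPORT*", vocabulary))   # ['support', 'supports', 'supporting']
print(expand("GO*_OUT", vocabulary))    # ['going out', 'goes out']
```

Because \w is Unicode-aware in Python 3, accented French letters are treated as word characters too.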
The RapidMiner stem operators (part of the text processing add-ons) do allow * as well. I'm not sure about the other wildcards, as I've never used them, but did you give these a try?
- "parent" ; "parents"
Could you share some examples, such as reduced source files, and elaborate on what you would like to get as the outcome? It's easier this way to get an overall view of the problem and a potential solution.
Thank you very much for your help.
I took some time to find the right way, and you're right, regexes are powerful.
Best regards,
PS: I've posted a new question about encoding apostrophes when exporting a CSV file, in case you know how to do it.