"looping through regex matches and groups"

markus_dressel · February 2017

Hi community,

I might have an easy question about handling regex matches.

I have a document (loaded with the document operator), and now I want to use a regex to retrieve a certain part of the document. My regex produces, e.g., three matches. When I run RapidMiner, all three matches are shown together (appended/joined). So my question is whether there is a way to loop through all regex matches, like I can in Java or Python?

For example like:

import re

s = "ABC12DEF3G56HIJ7"
pattern = re.compile(r'([A-Z]+)([0-9]+)')
for (letters, numbers) in re.findall(pattern, s):
    pass  # do anything

This is just sample code, not my specific task. I just want to know how to loop through regex matches.

I hope my question is quite clear :-)

Best regards,

Markus

MartinLiebig · February 2017

Hi Markus,

Have a look at the attached process. It builds something like this with operators. It uses the new 7.4 loop. There is surely a way to build this with 7.3 as well.

~Martin




<output/>

<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">

<operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="238">

<operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">

<operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="136">

<operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append" width="90" x="246" y="34"/>
<operator activated="true" class="concurrency:loop" compatibility="7.4.000" expanded="true" height="103" name="Loop" width="90" x="447" y="85">

<operator activated="true" class="extract_macro" compatibility="7.4.000" expanded="true" height="68" name="Extract Macro" width="90" x="112" y="34">

<operator activated="true" class="delay" compatibility="7.4.000" expanded="true" height="103" name="Delay" width="90" x="246" y="85">

Execution Order

<operator activated="true" class="text:extract_information" compatibility="7.4.001" expanded="true" height="68" name="Extract Information" width="90" x="447" y="136">

<operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="136">

<connect from_op="Create Document" from_port="output" to_op="Loop" to_port="input 2"/>

kayman · February 2017

You can use the Replace (Dictionary) operator for this purpose.

The easiest way is to create a CSV containing the regex you want to use (the from attribute) and the replacement (the to attribute), tell the operator to use regular expressions, and off you go. It loops through the whole file and replaces content accordingly.
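For readers who think in code, here is a rough Python analogue of the dictionary approach described above. It is not the RapidMiner operator itself, only a sketch of the same idea: read from/to pairs from CSV text and apply each one as a regex replacement, in order. The column names `from` and `to`, the two sample rules, and the helper names `load_rules` and `apply_replacements` are assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical dictionary file: each row maps a regex ("from") to its replacement ("to").
RULES_CSV = """from,to
[0-9]+,<NUM>
(?i)item,ITEM
"""


def load_rules(csv_text):
    """Parse the CSV text into a list of (compiled regex, replacement) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(re.compile(row["from"]), row["to"]) for row in reader]


def apply_replacements(text, rules):
    """Apply every rule to the whole text, one after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


rules = load_rules(RULES_CSV)
result = apply_replacements("Item 12 and item 345", rules)
print(result)
```

Note that this replaces matches in place; it does not hand each match to separate logic, which is the distinction the follow-up question in this thread is about.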

markus_dressel · February 2017

Hi,

Thank you for the quick response and the solution provided. I have loaded your solution, but maybe I did not describe my problem correctly:

Let's say we have a document with the following text:

Item here is some important text Item

Here is no important text

Item here is some additional important text Item

If I use the regex "(?s)(?i)Item.*?Item", I get two matches:

1: Item here is some important text Item

2: Item here is some additional important text Item

See https://regex101.com/r/WYn2nm/1

So the question is: how can I loop through each match and do some stuff with it, keeping in mind that the number of matches varies from document to document?

Something like this:

for match in regex.matches:
    if len(match) > 7:
        do stuff
    else:
        do other stuff
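For comparison, the pseudocode above could be written as runnable Python along these lines. This is only a sketch: the text is the three-line sample from this post, the threshold of 7 characters comes from the pseudocode, and the labels "long" and "short" are placeholders for whatever "do stuff" and "do other stuff" would be.

```python
import re

# Sample document from this post.
text = (
    "Item here is some important text Item\n"
    "Here is no important text\n"
    "Item here is some additional important text Item\n"
)

# Same pattern as in the post; the flags argument is equivalent to the
# inline flags (?s)(?i).
pattern = re.compile(r"Item.*?Item", re.IGNORECASE | re.DOTALL)

results = []
# finditer yields one match object per match, however many there are.
for match in pattern.finditer(text):
    matched = match.group()
    if len(matched) > 7:
        results.append(("long", matched))   # placeholder for "do stuff"
    else:
        results.append(("short", matched))  # placeholder for "do other stuff"

for label, matched in results:
    print(label, repr(matched))
```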

Best regards and thank you for your great support

Markus

kayman · February 2017

I see. As you stated, you know how to do it in Python, so how about using the Execute Python operator? You just create your regex script, pump your data through it, and you are covered.

It should be pretty simple this way. You can probably achieve it with plain vanilla RM as well, but without a clear idea of your data and what you want to achieve, it is a bit hard to support.

Something like this:





<output/>

<operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">

<operator activated="true" class="python_scripting:execute_python" compatibility="7.2.000" expanded="true" height="82" name="regex_on_steroids" width="90" x="313" y="34">
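The process XML above is cut off, so the script inside the regex_on_steroids operator is not shown. What follows is a hedged sketch of what such a script could look like, not the original one. The Execute Python operator calls a function named `rm_main` and passes example sets in as pandas DataFrames. The plain-Python helper below does the actual match loop and returns one row per match, so a varying number of matches per document is not a problem. The helper name `extract_matches`, the input column name `text`, and the output layout are all assumptions.

```python
import re

# Pattern from the question; the flags are equivalent to the inline (?s)(?i).
PATTERN = re.compile(r"Item.*?Item", re.IGNORECASE | re.DOTALL)


def extract_matches(doc_id, text):
    """Return one row (dict) per regex match so each match can be handled separately."""
    rows = []
    for index, match in enumerate(PATTERN.finditer(text), start=1):
        rows.append({
            "doc_id": doc_id,
            "match_no": index,
            "match": match.group(),
            "length": len(match.group()),
        })
    return rows


# Inside Execute Python, a wrapper along these lines could feed the helper
# (assumes pandas is available and the input has a column called "text"):
#
# def rm_main(data):
#     rows = []
#     for doc_id, text in enumerate(data["text"]):
#         rows.extend(extract_matches(doc_id, text))
#     return pandas.DataFrame(rows)

rows = extract_matches(0, "Item a Item\nskip\nitem bb ITEM")
for row in rows:
    print(row)
```

With one row per match in the output, the per-match branching (for example on the length column) can then be done either in the script itself or with regular operators downstream.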
