"looping through regex matches and groups"

markus_dressel Member Posts: 5 Contributor I
edited June 2019 in Help

Hi community,

I have what might be an easy question about handling regex matches.

I have a document (loaded with the document operator), and I want to use a regex to retrieve certain parts of it. Say my regex produces three matches. When I run RapidMiner, all three matches are shown together (appended/joined). So my question is: is there a way to loop through the regex matches one by one, as I can in Java or Python?

For example like:

import re
s = "ABC12DEF3G56HIJ7"
pattern = re.compile(r'([A-Z]+)([0-9]+)')
for (letters, numbers) in re.findall(pattern, s):
    pass  # do anything

This is just sample code, not my specific task. I just want to know how to loop through regex matches.

I hope my question is quite clear :-)

Best regards,

Markus

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,302 RM Data Scientist

    Hi Markus,

    have a look at the attached process. It builds something like this with operators. It uses the new 7.4 loop. There is surely a way to build this with 7.3 as well.

    ~Martin




    <output/>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="238">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="136">
    <operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append" width="90" x="246" y="34"/>
    <operator activated="true" class="concurrency:loop" compatibility="7.4.000" expanded="true" height="103" name="Loop" width="90" x="447" y="85">
    <operator activated="true" class="extract_macro" compatibility="7.4.000" expanded="true" height="68" name="Extract Macro" width="90" x="112" y="34">
    <operator activated="true" class="delay" compatibility="7.4.000" expanded="true" height="103" name="Delay" width="90" x="246" y="85">
    Execution Order
    <operator activated="true" class="text:extract_information" compatibility="7.4.001" expanded="true" height="68" name="Extract Information" width="90" x="447" y="136">
    <operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="136">
    <connect from_op="Create Document" from_port="output" to_op="Loop" to_port="input 2"/>











    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • kayman Member Posts: 662 Unicorn

    You can use the Replace (Dictionary) operator for this purpose.

    The easiest way to proceed is to create a CSV containing the regex you want to use (the "from" attribute) and the replacement (the "to" attribute), tell the operator to use regular expressions, and off you go. It will loop through the whole file and replace content accordingly.
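For readers who think in code, the dictionary idea can be sketched in plain Python. This is only an illustration of the concept, not how the operator is implemented; the column names `from` and `to`, the helper names, and the two sample rules are all assumptions made up for the example.

```python
import csv
import io
import re

def load_rules(csv_file):
    """Read (compiled regex, replacement) pairs from a CSV with 'from' and 'to' columns."""
    reader = csv.DictReader(csv_file)
    return [(re.compile(row["from"]), row["to"]) for row in reader]

def apply_rules(text, rules):
    """Apply every rule in order; each rule sees the output of the previous one."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

# Hypothetical dictionary; in practice this would be a CSV file on disk.
RULES_CSV = "from,to\n[0-9]+,#\n[A-Z]{3},X\n"

rules = load_rules(io.StringIO(RULES_CSV))
print(apply_rules("ABC12DEF3G56HIJ7", rules))
```

Because the rules run in sequence, their order in the file matters: a later regex operates on text that earlier rules have already rewritten.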

  • markus_dressel Member Posts: 5 Contributor I

    Hi,

    thank you for the quick response and the solution you provided. I have loaded your solution, but maybe I did not describe my problem correctly:

    Let's say we have a document with the following text:

    Item here is some important text Item

    Here is no important text

    Item here is some additional important text Item

    If I use the regex "(?s)(?i)Item.*?Item", I get two matches:

    1: Item here is some important text Item

    2: Item here is some additional important text Item

    See https://regex101.com/r/WYn2nm/1

    So the question is: how can I loop through each match and do something with it, keeping in mind that the number of matches varies from document to document?

    Something like this:

    for match in regex.matches:
        if len(match) > 7:
            do stuff
        else:
            do other stuff
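For reference, the pseudocode above can be written as runnable Python along these lines. This is a minimal sketch: the function name `handle_matches`, the `threshold` parameter, and the "long"/"short" labels are placeholders invented for the example, standing in for whatever "do stuff" and "do other stuff" would be. The inline flags `(?s)(?i)` are expressed as `re.DOTALL | re.IGNORECASE`.

```python
import re

# The regex from the post, with the inline flags (?s)(?i) written as constants.
ITEM_PATTERN = re.compile(r"Item.*?Item", re.DOTALL | re.IGNORECASE)

def handle_matches(text, pattern=ITEM_PATTERN, threshold=7):
    """Visit each match once and branch on its length."""
    results = []
    for m in pattern.finditer(text):
        matched = m.group(0)
        if len(matched) > threshold:
            results.append(("long", matched))   # placeholder for "do stuff"
        else:
            results.append(("short", matched))  # placeholder for "do other stuff"
    return results

document = (
    "Item here is some important text Item\n"
    "Here is no important text\n"
    "Item here is some additional important text Item"
)

for label, matched in handle_matches(document):
    print(label, repr(matched))
```

`finditer` yields one match object at a time, so the loop works no matter how many matches a given document contains.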

    Best regards and thank you for your great support

    Markus

  • kayman Member Posts: 662 Unicorn

    I see. As you said you know how to do it in Python, how about using the Execute Python operator? You just create your regex script, pump your data through it, and you are covered.

    It should be pretty simple this way. You can probably achieve it with plain vanilla RapidMiner as well, but without a clear idea of the data you have and what you want to achieve, it's a bit hard to help.
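As a rough sketch of what such a script might contain: the Execute Python operator calls a function named `rm_main` and passes its inputs as pandas DataFrames. Everything else here is an assumption for illustration, including the column name `text`, the helper `find_matches`, and the output column `match`. It also relies on `DataFrame.explode`, which needs pandas 0.25 or newer.

```python
import re

# Same pattern as "(?s)(?i)Item.*?Item", with the inline flags written as constants.
ITEM_PATTERN = re.compile(r"Item.*?Item", re.DOTALL | re.IGNORECASE)

def find_matches(text):
    """Return every non-overlapping match as its own list entry."""
    return [m.group(0) for m in ITEM_PATTERN.finditer(str(text))]

def rm_main(data):
    # 'data' is the pandas DataFrame handed in by the operator.
    # ASSUMPTION: the document text sits in a column named "text".
    out = data.copy()
    out["match"] = out["text"].apply(find_matches)
    # explode() turns each list entry into its own row; a document with
    # no matches ends up as a single row with a missing value.
    return out.explode("match").reset_index(drop=True)
```

Returning one row per match means the rest of the process can filter or loop over an ordinary example set, however many matches each document produced.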

    Something like this:





    <output/>
    <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.2.000" expanded="true" height="82" name="regex_on_steroids" width="90" x="313" y="34">










