Problem with collecting specific information using RegEx

lukei_11 · February 2018

Hey RapidMiner community,

I have a problem with the use of a RegEx:

I'd like to collect the addresses of different institutions and companies. To do this I use the Crawl Web operator and collect the pages that contain the address information. This step works perfectly. In the next step I want to retrieve the street and the ZIP code + city. For that I use the following RegEx in the "Extract Information" operator:

(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})

With this RegEx I'd like to collect the following:

For example, from this site: http://www.vfb.de/de/1893/club/service/formales/impressum/

I want "Mercedesstraße 109" and "70372 Stuttgart" as the result.

The part for the ZIP code (starting with either 6 or 7) and the name of the city works. Because of that, I want to capture the line above it as well. But as soon as I add the first part (.+\s) to capture the line above the ZIP code and city, the result in the results view of my process is just a ? (question mark). Is there a mistake in my RegEx, or does RapidMiner require a special format? When I test my RegEx in a free online RegEx tester it works properly...

Thank you!

lukei_11

BalazsBarany · February 2018

Hi!

There are special cases for multiline regexes. Make sure to use an online regular expression tester that lets you use Java syntax, as that's what RapidMiner uses. The best option is the tester built into RapidMiner, which is available e.g. in the Replace operator.

Here's an example process:








<process expanded="true">


<parameter key="daten" value="&quot;VfB Stuttgart 1893 AG
VfB Stuttgart 1893 AG
Mercedesstraße 109
70372 Stuttgart&quot;"/>




<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="daten"/>
<parameter key="replace_what" value="(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>



<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>

Click on the Replace operator and then on the small button on the right side of "replace what". You'll see the built-in regular expression tester. Paste your example text, the matched part will be highlighted. Here you can play with the expression until it does what you want.
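
The same check can also be reproduced outside RapidMiner with plain java.util.regex, which uses the same syntax. The following is only a minimal sketch: the class name AddressRegexDemo and the helper extract are made up for illustration, and the input is the plain-text sample from the example process, not the crawler's real output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of RapidMiner.
class AddressRegexDemo {
    // The expression from the thread, unchanged (backslashes doubled for Java).
    static final Pattern ADDRESS = Pattern.compile(
        "(.+\\s)((D|d|DE)?\\-?[6-7][0-9]{4}\\s[A-Z][a-z]{1,})");

    /** Returns {street, zipAndCity} for the first match, or null if nothing matches. */
    static String[] extract(String text) {
        Matcher m = ADDRESS.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Group 1 ends with the whitespace matched by \s (here the line break), so trim it.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        // \u00df is the letter sharp s, escaped to keep the source ASCII-only.
        String text = "VfB Stuttgart 1893 AG\nMercedesstra\u00dfe 109\n70372 Stuttgart";
        String[] parts = extract(text);
        System.out.println(parts[0]); // the street line
        System.out.println(parts[1]); // 70372 Stuttgart
    }
}
```

Without the DOTALL flag, "." does not match line terminators, while \s does. That is why (.+\s) picks up exactly the one line directly above the ZIP code when the input is plain text with line breaks.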

Regular expressions are not the best method for this, though, if your output is not "regular".

Regards,

Balázs

lukei_11 · February 2018

Dear Balazs,

Thank you for your quick response! I tried your solution but it isn't working... When I test my RegEx in the testing environment it matches what it should, but when I run my process it only returns a ? (question mark) as the result.

[Image: RegEx problem.PNG] In the test environment the RegEx matches the right parts.

Is there any mistake in my process?








<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">

<parameter key="url" value="https://www.vfb.de/"/>

<parameter key="follow_link_with_matching_url" value=".+impressum(.*)?"/>
<parameter key="store_with_matching_url" value=".+impressum(.*)?"/>

<parameter key="max_crawl_depth" value="1"/>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="enable_basic_auth" value="false"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="false"/>
<parameter key="include_binary_content" value="false"/>
<parameter key="output_dir" value="C:\Users\lukei\Desktop\Impressum"/>
<parameter key="output_file_extension" value="html"/>
<parameter key="max_pages" value="100"/>
<parameter key="max_page_size" value="1000"/>
<parameter key="delay" value="200"/>
<parameter key="max_concurrent_connections" value="100"/>
<parameter key="max_connections_per_host" value="50"/>
<parameter key="user_agent" value="rapidminer-web-mining-extension-crawler"/>
<parameter key="ignore_robot_exclusion" value="false"/>


<parameter key="create_word_vector" value="false"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>

<process expanded="true">

<parameter key="query_type" value="Regular Expression"/>

<parameter key="attribute_type" value="Nominal"/>

<parameter key="PLZ und Ort" value="([6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>
<parameter key="PLZ" value="[6-7][0-9]{4}\s"/>
<parameter key="Strasse" value="(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>




<parameter key="ignore_CDATA" value="true"/>
<parameter key="assume_html" value="true"/>





<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>




<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>

You mentioned that this is possibly not the best way of doing it... Do you have another idea for how to automatically collect the contact information and addresses of soccer clubs (and there are many soccer clubs in Germany...)?

Thank you!

BalazsBarany · February 2018

Hi!

You're searching for only characters and numbers in the street. However, the web site does not contain the literal text Mercedesstraße 109: the ß is expressed as an HTML entity (such as &szlig;).

The text you're testing is not what's coming out of the crawler operator. Your process is set up to return the HTML code.
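
To illustrate the difference between the tested text and the crawled HTML, here is a rough sketch in plain Java. The HTML snippet is an assumption made up for this example, not the actual source of the VfB page, and toPlainText is a hypothetical, deliberately crude clean-up that handles only line-break tags, other tags and the single entity &szlig;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of RapidMiner.
class HtmlAddressDemo {
    // The expression from the thread, unchanged (backslashes doubled for Java).
    static final Pattern ADDRESS = Pattern.compile(
        "(.+\\s)((D|d|DE)?\\-?[6-7][0-9]{4}\\s[A-Z][a-z]{1,})");

    /** Crude clean-up: <br> becomes a newline, other tags are dropped, &szlig; is decoded. */
    static String toPlainText(String html) {
        return html
            .replaceAll("(?i)<br\\s*/?>", "\n")
            .replaceAll("<[^>]+>", "")
            .replace("&szlig;", "\u00df");
    }

    /** Returns {street, zipAndCity} for the first match, or null if nothing matches. */
    static String[] extract(String text) {
        Matcher m = ADDRESS.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        // Assumed markup: street and ZIP code separated only by a <br /> tag.
        String html = "<p>Mercedesstra&szlig;e 109<br />70372 Stuttgart</p>";

        // On the raw HTML no whitespace directly precedes the ZIP code, so nothing matches.
        System.out.println(extract(html) == null); // true

        // After the clean-up the two address lines are found.
        String[] parts = extract(toPlainText(html));
        System.out.println(parts[0]); // the street line
        System.out.println(parts[1]); // 70372 Stuttgart
    }
}
```

In a RapidMiner process, a comparable effect would require converting the crawled HTML to plain text before the Extract Information step, or writing the expression against the markup itself.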

This approach doesn't work well because you start on a few sites, tune your regexp to detect the addresses there, then you encounter additional sites with a different format, you tune the regexp more, then it doesn't work on the original site anymore, or gives you too many false hits etc.

This kind of processing is very hard. Google is trying to do it and even for them it sometimes fails if somebody was very creative when writing the address.

Your best bet is to find a structured listing. Maybe on Wikipedia? Wikidata? The DFB?

Regards,

Balázs

sgenzer · February 2018

Have you tried using the Data Search extension? Your example looks very, very similar to the one used by @ey in the tutorial.

Scott

ey · February 2018

Hi lukei_11,

Please try out the Read HTML Tables operator from the Web Table Extraction extension.

Best Wishes,

Edwin
