"Web Crawler Crawling Rules [SOLVED]"

Datadude, Member, Posts: 9, Contributor II
edited June 2019 in Help
I don't understand how the web crawling rules work. I've been trying to scrape a particular site and pull a set of listings from it in order to parse them, but getting the regular expressions/rules to work has been challenging.

The root of my search is something like the following:

http://www.mysite.com/browse/division

What I'm trying to do is pull down all the business site pages found on the site. These pages have the following format:

http://www.mysite.com/site/business-site-1

So... I am able to pull down all the pages with the following rules:







But the problem is that this casts too broad a net. I'm picking up links which have the following format: http://www.mysite.com/es/site/business-site-1. They're in Spanish so I don't want 'em. I don't know how to exclude them. My latest attempt is the following:

http://www.mysite.com/browse/division.*"/>





But this doesn't work. The actual links in the source use relative links: /site/business-site-1. Is the RapidMiner crawler resolving these links to absolute form? I've also tried spelling out the absolute paths in the rules like so:

http://www.mysite.com/site/.*"/>

But this doesn't work either. Is there something going on here with the order of the rules themselves? Are the rules OR'ed? I'm struggling a little here, and the regular expressions seem to work fine outside the Web Crawler context.
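For anyone wanting to test the patterns outside the crawler, here is a minimal Python sketch of the two questions above. It assumes a crawler resolves relative links against the page URL before applying its rules and that a rule must match the whole URL; neither is confirmed here for the RapidMiner operator. The `BROAD` pattern is a hypothetical stand-in for an over-inclusive rule, since the original rules are not shown.

```python
import re
from urllib.parse import urljoin

# Page being crawled and the relative links found in its source (from the post).
PAGE_URL = "http://www.mysite.com/browse/division"
RELATIVE_LINKS = ["/site/business-site-1", "/es/site/business-site-1"]


def resolve_links(page_url, hrefs):
    """Resolve each href against the page URL, as a browser would."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical over-inclusive rule: anything containing "/site/".
BROAD = re.compile(r".*/site/.*")
# Stricter rule: "/site/" must come directly after the host name.
STRICT = re.compile(r"http://www\.mysite\.com/site/.*")


def matching(pattern, urls):
    """Return the URLs that the pattern matches in full."""
    return [url for url in urls if pattern.fullmatch(url)]


absolute_links = resolve_links(PAGE_URL, RELATIVE_LINKS)
broad_hits = matching(BROAD, absolute_links)    # keeps the /es/ page too
strict_hits = matching(STRICT, absolute_links)  # drops the /es/ page

print(broad_hits)
print(strict_hits)
```

Under these assumptions, anchoring the pattern at the host name is enough to exclude the Spanish pages without a separate exclusion rule.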

Answers

  • MariusHelf, RapidMiner Certified Expert, Member, Posts: 1,869, Unicorn
    Hi,

    On Rapid-I.com the process below works perfectly. Maybe you have to include the absolute URL in the store rule as well?

    Best regards,
    Marius





    <macros/>




    http://rapid-i.com"/>

    http://rapid-i.com/content/view/.*/1/lang,en/"/>
    http://rapid-i.com/content/view/.*/1/lang,en/"/>















  • Datadude, Member, Posts: 9, Contributor II
    Ok,

    Finally figured it out. It looks like you can only have one rule of each type, although that isn't very clear from the interface. You can use the matching groups functionality to find matching phrases in the URLs, which works well for my use case. I'm not even using the captured groups, but this helps match up a "word" in the URL. Here are my two (and only two) revised rules:

    http://www.mysite.com/(browse/site|site).*"/>
    http://www.mysite.com/(browse/site|site).*"/>
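The alternation trick can be checked outside the crawler with a short Python sketch. It assumes the rule has to match the entire URL; the sample URLs, including the `/browse/site/page-2` path, are made up for illustration. The pattern is copied from the post as written, so its unescaped `.` characters match any character, not just a literal dot.

```python
import re

# Pattern from the post: one rule that covers two path prefixes via a group.
RULE = re.compile(r"http://www.mysite.com/(browse/site|site).*")

# Illustrative URLs (hypothetical paths on the placeholder host).
CANDIDATES = [
    "http://www.mysite.com/site/business-site-1",
    "http://www.mysite.com/browse/site/page-2",
    "http://www.mysite.com/es/site/business-site-1",
]


def accepted(urls):
    """Return the URLs that the single combined rule matches in full."""
    return [url for url in urls if RULE.fullmatch(url)]


kept = accepted(CANDIDATES)
print(kept)
```

Because the group sits directly after the host name, the `/es/` URLs fall outside both alternatives and are skipped.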