"WEB crawler rules"
Hi!
I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL but I'm completely new to RapidMiner.
So I've started to use the Web Crawler operator.
I'm using it to crawl a Slovenian real estate web page and I'm having trouble setting the Web crawler rules.
I know that there are two important rules: which URLs to follow and which to store.
I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html?id=<something>.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280
What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+
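As far as I can tell, the crawler treats each rule as a Java regular expression that has to match the entire URL (String.matches() semantics), not just find a substring, so a rule can be tried outside RapidMiner first. Here is a minimal sketch in Python, whose re.fullmatch mimics that behaviour (the rule and URL are the ones above):

import re

follow_rule = r".+pg.+|.+id.+"
url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"

# The pattern must cover the whole URL, as with Java's String.matches();
# a bare substring pattern such as r"id=" would not match here.
print(bool(re.fullmatch(follow_rule, url)))  # True
print(bool(re.fullmatch(r"id=", url)))       # False: matches only part of the URL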
Any help would be appreciated!
U.
Answers
On a quick check I got some results with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+
And I also increased the max page size to 10000.
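If you want to sanity-check a rule outside RapidMiner, the rules should behave like Java regular expressions matched against the whole URL. A quick sketch in Python, assuming whole-URL matching (re.fullmatch stands in for the crawler's own matching, which I haven't inspected):

import re

rule = r".+id.+"
ad_url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"

# The ad URL contains "id", so the store rule accepts it.
print(bool(re.fullmatch(rule, ad_url)))  # True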
As always I have to ask this: did you check that the site policy/copyright note allows you to machine-crawl that page?
Best regards,
Marius
The web page allows robots.
Your example stores only the real estate ads on the first page; the Web crawler doesn't go on to the second, third, ... page.
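That behaviour makes sense if you test the .+id.+ rule against a pagination link: a second-page URL looks like .../nepremicnine.html?q=sale&pg=2, which contains pg= but no "id" anywhere, so it never matches the follow rule and the crawler never leaves the first page. A small sketch, again assuming whole-URL matching:

import re

rule = r".+id.+"
ad_url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"
page2_url = "http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=2"

print(bool(re.fullmatch(rule, ad_url)))     # True:  contains "id"
print(bool(re.fullmatch(rule, page2_url)))  # False: no "id", link is not followed

This is presumably why a combined follow rule such as .+pg.+|.+id.+ is needed, while the store rule can stay as .+id.+.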
Thanks for helping.
I put the Web crawler problem aside for a while. Today I started to deal with it again. I still have a problem with the crawling rules; all the other Web crawler attributes are clear.
This is my Web crawler process:
[process XML not shown]
As you can see, I try to follow three types of URL, for example:
http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923
and I want to store one type of URL:
http://www.realestate-slovenia.info/nepremicnine.html?id=5469846
So for the first task my rule is:
http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale&pg=.+ | id=.+)
For the second task the rule is:
http://www.nepremicnine.net/nepremicnine.html?id=.+
The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
.+pg.+|.+id.+ for the first task and .+id.+ for the second task, but the latter returns many pages that are not my focus.
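One likely reason no documents come back: if the rules are full regular expressions, the first rule as written is broken as a pattern, because ? and . are regex metacharacters (html? makes the l optional and then expects the alternation immediately, so a real ?id=... URL never matches) and the spaces inside the alternation are matched literally. Escaping the metacharacters and removing the spaces makes it match. A sketch, with re.fullmatch standing in for the crawler's whole-URL matching:

import re

url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5744923"

# As written: "?" makes the preceding "l" optional, and the spaces around
# the "|" alternatives are literal characters, so real URLs never match.
broken = r"http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale&pg=.+ | id=.+)"
print(bool(re.fullmatch(broken, url)))  # False

# With "." and "?" escaped and the stray spaces removed, the ad URL matches.
fixed = r"http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&pg=.+|id=.+)"
print(bool(re.fullmatch(fixed, url)))   # True

Note also that the second rule points at www.nepremicnine.net while the crawled site is www.realestate-slovenia.info, so it can never match a URL from this crawl.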
I would really like this process to work because the gathered data are the basis for my article.
Thanks.