"WEB crawler rules"
Hi!
I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL but I'm completely new to RapidMiner.
So I've started to use the Web Crawler operator.
I'm using it to crawl a Slovenian real estate web page and I'm having trouble setting the Web crawler rules.
I know that there are two important rules: which URLs to follow and which to store.
I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html?id=<something>.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280
What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+
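As far as I can tell, the crawler treats each rule as a Java regular expression that has to match the entire URL (String.matches() semantics), not just find a substring, so a rule can be tried outside RapidMiner first. Here is a minimal sketch in Python, whose re.fullmatch mimics that behaviour (the rule and URL are the ones above):

import re

follow_rule = r".+pg.+|.+id.+"
url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"

# The pattern must cover the whole URL, as with Java's String.matches();
# a bare substring pattern such as r"id=" would not match here.
print(bool(re.fullmatch(follow_rule, url)))  # True
print(bool(re.fullmatch(r"id=", url)))       # False: matches only part of the URL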
Any help would be appreciated!
U.
Answers
On a quick check I got some results with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+
And I also increased the max page size to 10000.
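If you want to sanity-check a rule outside RapidMiner, the rules should behave like Java regular expressions matched against the whole URL. A quick sketch in Python, assuming whole-URL matching (re.fullmatch stands in for the crawler's own matching, which I haven't inspected):

import re

rule = r".+id.+"
ad_url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"

# The ad URL contains "id", so the store rule accepts it.
print(bool(re.fullmatch(rule, ad_url)))  # True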
As always I have to ask this: did you check that the site policy/copyright note allows you to machine-crawl that page?
Best regards,
Marius
The web page allows robots.
Your example stores only the real estate ads on the first page; the Web crawler doesn't go on to the second, third, ... page.
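That behaviour makes sense if you test the .+id.+ rule against a pagination link: a second-page URL looks like .../nepremicnine.html?q=sale&pg=2, which contains pg= but no "id" anywhere, so it never matches the follow rule and the crawler never leaves the first page. A small sketch, again assuming whole-URL matching:

import re

rule = r".+id.+"
ad_url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280"
page2_url = "http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=2"

print(bool(re.fullmatch(rule, ad_url)))     # True:  contains "id"
print(bool(re.fullmatch(rule, page2_url)))  # False: no "id", link is not followed

This is presumably why a combined follow rule such as .+pg.+|.+id.+ is needed, while the store rule can stay as .+id.+.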
Thanks for helping.
I put the Web crawler problem aside for a while. Today I started to deal with it again. I still have a problem with the crawling rules; all the other Web crawler attributes are clear.
This is my Web crawler process:
[process XML not shown]
As you can see, I try to follow three types of URL, for example:
http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923
and I want to store one type of URL:
http://www.realestate-slovenia.info/nepremicnine.html?id=5469846
So for the first task my rule is:
http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale&pg=.+ | id=.+)
For the second task the rule is:
http://www.nepremicnine.net/nepremicnine.html?id=.+
The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
.+pg.+|.+id.+ for the first task and .+id.+ for the second task, but the latter returns many pages that are not my focus.
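One likely reason no documents come back: if the rules are full regular expressions, the first rule as written is broken as a pattern, because ? and . are regex metacharacters (html? makes the l optional and then expects the alternation immediately, so a real ?id=... URL never matches) and the spaces inside the alternation are matched literally. Escaping the metacharacters and removing the spaces makes it match. A sketch, with re.fullmatch standing in for the crawler's whole-URL matching:

import re

url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5744923"

# As written: "?" makes the preceding "l" optional, and the spaces around
# the "|" alternatives are literal characters, so real URLs never match.
broken = r"http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale&pg=.+ | id=.+)"
print(bool(re.fullmatch(broken, url)))  # False

# With "." and "?" escaped and the stray spaces removed, the ad URL matches.
fixed = r"http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&pg=.+|id=.+)"
print(bool(re.fullmatch(fixed, url)))   # True

Note also that the second rule points at www.nepremicnine.net while the crawled site is www.realestate-slovenia.info, so it can never match a URL from this crawl.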
I would really like this process to work because the gathered data are the basis for my article.
Thanks.