"issues with web crawling [UPDATED]"
UPDATE :
It seems that the scenario below is common to any website that uses query parameters in the URL.
In other words, any URL that looks like 'http://domain.com/some_blabla?param1=something&param2=somethingelse' is not crawled.
Some examples :
https://www.worten.pt/inicio/imagem-e-som/tv.html -> no problem
https://www.worten.pt/inicio/imagem-e-som/tv.html?p=3 -> not crawled when using the above page as starting point, works fine if entered directly
these are my crawling rules, pretty basic :
same problem with this site : https://www.worten.pt/inicio/imagem-e-som/tv.html"/>
<parameter key="follow_link_with_matching_url" value=".*imagem-e-som/tv.html.*"/>
http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4 -> no problem
http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4?PageIndex=3#3 -> not crawled from the above page, ok if entered directly
http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4"/>
http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*"/>
<parameter key="follow_link_with_matching_url" value="http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*"/>
The same goes for the full example below :
https://www.otto.de/multimedia/fernseher/led-fernseher/ -> no problem
https://www.otto.de/multimedia/fernseher/led-fernseher/?p=2&ps=30 -> not crawled from the above link, ok if entered directly
I'm using similar logic on different sites that all work fine, but as soon as a question mark appears in the URL the logic breaks.
Is this a bug, or am I overlooking something? Is the same issue seen with version 6?
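To rule out the patterns themselves, the three rules quoted above can be checked against the three paginated URLs outside the crawler. The following is only a minimal sketch: it assumes the crawler requires the pattern to match the whole URL (the post does not say how matching is done), and `rule_matches`, `RULES` and `URLS` are names made up for this illustration.

```python
import re

# follow_link_with_matching_url values quoted in the post.
RULES = {
    "worten": r".*imagem-e-som/tv.html.*",
    "fnac": r"http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*",
    "otto": r".*/led-fernseher/.*",
}

# Paginated URLs from the post that are not followed by the crawler.
URLS = {
    "worten": "https://www.worten.pt/inicio/imagem-e-som/tv.html?p=3",
    "fnac": "http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4?PageIndex=3#3",
    "otto": "https://www.otto.de/multimedia/fernseher/led-fernseher/?p=2&ps=30",
}


def rule_matches(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the entire URL.

    Whole-string matching is an assumption about the crawler's behaviour.
    """
    return re.fullmatch(pattern, url) is not None


# Check every rule against the paginated URL of the same site.
results = {site: rule_matches(RULES[site], URLS[site]) for site in RULES}
for site, ok in results.items():
    print(f"{site}: {'matches' if ok else 'does NOT match'}")
```

If every rule matches here, the question mark in the URL is not tripping up the regular expressions, and the problem more likely lies in how the crawler extracts or normalises the links before the rules are applied.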
[ORIGINAL QUESTION]
I'm creating a process to compare prices from different retailers. For most of them it works fine, but some are really driving me nuts when it comes to following links.
This is an example :
I can only get one page, namely the base URL https://www.otto.de/multimedia/fernseher/led-fernseher/
<macros/> https://www.otto.de/multimedia/fernseher/led-fernseher/"/>
<parameter key="follow_link_with_matching_url" value=".*/led-fernseher/.*"/>
The page does contain links of the following form, though: https://www.otto.de/multimedia/fernseher/led-fernseher/?p=2&ps=30 but I can't get them crawled. I've tried several regex patterns; all of them match when tested, but the page still does not get crawled. Any idea what the problem could be?
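One thing that may be worth ruling out is how the pagination link is written in the page source. This is an assumption, not something confirmed here: in HTML, an ampersand inside an `href` is normally written as `&amp;`, and pagination links are often relative (query string only). The sketch below shows how such a raw attribute value would have to be decoded and resolved before it equals the URL that works when entered directly. The raw `href` value and the helper `resolve_href` are hypothetical.

```python
from html import unescape
from urllib.parse import urljoin, urlsplit

# Base page from the post.
BASE = "https://www.otto.de/multimedia/fernseher/led-fernseher/"


def resolve_href(base: str, raw_href: str) -> str:
    """Decode HTML entities in a raw href and resolve it against the base URL."""
    return urljoin(base, unescape(raw_href))


# Hypothetical raw attribute value as it might appear in the page source.
raw_relative = "?p=2&amp;ps=30"

resolved = resolve_href(BASE, raw_relative)
print(resolved)
print(urlsplit(resolved).query)
```

If the crawler applies the rule to the raw, undecoded or unresolved link text, a pattern that starts with a path or host would not match even though it matches the final URL.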
Thanks in advance !