"Using Regex in the web crawler"

guitarslinger — Member, Posts: 12, Contributor II
edited June 2019 in Help
Hi there,

I am struggling with the setup of the crawlers in the web mining extension:

I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.

Can I find an example for crawling rules somewhere?

Thx in advance

GS

Answers

  • B_Miner — Member, Posts: 72, Maven
    Post what you are trying to do (the XML) and a description; maybe someone can help. I have used it successfully, but I am not sure what your aim is.
  • guitarslinger — Member, Posts: 12, Contributor II
    Hi B_Miner, good point:

    Here is the XML, with just the crawler connected to the main process and two rules:
    1. follow every link ".*"
    2. store every page ".*"

    [The posted process XML was stripped by the forum formatting; only fragments survive, such as the start URL http://www.aol.com.]
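
    For reference, a minimal sketch of what such a process XML typically looks like. The operator class name (web:crawl_web) and the parameter keys (url, crawling_rules, follow_link_with_matching_url, store_with_matching_url, max_pages, max_depth) are assumptions about the Web Mining extension's Crawl Web operator, not recovered from the original post:

        <?xml version="1.0" encoding="UTF-8"?>
        <process version="5.3.000">
          <operator activated="true" class="process" expanded="true" name="Process">
            <process expanded="true">
              <!-- Crawl Web operator from the Web Mining extension (class name assumed) -->
              <operator activated="true" class="web:crawl_web" expanded="true" name="Crawl Web">
                <!-- start URL for the crawl -->
                <parameter key="url" value="http://www.aol.com"/>
                <!-- crawling rules: follow every link, store every page -->
                <list key="crawling_rules">
                  <parameter key="follow_link_with_matching_url" value=".*"/>
                  <parameter key="store_with_matching_url" value=".*"/>
                </list>
                <!-- without a value here the operator crawls nothing (see below) -->
                <parameter key="max_pages" value="100"/>
                <parameter key="max_depth" value="2"/>
              </operator>
            </process>
          </operator>
        </process>

    The ".*" values are Java regular expressions that match any URL, i.e. the two rules "follow every link" and "store every page" described above.
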
  • guitarslinger — Member, Posts: 12, Contributor II
    Problem solved: I had no value in the parameter "max. pages".

    I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but without a value it does not crawl at all.
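
    In the process XML, that corresponds to giving the parameter an explicit value, e.g. (parameter key max_pages assumed, as in the sketch above):

        <parameter key="max_pages" value="1000"/>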

    Works now, I am happy!

    Greetings, GS
    ;D
  • land — RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn
    Well,
    it should be optional. ****. I will make sure it is optional in the future. :)
    Good thing you got it to work, though.

    Greetings,
    Sebastian