[SOLVED] Crawl Web and generate reporting

pemguinkpl Member Posts: 14 Contributor II
edited October 2019 in Help
Hi,

I have tried the Crawl Web process, but the result showed that no documents were crawled. May I know what the problem is?
I followed the steps from the video below exactly, but still ran into the problem.

http://www.youtube.com/watch?v=zMyrw0HsREg

Any help please... :-\

How do I use the Generate Report and Report operators in RapidMiner?
Does anyone know?

Thank You!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    I didn't watch the video and don't have the time to. Could you please post your process and describe more specifically what you are trying to do?


    Best regards,
    Marius
  • pemguinkpl Member Posts: 14 Contributor II
    Hi Marius, thanks for replying,

    My initial research is to analyze H1N1 news, using a crawler to get all the news about H1N1. This is the link I am trying to crawl:

    http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00%2B08:00&max-results=20

    But I cannot get any documents.

    This is my process XML:

    http://my-h1n1.blogspot.com/search/label/news?updated-max=2009-07-26T02:03:00+08:00&max-results=20"/>

    May I know what the problem is? Thanks =)

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    The problem is that the page you want to crawl does not allow crawling, and of course RapidMiner obeys this exclusion by default. The crawl operator has two options to ignore the so-called robot exclusion, but as it says in the documentation, you are usually not allowed to disable it for pages which are not your own. These are the parameters:

    obey robot exclusion: Specifies whether the crawler obeys the rules, which pages on site might be visited by a robot. Disable only if you know what you are doing and if you are sure not to violate any existing laws by doing so. Range: boolean; default: true
    really ignore exclusion: Do you really want to ignore the robot exclusion? This might be illegal. Range: boolean; default: false
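
    To check up front whether a site's robot exclusion rules permit crawling a given URL, a minimal sketch using Python's standard urllib.robotparser module could look like the following. The robots.txt content, the user agent name and the helper function is_crawl_allowed are assumptions made up for illustration; they are not taken from the actual site or from RapidMiner.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (an assumption for this sketch);
# the real file served by a site may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example-crawler" is a made-up user agent name.
    url = "http://my-h1n1.blogspot.com/search/label/news"
    print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, "example-crawler", url))
```

    With rules like the sample above, any path starting with /search is excluded, so a crawler that obeys robot exclusion would return no documents for such a URL.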

    Best,
    Marius
  • pemguinkpl Member Posts: 14 Contributor II
    Hi Marius,

    thank you for the reply, it solved my problem ;)
