Crawling rules

Xannix Member, Posts: 21, Maven
edited June 2019 in Help
Hi,
I'm not sure whether I'm misunderstanding the method, but I don't know how to use the "store_with_matching_content" parameter.

I would like to store pages which contain one specific word (for example "euro"). I've tried to write:

a) Just the word: euro
b) A regular expression, for example: .*euro.*

What is the problem? Could someone explain this to me?

Thanks : )

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2531, Unicorn
    Hi,
    you have to enter a valid regular expression.
    Please post the process, so that I can take a look at your parameters.

    Greetings,
    Sebastian
  • colo Member, Posts: 236, Maven
    I tried to use this rule a few days ago without success. The other rules seem to work as expected, but there might be an issue with matching the regular expression for store_with_matching_content. I entered several expressions, and even .* didn't bring up any results. Does this problem come from my usage or from a little bug? ;)
  • Xannix Member, Posts: 21, Maven
    Hi colo,
    I have the same problem: all the other rules work fine, but not this one. Here is my example, crawling the Rapid-i site:

    http://rapid-i.com/index.php?lang=en"/>
    http://rapid-i\.com/.*"/>
    <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2531, Unicorn
    Hi,
    what exactly happens with this rule? Does the operator always return an empty set, or does it not finish at all?

    Greetings,
    Sebastian
  • colo Member, Posts: 236, Maven
    Hello Sebastian,

    it doesn't even result in an empty set. There simply are no results; after the process finishes, the prompt for switching to the results perspective shows up as usual, but there is only the empty result overview and nothing else...

    Regards,
    Matthias
  • haddock Member, Posts: 849, Maven
    Greets to all,

    Well, it is actually possible to get something from the web crawler - the code below makes word vectors from the recent posts in this forum - but if you want to mine more than a few pages, I'm not sure the WebSPHINX library is that robust; the last version was released in 2002. Furthermore, if I insert print statements in appropriate places and build the operators from scratch, I can see results that are, shall we say, intriguing. Anyway, here's the creepy crawler...

    http://rapid-i.com/rapidforum/index.php?action=recent"/>
    http://rapid-i.com/rapidforum.*"/>
    http://rapid-i.com/rapidforum.*"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>

    On the other hand, if I use a RetrievePagesOperator on the output of an RSS Feed operator, everything works fine.


    Toodles


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2531, Unicorn
    Hi,
    I switched the regular expression to DOTALL mode, so that . also matches line breaks. This solves the issue of the regular expression not matching the document, but matching takes far too long on a 120 kB web page. I think we will have to bury this option in its current incarnation.
    Any idea how to replace it, besides simply switching to string matching?
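The trade-off Sebastian describes can be illustrated outside RapidMiner. A minimal Python sketch (RapidMiner itself uses Java's regex engine, whose default and DOTALL modes behave the same way here; the page string is made up):

```python
import re

# A made-up multi-line page body.
page = "<html>\n<body>\nPrices are given in euro.\n</body>\n</html>"

# Default mode: '.' does not match '\n', so ".*euro.*" cannot match
# the whole multi-line page.
assert re.fullmatch(r".*euro.*", page) is None

# DOTALL mode lets '.' match line breaks as well, so the pattern matches,
# at the cost of scanning the entire page.
assert re.fullmatch(r".*euro.*", page, flags=re.DOTALL) is not None

# For a plain "contains this word" test, an unanchored search or a
# substring check is equivalent and avoids the leading/trailing '.*'.
assert re.search("euro", page) is not None
assert "euro" in page
```

The last two lines are one answer to the replacement question: when the rule is really just "page contains word", plain substring matching gives the same result without the full-page regex match.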

    Greetings,
    Sebastian

    PS:
    If anybody knows another powerful open-source web crawler that is usable from Java, I would gladly replace that "creepy" SPHINX.

  • haddock Member, Posts: 849, Maven
    Greets Seb,

    I'm cannibalising the SPHINX at the moment, working on tokens rather than strings, and using the header fields (description, keywords, etc.), which are regex-friendly and can be pre-fetched. I've also started looking at Heritrix. Something may emerge ;)

    Ciao
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2531, Unicorn
    Hi,
    thanks for the hint about Heritrix. This really seems worth the effort. Uhm, now I only need somebody to pay me for implementing it. Any volunteers? :)
    Does anybody have negative experience with this crawler? Otherwise I will add it to the feature request list.

    Greetings,
    Sebastian
  • Xannix Member, Posts: 21, Maven
    So... uhm... isn't it possible to crawl with the "store_with_matching_content" parameter?

    Currently, I do it this way:

    [1] Crawl Web ->
    [2] Generate Extract ->
    [3] Filter Examples

    [1]: I don't use "store_with_matching_content"
    [2]: I extract the text with XPath, because the "attribute_value_filter" parameter of the "Filter Examples" operator doesn't work if it finds any HTML tag. Is that normal or not?
    [3]: I keep only the examples whose content matches

    I know this works, but I don't think it is efficient...
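The three steps above can be sketched in plain Python (the pages are made up, and strip_tags is a hypothetical stand-in for the XPath extraction step):

```python
import re

# Made-up crawl results standing in for the Crawl Web output ([1]).
pages = [
    {"url": "http://example.com/a", "html": "<p>Prices in <b>euro</b></p>"},
    {"url": "http://example.com/b", "html": "<p>Prices in dollars</p>"},
]

def strip_tags(html):
    # Crude tag removal, standing in for the XPath text extraction ([2]).
    return re.sub(r"<[^>]+>", " ", html)

# Keep only pages whose extracted text contains the word ([3]).
kept = [p for p in pages if re.search("euro", strip_tags(p["html"]))]
assert [p["url"] for p in kept] == ["http://example.com/a"]
```

The sketch also shows why this is less efficient than filtering inside the crawler: every page is downloaded and stored first, and only discarded afterwards.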

    Any idea?

    Thanks : ))
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2531, Unicorn
    Hi,
    this depends on the regular expression used, but I guess you will have to switch to DOTALL mode, because there is normally a line break after the tag, and by default the . character does not match line breaks.

    Greetings,
    Sebastian
  • Xannix Member, Posts: 21, Maven
    Hi,
    where can I find the "dotall mode" option?

    Thanks
  • Xannix Member, Posts: 21, Maven
    Sorry, I realized I was wrong...

    I've been testing again: if you want to find the word "Euro" in the content, you can write:

    [\S\s]*Euro[\S\s]*

    It may be a little slow, but it works.

    Thanks for all : )
  • colo Member, Posts: 236, Maven
    Hello Xannix,

    if you want to use options/modifiers in your expressions, you can enable them with (?x) at the start of your regex, where "x" specifies which option to use; for the "dotall" option this is "s". I think it's an easy and clean way to set all options at the beginning of your regex. For your "Euro" search it would read as follows:

    (?s).*Euro.*
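The two workarounds from this thread, the (?s) inline flag and the [\S\s] character class, can be compared directly. A Python sketch with a made-up multi-line string (Java's regex engine, which RapidMiner uses, supports the (?s) inline flag the same way):

```python
import re

text = "first line\nthe second line mentions Euro\nthird line"

# (?s) switches on DOTALL for the whole expression, so '.' matches '\n'.
assert re.fullmatch(r"(?s).*Euro.*", text) is not None

# Without the flag, the same pattern fails on multi-line input.
assert re.fullmatch(r".*Euro.*", text) is None

# The [\S\s] trick from the earlier post is an equivalent workaround:
# the class matches every character, including line breaks.
assert re.fullmatch(r"[\S\s]*Euro[\S\s]*", text) is not None
```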
  • Xannix Member, Posts: 21, Maven
    Hi colo, thanks, I'll try it : )