"Crawling rules"
Hi,
I'm not sure whether I've misunderstood the method, but I don't know how to use the "store_with_matching_content" parameter.
I would like to store pages which contain one specific word (for example "euro"). I've tried to write:
a) Just the word: euro
b) A regular expression, for example: .*euro.*
What is the problem? Could someone explain this to me?
Thanks : )
Answers
You have to enter a valid regular expression.
Please post the process, so that I can take a look at your parameters.
Greetings,
Sebastian
I have the same problem: all the other rules work fine, but not this one. Here is my example, crawling the Rapid-i website:
What exactly happens with this rule? Does the operator always return an empty set, or does it not finish at all?
Greetings,
Sebastian
It doesn't even result in an empty set. There simply are no results: after the process finishes, the prompt for switching to the results perspective shows up as usual, but there is only the empty result overview and nothing else...
Regards,
Matthias
Well, it is actually possible to get something from the webcrawler - the code below makes word vectors of the recent posts in this forum - but if you want to mine more than a few pages I'm not sure the websphinx library is that robust. The last version was released in 2002. Furthermore if I insert print statements in appropriate places and build the operators from scratch I can see results that are, shall we say, intriguing. Anyways, here's the creepy crawler... Par contre, if I use a RetrievePagesOperator on the output from an RSS Feed operator all works fine.
Toodles
I switched the regular expression to dotall mode, so that . also matches line breaks. This solves the issue that the regular expression doesn't match the document, but it takes far too long to process the regular expression for a website of 120 KB. I think we will have to bury this option in its current incarnation.
Any idea how to replace it, beside simply switching to string matching anyway?
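
To illustrate the difference between the default mode, dotall mode and plain string matching, here is a minimal sketch. It uses Python's re module as a stand-in for the Java regex engine, and it assumes (this is not confirmed in the thread) that the crawler requires the expression to match the whole page. The page content and the helper matches_whole_page are made up for the example.

```python
import re

# Hypothetical page content: the word sits on its own line,
# as it typically would in multi-line HTML.
page = "<html>\n<body>Price: 10 euro</body>\n</html>"


def matches_whole_page(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern matches the entire text.

    Whole-text matching is an assumption about how the crawler
    applies the rule, not documented behaviour.
    """
    return re.fullmatch(pattern, text, flags) is not None


# Default mode: "." stops at line breaks, so the whole page cannot match.
default_mode = matches_whole_page(r".*euro.*", page)

# Dotall mode: "." also matches line breaks.
dotall_mode = matches_whole_page(r".*euro.*", page, re.DOTALL)

# Plain string matching: no regular expression involved at all.
substring = "euro" in page

print(default_mode, dotall_mode, substring)
```

For a single fixed word, the substring test gives the same answer as the dotall expression without any backtracking, which is why it is the obvious fallback.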
Greetings,
Sebastian
PS:
If anybody knows another powerful open-source web crawler that's usable from Java: I would gladly replace that "creepy" sphinx.
I'm cannibalising the sphinx at the moment, and working on tokens rather than strings, as well as using the header fields, description, keywords, etc., which are regex friendly, and can be pre-fetched. I've also started looking at Heritrix. Something may emerge.
Ciao
Thanks for the hint on Heritrix. This really seems worth the effort. Uhm, now I only need somebody to pay me for implementing this. Any volunteers?
Does anybody have negative experience with this crawler? Otherwise I will add it to the feature request list.
Greetings,
Sebastian
Actually, I do it this way:
[1] Crawl Web ->
[2] Generate Extract ->
[3] Filter Examples
[1]: I don't use "store_with_matching_content"
[2]: I extract the text with XPath, because the parameter "attribute_value_filter" of the "Filter Examples" operator doesn't work if it finds any HTML tag. Is that normal or not?
[3]: I select only the examples whose content matches
I know that works fine, but I don't think it is efficient...
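
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not RapidMiner code: the pages dictionary stands in for the crawler output, and TextExtractor, extract_text and filter_pages are hypothetical helpers. It collects the text nodes of each page, collapses whitespace, and then filters with an ordinary regular expression, so no dotall mode is needed.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html: str) -> str:
    """Step [2]: strip the tags and collapse all whitespace to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def filter_pages(pages: dict, word: str) -> list:
    """Step [3]: keep the URLs whose extracted text contains the word."""
    pattern = re.compile(r".*" + re.escape(word) + r".*")
    return [url for url, html in pages.items()
            if pattern.fullmatch(extract_text(html))]


# Step [1] stand-in: hypothetical crawl result, URL -> raw HTML.
pages = {
    "http://example.com/a": "<html>\n<body><p>Price in\neuro</p></body>\n</html>",
    "http://example.com/b": "<html>\n<body><p>No match</p></body></html>",
}

print(filter_pages(pages, "euro"))
```

Because the extracted text no longer contains tags or line breaks, the simple .*euro.* form matches here; the cost is that every page is stored and parsed before it can be discarded.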
Any idea?
Thanks : ))
This depends on the regular expression used, but I guess you will have to switch to dotall mode, because normally there's a line break after the tag and by default the . character does not match line breaks.
Greetings,
Sebastian
Where can I find the "dotall mode" option?
Thanks
I've been testing again. If you want to find the word "Euro" in the content, you can write:
[\S\s]*Euro[\S\s]*
It may be a little slow, but it works.
Thanks for all : )
If you want to use options (flags) in your expressions, you can simply set them with (?x) in your regex. The "x" specifies which option to use; for the "dotall" option this would be "s". I think it's an easy and clean way to set all options at the beginning of your regex. For your "Euro" search it would read as follows:
(?s).*Euro.*
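
A quick way to compare the variants from this thread is the sketch below. It uses Python's re module as a stand-in; the inline flags (?s) and (?i) exist in both Python and Java regular expressions, but the page text is invented and whole-page matching is an assumption about how the rule is applied.

```python
import re

# Hypothetical page content with line breaks around the word.
page = "<p>Total:\n10 Euro</p>\n"

# Inline dotall flag, as suggested above.
inline_dotall = re.fullmatch(r"(?s).*Euro.*", page) is not None

# Character-class workaround: [\S\s] matches any character, line breaks included.
char_class = re.fullmatch(r"[\S\s]*Euro[\S\s]*", page) is not None

# No flag: "." stops at the line break, so the whole page does not match.
no_flag = re.fullmatch(r".*Euro.*", page) is not None

# Several options can be combined, e.g. dotall plus case-insensitive.
combined = re.fullmatch(r"(?si).*euro.*", page) is not None

print(inline_dotall, char_class, no_flag, combined)
```

The first two expressions accept the same pages; the inline flag is simply the shorter way to write it.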