"Complex Web Crawling Process With Sessions"
Hi there,
I'm trying to crawl job postings from this site: http://jobboerse.arbeitsagentur.de/vamJB/startseite.html
I have to get the apprenticeship postings from a certain region. Unfortunately, the whole site relies on sessions. So far I'm using the Get Pages operator to get a list of all apprenticeship postings from a certain region, which is spread over 7 pages with about 50 postings per page.
At the moment I'm trying to get all the posting links from each of the 7 pages and request each posting detail page. My guess is to use the Extract Information operator to get the links, but I'm still trying to figure out the correct XPath queries to get the 50 posting detail links. I already get the first posting detail link but need some kind of iteration or enumeration for the rest. Any ideas?
Also, this process is going to be very complex. Any hint on how to make it simpler is welcome. The problem is that I can only request one URL per Get Pages operator to keep the session.
Thanks in advance
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
<operator activated="true" class="read_csv" compatibility="5.3.005" expanded="true" height="60" name="Read CSV" width="90" x="380" y="165">
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (2)" width="90" x="514" y="165">
<operator activated="true" class="select_attributes" compatibility="5.3.005" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="255">
<operator activated="true" class="rename" compatibility="5.3.005" expanded="true" height="76" name="Rename" width="90" x="715" y="255">
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes" width="90" x="849" y="210">
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="112" name="Multiply" width="90" x="983" y="210"/>
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes (3)" width="90" x="1184" y="390">
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="1184" y="255">
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (5)" width="90" x="1385" y="390">
<operator activated="true" class="text:extract_document" compatibility="5.3.000" expanded="true" height="76" name="Extract Document" width="90" x="1519" y="390">
<parameter key="attribute_name" value="gensym2"/>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="1653" y="390">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information (2)" width="90" x="313" y="30">
@title='Zu den Details des Stellenangebots']"/>
@title='Zu den Details des Stellenangebots']"/>
@title='Zu den Details des Stellenangebots' and 4]"/>
<parameter key="o5" value="/*[name()='html']/*[name()='body']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='form']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='table']/*[name()='tbody']/*[name()='tr']/*[name()='td']/*[name()='div']/[2][name()='a' and @title='Zu den Details des Stellenangebots']"/>
@title='Zu den Details des Stellenangebots']"/>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (4)" width="90" x="1318" y="255">
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (3)" width="90" x="1117" y="120">
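For comparison, here is a rough sketch outside RapidMiner of the "get all 50 links" step. The XPath parameters above walk a fixed path to a single position, so each one can match only one link. A relative query that filters on the link title alone would match every detail link on a page. The sample markup and the href values below are made up for illustration and are not the real page source. The real page may also use an XML namespace, in which case the tag name would have to be qualified.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one result page (not the real markup).
SAMPLE_PAGE = """<html><body><table><tbody>
<tr><td><div><a title="Zu den Details des Stellenangebots" href="/vamJB/detail?id=1">Job 1</a></div></td></tr>
<tr><td><div><a title="Zu den Details des Stellenangebots" href="/vamJB/detail?id=2">Job 2</a></div></td></tr>
<tr><td><div><a title="Other" href="/vamJB/help">Help</a></div></td></tr>
</tbody></table></body></html>"""

# Title text taken from the XPath queries in the process above.
DETAIL_TITLE = "Zu den Details des Stellenangebots"


def extract_detail_links(page_source):
    """Return the href of every anchor whose title marks a posting detail link."""
    root = ET.fromstring(page_source)
    # ".//a[@title='...']" matches all such anchors at any depth,
    # instead of one anchor at one fixed position in the tree.
    anchors = root.findall(f".//a[@title='{DETAIL_TITLE}']")
    return [a.get("href") for a in anchors if a.get("href")]


if __name__ == "__main__":
    for link in extract_detail_links(SAMPLE_PAGE):
        print(link)
```

Matching on the title attribute rather than on the position in the table means the number of postings per page does not have to be known in advance.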
Answers
Now all operators should use the same cookies, including the session cookies.
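To illustrate the idea of sharing cookies across requests outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not the RapidMiner setting itself. The helper names `make_session_opener` and `fetch_all` are made up for this example, and the 30-second timeout is an arbitrary assumption.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_session_opener():
    """Build an opener that stores and resends cookies via one shared jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_all(opener, urls):
    """Fetch each URL through the same opener so the session cookie is reused."""
    pages = []
    for url in urls:
        # Cookies set by earlier responses are sent along automatically.
        with opener.open(url, timeout=30) as response:
            pages.append(response.read())
    return pages


if __name__ == "__main__":
    opener, jar = make_session_opener()
    start = "http://jobboerse.arbeitsagentur.de/vamJB/startseite.html"
    fetch_all(opener, [start])
    print(len(jar), "cookies stored")
```

Because every request goes through the same opener, the result pages and the detail pages can be fetched in one loop without losing the session.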