"Cut Document II Crawling"

Flake (Member, Posts: 13, Contributor II)
edited June 2019 in Help
Hi there, I noticed there is another post about cutting documents, raised by Roberto and answered by Matthias.

However, the first part of my problem is a little different from that post, and I believe it is an even easier one for people who know how to solve it.

Questions:

1. I will retrieve a web page, e.g. Google's Terms of Service page. I want to put each paragraph in its own row in the output Excel file. I am not familiar with regular expressions, so please help me here.

2. Does RM support crawling the Internet, say, collecting the hundreds of pages returned by a search for the keyword "Terms of Service"?

Thanks in advance.

Answers

  • colo (Member, Posts: 236, Maven)
    Hi Flake,

    let's see if I can answer the second cut document topic as well;)

    If you want to get each paragraph (or some other HTML element) out of a website, I would probably prefer using XPath rather than writing regular expressions. The expression //h:p will find every paragraph at any depth (h is the default namespace for HTML elements):







    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
    <parameter key="url" value="http://microsoft.com"/>
    <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
    <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">
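
To see what the //h:p expression selects, it can be tried outside RapidMiner. The following is only a minimal Python sketch, not the Cut Document operator itself. It assumes the page is well-formed XHTML (real pages often are not and need cleaning first), that the prefix h stands for the XHTML namespace, and it uses a made-up sample page. The standard library supports only a subset of XPath, so the query is written as .//h:p relative to the root element.

```python
import xml.etree.ElementTree as ET

# Assumption: "h" is bound to the XHTML namespace, mirroring the h: prefix above.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Made-up sample page; a real page would be downloaded and cleaned first.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Terms</h1>
    <p>First paragraph.</p>
    <div><p>Nested <b>second</b> paragraph.</p></div>
  </body>
</html>"""


def paragraphs(xhtml):
    """Return the text of every <p> element at any depth, in document order."""
    root = ET.fromstring(xhtml)
    return [
        "".join(p.itertext()).strip()  # itertext() also collects text of child tags
        for p in root.findall(".//h:p", XHTML_NS)
    ]


rows = paragraphs(PAGE)
for number, text in enumerate(rows, start=1):
    print(number, text)
```

Each entry of rows corresponds to one paragraph, which is the "one paragraph per row" layout asked for in the question.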











    RapidMiner provides the "Crawl Web" operator for crawling, but it is very slow when checking for keywords within the document content. Alternative crawlers (e.g. HTTrack, Heritrix) will perhaps perform much better. Maybe someday an advanced crawler will replace the current implementation. There are one or two older topics with discussions about this.
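
For readers unfamiliar with what a keyword-filtering crawl involves, here is a minimal Python sketch. It is not how "Crawl Web", HTTrack, or Heritrix are implemented. The crawl function, its fetch parameter, and the example.test pages are all hypothetical; fetch is supplied by the caller so the sketch runs offline, and a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, keyword, max_pages=100):
    """Breadth-first crawl; return URLs whose content contains keyword.

    fetch is a caller-supplied function mapping a URL to an HTML string,
    or to None when the page cannot be retrieved (hypothetical interface).
    """
    seen = {start_url}
    queue = deque([start_url])
    hits = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        if html is None:
            continue
        # Case-insensitive keyword check on the raw page content.
        if keyword.lower() in html.lower():
            hits.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return hits


# Offline stand-in for the network: a tiny made-up site.
PAGES = {
    "http://example.test/": '<a href="/tos">ToS</a> <a href="/about">About</a>',
    "http://example.test/tos": "<p>Terms of Service apply.</p>",
    "http://example.test/about": '<p>About us.</p> <a href="/">Home</a>',
}

found = crawl("http://example.test/", PAGES.get, "terms of service")
print(found)
```

Every page has to be downloaded and scanned before its links can be followed, which is one reason content-based keyword checks make a crawl slow.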

    Regards
    Matthias

    P.S. Please consider posting questions like this in the "Problems and Support Forum". In my opinion the forum's description is closer to many of the topics created here.
  • Flake (Member, Posts: 13, Contributor II)
    Dear Matthias,

    Many thanks for your help! It works for my purpose with a few simple tweaks. :)

    Below is my process. What I added are the steps that remove the HTML tags and extract only the text. However, my solution generates several empty rows, so I had to add a Remove Duplicates operator to get rid of them.

    Because I am still learning to use RM, I doubt I did it in the best way.

    If you are interested, could you give some suggestions on how to improve it?







    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <operator activated="true" class="web:get_webpage" compatibility="5.1.003" expanded="true" height="60" name="Get Page" width="90" x="45" y="120">
    <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
    <operator activated="true" class="text:cut_document" compatibility="5.1.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="120">
    <operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="514" y="120">
    <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.003" expanded="true" height="60" name="Extract Content" width="90" x="447" y="30"/>
    <operator activated="true" class="remove_duplicates" compatibility="5.1.011" expanded="true" height="76" name="Remove Duplicates" width="90" x="648" y="120">
    <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="782" y="120"/>
    <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="983" y="210">














  • colo (Member, Posts: 236, Maven)
    Hi Flake,

    this looks good to me. I would probably prefer "Filter Examples" to get rid of the empty rows instead of using "Remove Duplicates", but this isn't really important.







    <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
    <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
    <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
    <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
    <process expanded="true" height="589" width="567">
    <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
    <operator activated="true" class="filter_examples" compatibility="5.1.011" expanded="true" height="76" name="Filter Examples" width="90" x="447" y="120">
    <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="581" y="30"/>
    <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="715" y="30">
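
The practical difference between removing duplicates and filtering out empty rows can be shown with a small Python analogy. This is only a sketch with made-up rows and hypothetical helper functions; it does not reproduce the behaviour of the RapidMiner operators in every detail.

```python
# Made-up extraction result: some rows are empty or contain only whitespace,
# and one real paragraph happens to occur twice.
rows = ["Copyright notice.", "", "All rights reserved.", "", "Copyright notice.", "   "]


def remove_duplicates(values):
    """Keep the first occurrence of each value, in order."""
    seen = set()
    kept = []
    for value in values:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


def filter_non_empty(values):
    """Keep only values that contain something other than whitespace."""
    return [value for value in values if value.strip()]


deduped = remove_duplicates(rows)
filtered = filter_non_empty(rows)

print(deduped)   # one empty row and the whitespace-only row survive
print(filtered)  # no empty rows; the repeated paragraph is kept
```

Deduplication leaves one empty row behind and silently drops a paragraph that legitimately appears twice, while a filter on emptiness removes exactly the unwanted rows.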














    Since you are using more than one cut expression for the "Cut Document" operator, you may perhaps want to know where an example came from. If you are interested in this, you can activate "add meta data" for "Process Documents" and identify the source by looking at the attribute query_key (lots of the other attributes can be filtered out by using "Select Attributes"). If you don't need this information you're already fine.
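
The idea of keeping the source query next to each extracted piece can be sketched in Python as follows. This is an illustration under assumptions, not the operator's implementation: the page is made up and assumed to be well-formed XHTML, the query names and the cut function are hypothetical, and only the column name query_key is taken from the description above.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical named queries, standing in for several cut expressions.
QUERIES = {"heading": ".//h:h2", "paragraph": ".//h:p"}

# Made-up sample page with one empty paragraph.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h2>Copyright</h2>
<p>All rights reserved.</p>
<p></p>
</body></html>"""


def cut(xhtml, queries):
    """Return one row per match, tagged with the name of the matching query."""
    root = ET.fromstring(xhtml)
    rows = []
    for key, path in queries.items():
        for node in root.findall(path, NS):
            text = "".join(node.itertext()).strip()
            rows.append({"query_key": key, "text": text})
    return rows


rows = cut(PAGE, QUERIES)
non_empty = [row for row in rows if row["text"]]  # drop empty rows

for row in non_empty:
    print(row["query_key"], "|", row["text"])
```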

    You have some options for changing the operator chaining a bit (e.g. putting the HTML removal inside "Cut Document", or putting "Cut Document" inside "Process Documents"), but this doesn't really change anything. If I had created such a process, it would probably look the same.

    Regards
    Matthias
