Read Document Error & Skipping Over Errors

carl, Member, Posts: 30, Guru
edited November 2018 in Help

I get the following error from the Read Document operator (inside Loop Examples, after Read Excel supplies the input URLs). The process stops after successfully reading several hundred records. I have a log that tells me where the process stops, but I do not see anything obviously wrong with the input URL.

Any thoughts on the possible cause? And is there a way to skip past any troublesome input URLs rather than stopping the process with no output?

Error.jpg







<operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">

<operator activated="true" class="read_excel" compatibility="7.3.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">

<operator activated="true" class="loop_examples" compatibility="7.3.000" expanded="true" height="103" name="Loop Examples" width="90" x="179" y="34">

<operator activated="true" class="extract_macro" compatibility="7.3.000" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="136">

<operator activated="true" class="log" compatibility="7.3.000" expanded="true" height="82" name="Log" width="90" x="179" y="136">

<operator activated="true" class="open_file" compatibility="7.3.000" expanded="true" height="68" name="Open File" width="90" x="246" y="34">

<parameter key="url" value="%{GetURL}"/>

<operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="34">

<operator activated="true" class="text:extract_information" compatibility="7.3.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="34">

<operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="648" y="34">

<operator activated="true" class="generate_attributes" compatibility="7.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="782" y="34">

<connect from_op="Read Document" from_port="output" to_op="Extract Information" to_port="document"/>

<operator activated="true" class="append" compatibility="7.3.000" expanded="true" height="82" name="Append" width="90" x="313" y="34"/>
<operator activated="true" class="select_attributes" compatibility="7.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">

<operator activated="true" class="filter_examples" compatibility="7.3.000" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="34">

<operator activated="true" class="order_attributes" compatibility="7.3.000" expanded="true" height="82" name="Reorder Attributes" width="90" x="715" y="34">

<operator activated="true" class="write_excel" compatibility="7.3.000" expanded="true" height="82" name="Write Excel" width="90" x="849" y="34">

Best Answer

  • sgenzer, Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959, Community Manager
    Solution Accepted

    Hi, OK, I've looked at your process. Some thoughts:

    - Do all the URLs you're using point to PDF files? Your Read Document operator is only looking for PDFs.

    - I tend not to use the Open File operator to get a web page. I prefer to use the "Get Page" operator in the Web Mining extension. There's a lot more functionality there.

    - That yellow text warning is what you want. It's telling you that Handle Exception is skipping over the operator "Read Document" when it cannot do it. If it were me, I would put both the Open File and the Read Document in the "Try" section.

    - That red text warning is telling you that whatever gets through Handle Exception and is passed on to Extract Information is not always a document, hence the error (Extract Information requires a document).

    So if it were me, I would try the following:

    - Place ALL the operators inside the Loop Examples inside the Handle Exception. This way it skips over any problems it has along the way, and only passes complete successes to the output.

    - Rebuild the URL grab using Get Page rather than Open File.

    Scott
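
The "wrap everything in Handle Exception" advice above follows the same pattern as a try/except around the whole per-item pipeline in a general-purpose language. The sketch below is a Python analogy only, not RapidMiner code: `process_urls` and `demo_pipeline` are hypothetical names, and `demo_pipeline` is a stand-in that simply fails on anything not ending in `.pdf`. It illustrates why putting every step inside the protected block means only complete successes reach the output.

```python
import logging
from typing import Callable, Iterable, List, Tuple


def process_urls(
    urls: Iterable[str],
    pipeline: Callable[[str], dict],
) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Run the whole per-URL pipeline inside one try block.

    Returns (results, failures). A URL that raises at any step is logged
    and skipped, so only complete successes reach the results list.
    """
    results: List[dict] = []
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results.append(pipeline(url))
        except Exception as exc:  # broad on purpose: any failure skips the URL
            logging.warning("Skipping %s: %s", url, exc)
            failures.append((url, str(exc)))
    return results, failures


def demo_pipeline(url: str) -> dict:
    """Hypothetical stand-in for the fetch, read, and extract steps."""
    if not url.lower().endswith(".pdf"):
        raise IOError("Could not read file: " + url)
    return {"url": url, "length": len(url)}


results, failures = process_urls(
    [
        "http://example.com/a.pdf",
        "http://example.com/b.odt",
        "http://example.com/c.pdf",
    ],
    demo_pipeline,
)
```

Because the failing item never produces a partial object, nothing of the wrong type is handed to the next stage, which is the situation the red "Wrong input of type" warning describes.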

Answers

  • sgenzer, Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959, Community Manager

    For skipping over errors, I would recommend the "Handle Exception" operator. It's very handy.

    Scott

  • carl, Member, Posts: 30, Guru

    Thank you Scott. It feels like this approach should get me there. I can't quite implement it correctly though.

    I've copied the approach in the tutorial example for the operator. When I run it, Handle Exception cycles through the good URLs and logs the bad one, but then the Extract Information operator (following Handle Exception) gives me this error:

    Dec 14, 2016 7:25:45 PM WARNING: Error occurred and will be neglected by Handle Exception: Could not read file 'InputFileObject': java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded.
    Dec 14, 2016 7:25:45 PM SEVERE: Process failed: Wrong input of type 'File' at port 'document'. Expected type 'Document'.

    I tried Create Document (after the log) on the catch side of Handle Exception, but that just moves the problem to the Append operator. I don't really need to do any more than log the error and then proceed with the good URLs, but I can't quite find a formulation that gets me there. Could you point me in the right direction?








    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">

    <operator activated="true" class="read_excel" compatibility="7.3.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">

    <operator activated="true" class="loop_examples" compatibility="7.3.001" expanded="true" height="103" name="Loop Examples" width="90" x="179" y="34">

    <operator activated="true" class="extract_macro" compatibility="7.3.001" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="136">

    <operator activated="true" class="log" compatibility="7.3.001" expanded="true" height="82" name="Log" width="90" x="179" y="136">

    <operator activated="true" class="open_file" compatibility="7.3.001" expanded="true" height="68" name="Open File" width="90" x="246" y="34">

    <parameter key="url" value="%{GetURL}"/>

    <operator activated="true" class="handle_exception" compatibility="7.3.001" expanded="true" height="82" name="Handle Exception" width="90" x="380" y="34">

    <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="179" y="34">

    <connect from_op="Read Document" from_port="output" to_port="out 1"/>

    <operator activated="true" class="log" compatibility="7.3.001" expanded="true" height="82" name="Log (2)" width="90" x="179" y="34">

    <operator activated="true" class="text:extract_information" compatibility="7.3.000" expanded="true" height="68" name="Extract Information" width="90" x="514" y="34">

    <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="648" y="34">

    <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="782" y="34">

    <operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="82" name="Append" width="90" x="313" y="34"/>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="34">

    <operator activated="true" class="filter_examples" compatibility="7.3.001" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="34">

    <operator activated="true" class="order_attributes" compatibility="7.3.001" expanded="true" height="82" name="Reorder Attributes" width="90" x="715" y="34">

    <operator activated="true" class="write_excel" compatibility="7.3.001" expanded="true" height="82" name="Write Excel" width="90" x="849" y="34">

  • carl, Member, Posts: 30, Guru

    Perfect, thank you. That worked.

    Yes, only after PDFs. There was at least one ODT, but I filtered those out as I couldn't see an operator to handle those.
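
The thread does not say how the non-PDF URLs were filtered out. As one possible approach, a minimal Python sketch of filtering by file extension is shown here; `is_pdf_url` and `split_by_type` are hypothetical helper names, and the check assumes the URL path actually ends in `.pdf`, so a PDF served from an extensionless URL would be missed.

```python
import os
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """True if the path part of the URL ends in .pdf (case-insensitive)."""
    path = urlparse(url).path  # drops any query string or fragment
    return os.path.splitext(path)[1].lower() == ".pdf"


def split_by_type(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate URLs into (pdf_urls, other_urls), preserving input order."""
    pdfs: List[str] = []
    others: List[str] = []
    for url in urls:
        (pdfs if is_pdf_url(url) else others).append(url)
    return pdfs, others


pdfs, others = split_by_type(
    [
        "https://example.org/docs/report.PDF?download=1",
        "https://example.org/docs/notes.odt",
        "https://example.org/docs/summary.pdf",
    ]
)
```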

  • sgenzer, Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959, Community Manager

    Oh good, glad it worked.

    Yes, I don't know of a way to pull in .doc, .docx, .odt, etc. nicely. Maybe there's an API that you can use to convert to PDF or text? Otherwise submit to "Ideas". :)

    Scott
