Processing PDF documents for text mining with the Process Documents from Files operator

jan_spoerer Member Posts: 10 Contributor I
edited December 2018 in Help

I tried processing large PDF documents using the Process Documents from Files operator. When I run the process, RapidMiner fails at that operator with the error message: "Process failed. javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher."

According to Marco Böck's post in this thread, the operator should be able to process PDF documents by now, if I understood him correctly.

Is there a way to process PDF documents without any workarounds? Any hints are highly appreciated. Thank you!


Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Are any of these PDFs encrypted? I think that is what the error refers to.


Answers

  • jan_spoerer Member Posts: 10 Contributor I

    Thank you, Thomas! After you pointed out that the issue could result from problematic PDFs, I ran RapidMiner on all PDF documents one by one and identified two PDFs that throw error messages. All the other PDFs work perfectly!

    I then loaded those two PDFs into PDF24 Creator and saved them again from there. Now these two PDFs can also be processed by RapidMiner.

    Thank you again!

    Jan
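
For anyone facing the same problem with a large folder, the one-by-one check described above could be approximated outside RapidMiner before running the process. The following is only a minimal sketch under stated assumptions: it relies on the fact that an encrypted PDF references its encryption dictionary through an `/Encrypt` entry, and it simply searches the raw file bytes for that marker. This is a heuristic (a file could contain the string for other reasons), it is not what RapidMiner does internally, and the function names `looks_encrypted` and `find_suspect_pdfs` are hypothetical helpers invented for this illustration.

```python
from pathlib import Path


def looks_encrypted(pdf_path):
    """Heuristic check: True if the raw bytes contain an /Encrypt entry.

    Assumption: an encrypted PDF names its encryption dictionary with
    the /Encrypt key. This is a byte search, not a real PDF parse.
    """
    data = Path(pdf_path).read_bytes()  # reads the whole file into memory
    return b"/Encrypt" in data


def find_suspect_pdfs(folder):
    """Return the sorted file names of PDFs in `folder` that look encrypted."""
    suspects = []
    for path in Path(folder).iterdir():
        # Match .pdf and .PDF alike; skip directories and other file types.
        if path.is_file() and path.suffix.lower() == ".pdf":
            if looks_encrypted(path):
                suspects.append(path.name)
    return sorted(suspects)
```

Calling `find_suspect_pdfs` on the input directory gives a list of candidates to re-save (as was done here with PDF24 Creator) before pointing the Process Documents from Files operator at the folder.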

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @jan_spoerer Great to hear! Good luck with your text processing!

  • vicweller Member Posts: 1 Newbie

    Agreed with the solution: I ran into the same error and didn't work out what it stood for. I tried to encrypt the document through a third-party application; this one did the trick: https://edit-pdf.pdffiller.com/ and then the script worked as it was supposed to. I guess it would also be possible with Acrobat, which is the more common tool for such purposes, but this one is cheaper, and I don't know of any free PDF encryption tool.
