Create a wordlist from a PDF file via a URL

jan_spoerer Member Posts: 10 Contributor I
edited October 2019 in Help

I would like to create a wordlist (for applying a machine learning model that was specified before) with a PDF as the source. This usually works using the Process Documents operator, but I need to access the PDF via a URL. I thought about using the Web Mining extension for this.
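For orientation, a wordlist in this sense is roughly a table of tokens and their counts. The following is a minimal Python sketch of that idea, assuming the text has already been extracted from the PDF. The function name `build_wordlist` and the `min_length` parameter are hypothetical, and this is not a description of how the Process Documents operator works internally.

```python
import re
from collections import Counter


def build_wordlist(text: str, min_length: int = 2) -> Counter:
    """Count lowercase word tokens in already-extracted plain text."""
    # Tokenize on runs of letters; digits, underscores and punctuation
    # act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short tokens such as the "s" left over from "bank's".
    return Counter(token for token in tokens if len(token) >= min_length)
```

Calling `build_wordlist(text).most_common(20)` would list the twenty most frequent tokens of the extracted text.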

The Get Pages operator does not work; it seems to accept only HTML as input. The output from this operator is just a string of "strange" characters, so there seems to be a problem with the data format, i.e., with PDF.
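The "strange" characters are consistent with binary PDF bytes being interpreted as text. One way to check this outside RapidMiner is to look at the first bytes of the downloaded payload, since PDF files start with the marker `%PDF-`. A minimal Python sketch follows; the function names are hypothetical and the check is a heuristic, not a full format validation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # PDF files begin with "%PDF-" followed by a version, e.g. "%PDF-1.7".
    return data.startswith(b"%PDF-")


def describe_payload(data: bytes) -> str:
    """Classify a payload as 'pdf', 'text' or 'binary' (rough heuristic)."""
    if looks_like_pdf(data):
        return "pdf"
    try:
        # HTML and other text pages normally decode cleanly as UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    return "text"
```

If `describe_payload` reports `"pdf"`, the content needs a PDF text extractor before any tokenization, which matches the behavior described above.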

The Process Documents from Web operator does not work at all. No wordlist can be created.

The Get Page operator also does not work because I cannot convert the PDF to a document. There seems to be no operator to do that. A PDF to Document operator would be great because then I could just use the Process Documents operator to create my wordlist.

Is there an operator that converts PDFs to documents? Is there an operator that creates a wordlist from a PDF that is accessed via a URL? Is there any other way to create wordlists from PDFs that are accessed via a URL?

You can find the code below. Thank you.

[Process XML not preserved. Recoverable fragments: the source URL https://www.db.com/ir/en/download/Deutsche_Bank_Annual_Report_2016.pdf (also given without the www prefix) and a deactivated operator: <operator activated="false" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents (2)" width="90" x="313" y="238">]


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @jan_spoerer There is a Read PDF Table operator, but usually PDFs can be processed from file directories using the Process Documents from Files operator. So you might have to download them first and put them in a directory. Maybe someone in the community has a neat hack for making this easier.

  • jczogalla Employee, Member Posts: 144 RM Engineering

    Hi @jan_spoerer,

    You can use the Open File operator to get an RM file object from a URL (select this option in the "resource type" parameter). You can iterate over your URLs with a Loop operator, read them as files, and then use the operator that @Thomas_Ott suggested.

    If you have the URLs as a table/example set, you could for example use the Loop Values Operator.

    Cheers

    Jan

  • jan_spoerer Member Posts: 10 Contributor I

    Thanks, Thomas, for that suggestion. Until now, we accessed the PDFs directly via the directory, but this is exactly what we don't want to do from now on because we thought it might be easier to just generate a URL in our web app that RM can access. Unfortunately, we may need to resort to downloading the PDF to the repository, as you said. But I will first try Jan's solution.

  • jan_spoerer Member Posts: 10 Contributor I

    Thanks @jczogalla, I will try your suggestion and use the direct repository access (as suggested by @Thomas_Ott) as a fallback option. Will update you soon.

  • jan_spoerer Member Posts: 10 Contributor I

    @Thomas_Ott @jczogalla

    Just to be clear: The PDF is just normal text. No table extraction is wanted here.

    One part of the problem is solved: using the Open File operator, I can now download the PDF. As there seems to be no way to create a wordlist directly from that PDF, I stored it in the repository as suggested. Now I would like to use the Process Documents from Files operator to access the file in the directory, but I cannot access it because the process runs on an RM Server and the downloaded PDF is stored in the server repository. When I click on the "text directories" parameter of the Process Documents from Files operator, the RM Server repository does not show up as an available directory.

    So I'm stuck right at this step:


    @Thomas_Ott wrote:

    @jan_spoerer There is a Read PDF Table operator, but usually PDFs can be processed from file directories using the Process Documents from Files operator. So you might have to download them first and put them in a directory.

    The file is downloaded and stored successfully but I cannot access it using the Process Documents from Files operator.

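    Is there a way to use the Process Documents from Files operator to access a file on the server repository?

    [Process XML not preserved. Recoverable fragments: the source URL https://db.com/ir/en/download/Deutsche_Bank_Annual_Report_2016.pdf and a connection into the port "document 1" that apparently comes from the "document" port of an operator named "Transform Cases (4)".]
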
  • jczogalla Employee, Member Posts: 144 RM Engineering

    You could store the files in a temp folder (with Write File Operator, e.g. to /tmp/RM/process_name/x) and then use that folder in Process Documents from Files. But you are right, it would be better if the Operator actually had an input port for files or could loop in the repository.
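The staging step suggested here (write the downloaded files into a temporary folder, then point a directory-based reader at it) might look as follows in plain Python. This is a sketch under assumptions, not the Write File operator: the function name `stage_files`, the `rm_pdfs_` prefix, and the `{name: bytes}` input shape are all hypothetical.

```python
import tempfile
from pathlib import Path


def stage_files(files, base_dir=None):
    """Write {name: bytes} into a fresh temporary folder and return its path."""
    if base_dir is not None:
        Path(base_dir).mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a uniquely named directory, so parallel runs of the
    # same process do not overwrite each other's files.
    folder = Path(tempfile.mkdtemp(prefix="rm_pdfs_", dir=base_dir))
    for name, data in files.items():
        # Keep only the final path component to avoid writing outside the folder.
        (folder / Path(name).name).write_bytes(data)
    return folder
```

The returned folder is what a directory-based reader would be pointed at. Cleaning it up after the run (for example with `shutil.rmtree`) is left to the caller.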

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @jczogalla I think that's the more elegant solution: allowing an input port on the Process Documents from Files operator. So, when can we have it? :)
