Creating a wordlist from a PDF file via a URL

jan_spoerer · March 2018

I would like to create a wordlist (to apply a previously specified machine learning model) with a PDF as the source. This usually works using the Process Documents operator, but I need to access the PDF via a URL. I thought about using the Web Mining extension for this.

The Get Pages operator does not work; it seems to accept only HTML as input. Its output is just a string of unreadable characters, so there seems to be a problem with the data format, i.e., with PDF.

The Process Documents from Web operator does not work at all. No wordlist can be created.

The Get Page operator also does not work because I cannot convert the PDF to a document. There seems to be no operator to do that. A PDF to Document operator would be great because then I could just use the Process Documents operator to create my wordlist.

Is there an operator that converts PDFs to documents? Is there an operator that creates a wordlist from a PDF that is accessed via a URL? Is there any other way to create wordlists from PDFs that are accessed via a URL?

You can find the code below. Thank you.

[The process XML was lost in extraction. The surviving fragments point to the PDF at https://www.db.com/ir/en/download/Deutsche_Bank_Annual_Report_2016.pdf and include a deactivated operator named "Data to Documents (2)" (class text:data_to_documents, compatibility 8.1.000).]

Thomas_Ott · March 2018

@jan_spoerer There is a Read PDF Table operator, but usually PDFs can be processed from file directories using the Process Documents from Files operator. So you might have to download them first and put them in a directory. Maybe someone in the community has a neat hack for making this easier.

jczogalla · March 2018

Hi @jan_spoerer,

you can use the Open File operator to get an RM file object from a URL (select this option from the resource type parameter). You can iterate over your URLs with a Loop operator, read them as files, and then use the operator that @Thomas_Ott suggested.

If you have the URLs as a table/example set, you could for example use the Loop Values Operator.
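For comparison, the same loop-over-URLs idea can be sketched outside RapidMiner in plain Python. This is only an illustration of the pattern (fetch each URL, save it as a file), not the Open File or Loop Values operators themselves; the function name `download_all` and the fallback file name are made up for this example.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download_all(urls, target_dir):
    """Fetch each URL and save it under target_dir; return the saved paths.

    Hypothetical helper for illustration only.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Use the last path segment of the URL as the file name,
        # falling back to a generated name if the path has none.
        name = Path(unquote(urlparse(url).path)).name or f"document_{index}.pdf"
        destination = target / name
        # Stream the response to disk instead of loading it into memory.
        with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(destination)
    return saved
```

Calling `download_all([...], "some/folder")` with the list of PDF URLs would leave one file per URL in that folder, which a directory-based reader can then pick up.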

Cheers

Jan

jan_spoerer · March 2018

Thanks Thomas for that suggestion. Until now, we accessed the PDFs directly via the directory, but this is exactly what we don't want to do from now on because we thought it may be easier to just generate a URL in our web app that RM can access. Unfortunately, we may need to resort to downloading the PDF to the repository, as you said. But I will first try Jan's solution.

jan_spoerer · March 2018

Thanks @jczogalla, I will try your suggestion and use the direct repository access (as suggested by @Thomas_Ott) as a fallback option. Will update you soon.

jan_spoerer · March 2018

@Thomas_Ott @jczogalla

Just to be clear: The PDF is just normal text. No table extraction is wanted here.

One part of the problem is solved: Using the Open File operator, I can now download the PDF. As there seems to be no way to directly create a wordlist from that PDF, I stored it in the repository as suggested. Now I would like to use the Process Documents from Files operator to access the file in the directory, but I cannot access the file because the process runs on an RM Server and the downloaded PDF is stored in the server repository. When I click on the "text directories" parameter of the Process Documents from Files operator, the RM Server repository just does not show up as an available directory.

So I'm stuck right at this step:

@Thomas_Ott wrote:
@jan_spoerer There is a Read PDF Table operator, but usually PDFs can be processed from file directories using the Process Documents from Files operator. So you might have to download them first and put them in a directory.

The file is downloaded and stored successfully but I cannot access it using the Process Documents from Files operator.

Is there a way to use the Process Documents from Files operator to access a file on the server repository?

[The process XML was lost in extraction. The surviving fragments point to the PDF at https://db.com/ir/en/download/Deutsche_Bank_Annual_Report_2016.pdf and a connection from the "document" port of "Transform Cases (4)" to a "document 1" port.]

jczogalla · March 2018

You could store the files in a temp folder (with Write File Operator, e.g. to /tmp/RM/process_name/x) and then use that folder in Process Documents from Files. But you are right, it would be better if the Operator actually had an input port for files or could loop in the repository.
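As a rough, non-RapidMiner illustration of the "write to a folder, then process the folder" pattern, here is a small Python sketch that builds a wordlist from every file in a directory. The names `build_wordlist` and `read_plain_text` are hypothetical. Extracting text from real PDFs would need a PDF library, which is deliberately left out: the `extract_text` hook is where such a step would plug in, and the default simply reads plain text files.

```python
import re
from collections import Counter
from pathlib import Path


def read_plain_text(path):
    """Placeholder extractor: treats the file as UTF-8 text.

    A real PDF would need a PDF text extractor passed in instead.
    """
    return Path(path).read_text(encoding="utf-8", errors="ignore")


def build_wordlist(directory, pattern="*.txt", extract_text=read_plain_text):
    """Count lowercased word tokens across all matching files in a directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        # Lowercase first, similar in spirit to a case-transformation step.
        text = extract_text(path).lower()
        # Keep runs of letters only; digits and punctuation are dropped.
        counts.update(re.findall(r"[^\W\d_]+", text))
    return counts
```

The result is a `Counter` mapping each word to its total number of occurrences, so `build_wordlist(folder).most_common(20)` would list the most frequent terms.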

Thomas_Ott · March 2018

@jczogalla I think that's the more elegant solution, allowing an input port on the Read Documents from Files operator. So, when can we have it?
