"Import data PDF documents"
Hi there!
I'm completely new to RapidMiner and can't manage to import PDF files into the repository.
It says that it's an unknown file type. I'm sorry for the completely (!) basic question, but I can't find anything about this in the getting-started training.
Thank you very much for your help!
Best Answer
-
MarcoBarradas (Administrator, Employee, RapidMiner Certified Analyst)
Hi, what do you want to do with the PDFs? I guess you are going to do some text mining with them, or you will try to extract some table data from them.
You will need to install the Text Mining extension. In order to do so, open Extensions->Marketplace and search for the extension.
In case you need to extract tables, also install the Data Table Extraction extension.
After that you need to build a process that suits your needs.
I'm posting an example:
< ?xml version = " 1.0 " encoding = " utf - 8 " ?> <过程版本sion="8.2.000">
<运营商激活="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<运营商激活= " true " class = "文本:read_document" compatibility="8.1.000" expanded="true" height="68" name="Open a PDF" width="90" x="112" y="34"/>
<运营商激活="true" class="pdf_table_extraction:pdf2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Get tables from a PDF" width="90" x="112" y="136"/>
<运营商激活= " true " class = "并发性:循环_files" compatibility="8.2.000" expanded="true" height="68" name="Use this one for more than One PDF" width="90" x="246" y="34">
<运营商激活= " true " class = "文本:read_document" compatibility="8.1.000" expanded="true" height="68" name="Open a PDF (2)" width="90" x="246" y="34"/>In order tu paste the xml code I just posted you will need to click on
View->Show Panel->XML you'll see a new view called XML remove the code and paste the one I gave you, Click on on the green check and then return to the process view.
For further steps you can follow these videos:
https://www.youtube.com/watch?v=ophGqpUexKI&list=PLssWC2d9JhOZLbQNZ80uOxLypglgWqbJA
4 Answers
Hi @nina_ploetzl,
to add to @MarcoBarradas' fantastic comment: Read Document has an option to read the text of a PDF. It's also part of the Text Mining extension.
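For illustration, reading a PDF with Read Document would look roughly like this in the process XML (a sketch only; the parameter keys are recalled from memory and may differ between versions):
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" name="Read Document">
  <parameter key="extract_text_only" value="true"/>
  <parameter key="use_file_extension_as_type" value="false"/>
  <parameter key="content_type" value="pdf"/>
</operator>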
BR,
Martin
Dortmund, Germany
Thank you very much @mschmitz, that's what I was looking for as a first step to import the PDFs.
But I can only import them one by one. Is there any possibility to import hundreds at once?
Thank you very much @MarcoBarradas for the extensive reply! I will try that, but I'll need some time; like I said, I'm completely new to RapidMiner and have no experience with data mining or anything related.
I want to analyze a set of 1700 research articles and classify them by which research method they use. So I want to import these PDFs, look for specific words in them, and if they contain these words, categorize them into groups...
Hi @nina_ploetzl,
you can use Loop Files to iterate over folders and pass the full path to Read Document.
For classification: have a look at Extract Topics from Documents. It's an operator which is part of the Operator Toolbox extension.
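A rough, untested sketch of that wiring in the process XML (the inner port names such as "file object" are assumptions from memory and may vary by version):
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" name="Loop Files">
  <parameter key="directory" value="C:/path/to/your/pdfs"/>
  <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" name="Read Document"/>
    <connect from_port="file object" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_port="output 1"/>
  </process>
</operator>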
Best,
Martin
Dortmund, Germany
Hello @mschmitz!
Thank you so much for helping me.
I tried this version now, but somehow it doesn't work when I press Run, and the tutorials on YouTube are for old versions that look different.
It says "not enough iterations" or that there is no output from the Loop Files operator... see the attached screenshot...
Best, Nina
You need to put the Read Document operator into Loop Files.
Further, you want to use a regex with .+ on the folder to catch all documents.
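If the parameter keys are what I remember them to be (treat these as assumptions), that filter would be set on Loop Files like this:
<parameter key="filter_type" value="regex"/>
<parameter key="filter_by_regex" value=".+"/>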
cheers,
Martin
Dortmund, Germany
This is my XML code for this process which is working:
-----> but now when I use the Operators Tokenize, Stem and Filter Stopwords in the inside process of Loop files --> it doesn't work anymore. The problem is at the "Documents to Data" operator. If I put a breakpoint before it works and tokenizes, stems, filters all the examples correctly. But if I put a breakpoint after this operator it just gives me the whole unreduced text. And if I look at the whole outcome of the process, it just gives me 1 example anymore. Not the 20 which I imported.
I think you pointed me to an issue with our dataframe. Can you check if the attached process does what you want to do?
BR,
Martin
< ?xml version = " 1.0 " encoding = " utf - 8 " ?> <过程版本sion="9.0.003">
<运营商激活="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<运营商激活= " true " class = "并发性:循环_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files" width="90" x="112" y="34">
<运营商激活= " true " class = "文本:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"/>
<运营商激活="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
<运营商激活="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="380" y="136"/>
<运营商激活="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="136"/>
<运营商激活="true" class="execute_script" compatibility="9.0.003" expanded="true" height="82" name="Execute Script" width="90" x="581" y="34">
return new Document(buffer.toString());"/>
<运营商激活="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="849" y="34">
Dortmund, Germany
I took Martin's XML and tweaked it a little.
The process I attached reads all the txt files from the directory you set under text directories on the Process Documents from Files operator.
In the inner process I pasted your Tokenize, Stem, and Filter Stopwords.
After the process finishes, you can connect the word list to WordList to Data; with this you will know how often a word is used and in how many documents it appears.
The other part will extract the text we got from each file (Select Attributes) and will convert each example to a document that is then connected to Extract Topics from Documents.
For the next step we will need Martin's help, since I don't know the extension that well, but we are closer to the part where we will create a model that classifies each document.
The TF-IDF vector is on the exa output port of the Process Documents from Files operator. Since you are trying to cluster documents, I guess you can connect a clustering operator to that output and it may show you your first clusters.
I saw you tokenized on non-letters; I don't know if it would be a better idea to tokenize on linguistic sentences. @mschmitz, @Thomas_Ott, @IngoRM, what would be your advice?
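As a rough illustration of that clustering idea (operator class and port names are recalled from memory, not a tested process), the wiring could look like:
<operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" name="Process Documents from Files"/>
<operator activated="true" class="concurrency:k_means" compatibility="9.0.003" expanded="true" name="Clustering"/>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>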
When I run the process until Select Attributes, it counts 0 words for "audit" and "auditor" at every example of my data set.
w.r.t the "text" but, somewhat yet. It's simply not allowed to have the attribute twice. Not sure how to overcome this though.
BR,
Martin
Dortmund, Germany
You can find more about the subject at this link: https://www.commonlounge.com/discussion/99e86c9c15bb4d23a30b111b23e7b7b1
A workaround for the duplicate text attribute would be to rename the attribute.
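A minimal sketch of that rename with the Rename operator (the new attribute name here is just a placeholder):
<operator activated="true" class="rename" compatibility="9.0.003" expanded="true" name="Rename">
  <parameter key="old_name" value="text"/>
  <parameter key="new_name" value="document_text"/>
</operator>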
I did some tweaks to the previous version.