Text Mining of multiple PDF files with separate key word counts

bazi66 Member Posts: 5 Contributor I
edited December 2018 in Help

Hello all,

I am new to this community and hope that somebody can help me. I already searched the forum a lot and found very good topics, but I couldn't find a proper solution for my task. Here's what I want to do:

I have about 500 PDF files and want to text mine them and compare the results to key words I already have in Excel.

The problem is that I want a word count and a comparison for each PDF file (not overall), with a column for the results in an Excel sheet. When I start my process with the "Process Documents from Files" operator with a "Tokenize" operator inside it, I only get back the sum over all documents, not the counts for each PDF file.
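
To make the goal concrete, here is a minimal Python sketch (outside RapidMiner) of counting keywords separately for each document instead of summing over all of them. It assumes the PDF text has already been extracted into plain strings; the function name `keyword_counts_per_document` and the sample texts are hypothetical, and the tokenizer simply splits on non-letter characters.

```python
import re
from collections import Counter


def keyword_counts_per_document(texts, keywords):
    """Return {document_name: {keyword: count}}, one entry per document.

    `texts` maps a document name to its already-extracted text (assumption:
    PDF text extraction happened elsewhere).
    """
    wanted = {k.lower() for k in keywords}
    result = {}
    for name, text in texts.items():
        # Lowercase and split on anything that is not a letter.
        tokens = re.findall(r"[A-Za-z]+", text.lower())
        counts = Counter(t for t in tokens if t in wanted)
        # Report every keyword, including those that never occur (count 0).
        result[name] = {k: counts[k] for k in sorted(wanted)}
    return result


# Hypothetical sample data standing in for two extracted PDFs.
texts = {
    "a.pdf": "Risk and return. Risk matters.",
    "b.pdf": "Return on equity.",
}
per_doc = keyword_counts_per_document(texts, ["risk", "return"])
print(per_doc)
```

Keeping one dictionary per document is what prevents the counts from collapsing into a single overall sum.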

I already tried a different approach: a "Loop" operator, starting with a "Read Document" operator. I got no results out of that.

I attached my approaches (I use RapidMiner Studio). Can someone maybe help me with the right approach and the correct process map?

Thank you very much for your help in advance!

1st approach:

<operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="136">
<parameter key="extract_text_only" value="false"/>

2nd approach:

<operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="45" y="136">

Best Answer

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    Hi @bazi66, does this help? I disabled the Write Excel operator, but you can re-enable it if you want.

    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="179" y="34">
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>

    Scott

    [EDIT - OK, I think you probably want to aggregate by word. I did this in the next process and also added case transformation and some stemming. It just made sense to me.]

    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="179" y="34">
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <operator activated="true" class="aggregate" compatibility="9.0.001" expanded="true" height="82" name="Aggregate" width="90" x="313" y="34">

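One possible reading of "aggregate by word" with case transformation and stemming, sketched in plain Python rather than as a RapidMiner process. The `naive_stem` helper is a hypothetical stand-in that strips a few suffixes; it is not the stemmer used in the process above, and the sample texts are made up.

```python
import re
from collections import Counter


def naive_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def aggregate_by_word(texts):
    """Sum counts per stemmed, lowercased word across all documents."""
    totals = Counter()
    for text in texts.values():
        tokens = re.findall(r"[A-Za-z]+", text.lower())  # case transformation
        totals.update(naive_stem(t) for t in tokens)      # stemming, then counting
    return totals


# Hypothetical extracted texts for two PDFs.
totals = aggregate_by_word({"a.pdf": "Models model data", "b.pdf": "Modeling DATA"})
print(totals.most_common())
```

Lowercasing and stemming before counting merges variants such as "Models" and "Modeling" into one row, which is why the aggregated results tend to look cleaner.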

Answers

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Hi @bazi66,

    I think the 2nd approach is more appropriate here. It requires a small change in the process flow.

    Place the 'Write Excel' operator outside the 'Loop Files' operator, together with an 'Append' operator. This way you get the list of words from every PDF file in one Excel sheet. Here is the updated process. Hope this helps.

    If you are still getting errors, please share 2 or more of the PDF files here so we can take a look in detail.

    Cheers,

  • bazi66 Member Posts: 5 Contributor I

    Hi,

    thanks for your quick response. Unfortunately, I still don't get the expected result. The resulting Excel file contains 3 columns: word, in documents, total. I need the columns: word, count in PDF1, count in PDF2, ..., total. A separate Excel sheet/workbook for every PDF with the 3 columns would also be fine for me. I attached the code and 2 example PDFs. Thank you for your help!
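
For illustration, the requested layout (word, count in PDF1, count in PDF2, ..., total) could look like the following Python sketch. It assumes the text of each PDF is already available as a string; `wide_count_table` and `to_csv` are hypothetical helper names, and CSV is used here only as a stand-in for an Excel sheet.

```python
import csv
import io
import re
from collections import Counter


def wide_count_table(texts):
    """Build rows: word, one count column per document, then a total column."""
    names = sorted(texts)
    counts = {n: Counter(re.findall(r"[A-Za-z]+", texts[n].lower())) for n in names}
    # Vocabulary is the union of all words seen in any document.
    vocabulary = sorted(set().union(*counts.values()))
    rows = [["word"] + [f"count in {n}" for n in names] + ["total"]]
    for word in vocabulary:
        per_doc = [counts[n][word] for n in names]  # 0 where the word is absent
        rows.append([word] + per_doc + [sum(per_doc)])
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text (stand-in for writing an Excel sheet)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted texts for two PDFs.
rows = wide_count_table({"PDF1": "alpha beta alpha", "PDF2": "beta gamma"})
print(to_csv(rows))
```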

    Code:

    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="45" y="136">

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Thanks for sharing the files, @bazi66. Here is the updated process XML. It outputs 2 different word lists (one for each PDF) as CSV files. I'm using the PDF file name, extracted as the 'file_name' macro, to name the CSV files. Hope this helps.

    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34">

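The idea of writing one word list per PDF, named after the source file, can be sketched in Python as follows. This is not the RapidMiner process itself: it assumes the text has already been extracted, the function name `write_wordlist_per_file` is hypothetical, and the variable `file_name` only mirrors the role of the 'file_name' macro.

```python
import csv
import re
import tempfile
from collections import Counter
from pathlib import Path


def write_wordlist_per_file(texts, out_dir):
    """Write one CSV (word, total) per document, named after the source file."""
    out_dir = Path(out_dir)
    written = []
    for source_name, text in texts.items():
        file_name = Path(source_name).stem  # e.g. "report_a" from "report_a.pdf"
        counts = Counter(re.findall(r"[A-Za-z]+", text.lower()))
        target = out_dir / f"{file_name}.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["word", "total"])
            writer.writerows(sorted(counts.items()))
        written.append(target)
    return written


# Hypothetical extracted texts; a temporary directory keeps the demo self-contained.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_wordlist_per_file(
        {"report_a.pdf": "one two two", "report_b.pdf": "three"}, tmp
    )
    names = [p.name for p in paths]
    first = paths[0].read_text(encoding="utf-8").splitlines()
    print(names, first)
```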
    Cheers,

  • bazi66 Member Posts: 5 Contributor I

    Hi,

    thank you very much for your help.

    @Pavithra_Rao your solution works very well; the multiple CSV files wouldn't have been an issue for me. Thanks!

    @sgenzer your solution works perfectly and is more convenient for me. Thanks also for the additional process steps; the results look much better now.

    Cheers,
