Importing Word Documents into RapidMiner

BrilliantData · Member · Posts: 1 · Contributor I
edited December 2018 in Help

On a project for a recent client I needed to apply some common Natural Language Processing (NLP) techniques to surveys they had gathered, but one of the requirements for the project was that the source document had to remain in Word's .docx format and couldn't be exported to .txt. RapidMiner was the tool of choice for this engagement since it is graphical in nature and has a very usable library for text analysis, but what it doesn't have is an operator that specifically imports .docx files.

Microsoft Word .docx files are essentially zip archives that contain an XML representation of the actual document (the main body text lives in word/document.xml). It stands to reason that if you can unzip the wrapper and get to the XML inside, you have a good chance of being able to read the document and do whatever you need in terms of analysis. RapidMiner has an operator for executing custom Python scripts (if you download the Python extension), so I chose to start there and see if it could handle those tasks.
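To make the "zip file with XML inside" idea concrete, here is a minimal sketch that lists the parts of a .docx archive using only the Python standard library. The helper names `list_docx_parts` and `make_toy_docx` are made up for this illustration, and the toy archive is only a stand-in so the example runs without a real Word file.

```python
import io
import zipfile


def list_docx_parts(source):
    """Return the member names of a .docx archive (a path or a file-like object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def make_toy_docx():
    """Build a tiny in-memory stand-in for a .docx (illustration only, not a complete Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # With a real file you would pass its path instead of the toy archive.
    for name in list_docx_parts(make_toy_docx()):
        print(name)
```

Running the same function on a real .docx should show word/document.xml alongside styles, relationships, and other supporting parts.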

Using Python in RapidMiner

First we'll need to download the Python extension, which you can do by going to Extensions-->Marketplace in the menu at the top of the page. It's one of the most popular downloads, so just go to "Top Downloads," select it from the list, and click "Install Packages" at the bottom of the window. You'll need to restart RapidMiner afterwards for the extension's operators to become available.

To use a custom Python script, search for the "Execute Python" operator and drag it onto the workflow. Double-click and you'll see the usual parameter editing box on the top right of the screen, which should contain a button labeled "Edit Text." This is where we'll enter the code.

The Code

I try not to reinvent the wheel when coding, so I Googled the problem to see if someone had tackled it before me and someone definitely had. The code I used is below:

[image: screenshot of the Python code]

If you want to download it straight from Etienne's blog, just follow this link:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
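Since the code in the original post was only a screenshot, here is a minimal sketch of the same general approach: open the .docx as a zip, read word/document.xml, and join the text runs of each paragraph. It is written for this thread and is not a copy of the code from the linked post. The helper `make_toy_docx` is hypothetical and exists only so the example can run without a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_text(source):
    """Return the plain text of a .docx, with a blank line between paragraphs."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # The text of one paragraph may be split across several <w:t> runs.
        pieces = [node.text for node in paragraph.iter(W_NS + "t") if node.text]
        if pieces:
            paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)


def make_toy_docx(paragraphs):
    """Build a minimal in-memory stand-in for a .docx (illustration only)."""
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(escape(text))
        for text in paragraphs
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>' + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(get_docx_text(make_toy_docx(["First answer.", "Second answer."])))
```

With a real survey file you would pass the file path to `get_docx_text`. How the resulting text is handed on to the next RapidMiner operator depends on your process and is not shown here.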

The initial workflow looked like this:

[image: screenshot of the initial workflow]

After using Etienne's code to unwrap the .docx file, it was easily readable by the "Read Document" operator. After that I transformed all words to lowercase, tokenized them, removed stop words, then converted the resulting word list to data and loaded it into a database for analysis. Simple.
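In the original process these steps were done with RapidMiner's text processing operators, not with code. Purely as an illustration of what the steps do, here is a rough plain-Python sketch of lowercasing, tokenizing, removing stop words, and turning the word list into rows of data. The stop word set is a tiny made-up sample, and the letters-only tokenizer is an assumption that suits simple English text.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "in", "it"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into letter-only tokens, and drop stop words."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    return [token for token in tokens if token not in stop_words]


def word_counts(text):
    """Count how often each remaining token occurs."""
    return Counter(preprocess(text))


def to_rows(text):
    """Turn the counts into (word, count) rows, ready to load into a table."""
    return sorted(word_counts(text).items())


if __name__ == "__main__":
    for word, count in to_rows("The survey was GREAT, and the staff was helpful."):
        print(word, count)
```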

Answers

  • sgenzer · Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator · Posts: 2,959 · Community Manager

    Hello @BrilliantData, welcome to the community and thanks for sharing this! It's actually similar to another thread from last December about xlsx files (see https://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/Extract-Sheet-name-from-an-Excel-file/m-p/44747).

    Scott

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    Wonderful solution to a common problem! If you would be willing to post an anonymized version of the process, I am sure there are many community members that would be grateful!

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • orsan_awawdi · Member · Posts: 3 · Contributor I

    This is brilliant.

    I can't find the Read Document operator. Any idea?

    I'm using RapidMiner Studio 8.1.

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    Did you install the free text mining extension? All the document operators are in that extension and not in the base version of Studio. Just search for Text Processing on the Marketplace and it will come up.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • orsan_awawdi · Member · Posts: 3 · Contributor I

    Yes, you are right, it is right there.

    For some reason, it is failing on an indentation issue. I don't know why.

    ---

    File "", line 21
    document = zipfile.ZipFile('C:/Users/orsana/Desktop/MMO.docx')
    ^
    IndentationError: unindent does not match any outer indentation level

    ---

    Attachment: iden.jpg

  • orsan_awawdi · Member · Posts: 3 · Contributor I

    I think I know what is wrong here. I will fix it.
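The thread does not say what the fix was. For anyone hitting the same IndentationError, here is a small sketch that reproduces the message and shows a corrected version. It assumes the usual cause: a line inside a function that is dedented to a column matching no enclosing block, which often happens when code is retyped from a screenshot or pasted with mixed tabs and spaces. The function name `check_indentation` and both sample snippets are made up for the illustration.

```python
def check_indentation(source):
    """Return None if the source compiles, else the IndentationError message."""
    try:
        compile(source, "<script>", "exec")
    except IndentationError as error:
        return error.msg
    return None


# The last line is dedented to 4 spaces, but the enclosing blocks sit at 0 and 8.
BROKEN = (
    "def get_text(path):\n"
    "        if path:\n"
    "            pass\n"
    "    document = path\n"
)

# Same code with every block indented by a consistent 4 spaces.
FIXED = (
    "def get_text(path):\n"
    "    if path:\n"
    "        pass\n"
    "    document = path\n"
)

if __name__ == "__main__":
    print("broken:", check_indentation(BROKEN))
    print("fixed:", check_indentation(FIXED))
```

Re-indenting the whole script with one consistent style (four spaces per level is the common convention) before pasting it into the Execute Python editor is usually enough.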

  • blake_galbreath · Member · Posts: 4 · Contributor I
    This is a great article, but I still can't quite figure out how to actually get the Word doc into the RM repository in order to feed it into the process described above. I tried using the Import Data module, but it only seems to allow Binary, Excel, and CSV. Where do I go to import .docx files?
  • rfuentealba · Moderator, RapidMiner Certified Analyst, Member, University Professor · Posts: 568 · Unicorn
    edited April 2020
    I got it as a Building Block.

    You just use the operator Open File to pass the Word document, and then insert the building block here.

    Before pasting the building block into your system, remove the .txt extension I had to add.

    Usage:


  • blake_galbreath · Member · Posts: 4 · Contributor I
    @rfuentealba I believe this will work.