Loading Folder Names for Text Processing

KostasBonikos, Member, Posts: 25, Maven
edited November 2018 in Knowledge Base

So we have a lot of files in different folders, and when we bring them into RapidMiner for analysis, we believe the folder name to be important as an input or simply as an identifying piece of information, so we want to read it.

I admit this is probably one of the things one never expects to have to do, and yet I had to do it for a customer; learning to do it deepens one's skills and showcases the flexibility of RapidMiner.

If you follow this article carefully, you can use the attached process and repeat what we have done here. There will be no data files attached because in every case these will be different and at different places in your file system. Let me explain what I am using and why:

- The Text Processing extension installed.

- 2 empty text files called: applezz.txt and orangezz.txt

- 1 folder (delete) in my Documents folder containing two folders named apple and orange, which in turn contain the respective text files mentioned above. To make things clear:

C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt

C:\Users\KonstantinosBonikos\Documents\delete\orange\orangezz.txt

- The number 46. This number will be different for you and is derived by counting the number of characters before the name of the folder we want. In this case:

Loop Folder Names Article 1.png
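
The count can be double-checked outside RapidMiner. The following is a minimal Python sketch, not part of the attached process; it assumes the example path from this article and uses PureWindowsPath so that it behaves the same on any operating system. The variable names are made up for illustration.

```python
from pathlib import PureWindowsPath

# Example path from this article; yours will differ.
file_path = r"C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt"

# The folder we want is the file's parent ("apple");
# everything before it is the base path.
folder = PureWindowsPath(file_path).parent   # ...\delete\apple
base = folder.parent                         # ...\delete

# Characters before the folder name: the base path plus one backslash.
offset = len(str(base)) + 1
print(offset)
```

For the path above this prints 46, which is the number used in the rest of the article.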

1. First we place a Loop Files subprocess operator on to the Process area. Make sure to tick both the recursive and enable macros tickboxes as below.

Loop Folder Names Article 2.png

Don't worry if enable parallel execution is not an option in the version of RapidMiner you are using; it is not important here.

Make sure to point the directory parameter to where you have your folders saved.

2. Double-click the Loop Files subprocess operator and place a Read Document and a Process Documents operator inside it with default values, connecting them as normal (like in the screenshot below):

Loop Folder Names Article 3.png

The inside of the Process Documents operator is empty, with just a through connection:

Loop Folder Names Article 4.png

What we are doing here is reading the applezz.txt and orangezz.txt files as documents and, by processing them, we are importing their path name as metadata.
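
For readers who think in code, the effect of steps 1 and 2 can be pictured roughly as below. This is only an illustrative Python sketch under my own assumptions, not how RapidMiner implements Loop Files: the function collect_paths is hypothetical, and the dictionary keys simply borrow the names folder_name and metadata_path used in this article.

```python
import os
import tempfile

def collect_paths(root):
    """Return one record per file under root, visiting subfolders too."""
    records = []
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            records.append({
                "file_name": name,
                "folder_name": folder,  # full path of the containing folder
                "metadata_path": os.path.join(folder, name),
            })
    return records

# Demo on a throwaway copy of the article's folder layout.
with tempfile.TemporaryDirectory() as root:
    for fruit in ("apple", "orange"):
        os.makedirs(os.path.join(root, fruit))
        open(os.path.join(root, fruit, fruit + "zz.txt"), "w").close()
    records = collect_paths(root)

for record in records:
    print(record["file_name"], "->", record["metadata_path"])
```

Each file yields one row carrying its full path, which is the raw material for the attributes created in the next steps.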

3. We now take the data that is produced, which looks like this:

Loop Folder Names Article 5.png

This is where the counting becomes important. We are going to create a couple of attributes next based on the metadata_path.

4. Connect the data output to a Generate Attributes operator and create the following attributes using formulas.

Loop Folder Names Article 6.png

Loop Folder Names Article 7.png

- The ClassName attribute is set to whatever the folder_name value is, using the expression %{folder_name}

Remember, folder_name was set as a macro by the Loop Files operator when we selected enable macros in step 1.

- The FolderName attribute is set by using cut(Nominal text, Numeric start, Numeric length).

- Nominal text is the folder name as represented by %{folder_name}

- Numeric start is the position where the folder name starts in the path name; in my case, it was at position 46.

- Numeric length is how many characters we take from that position. This varies with the folder name and it has to be a number, so we take the length of the whole value and subtract the number of characters before the name we want: length(%{folder_name})-46.
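
To see the arithmetic of the FolderName formula at work, here is a small Python sketch. It rests on two assumptions that I am inferring from the formula above rather than quoting from documentation: that %{folder_name} holds the full path of the folder, and that the start position is counted from 0. The Python function cut is only a stand-in for the RapidMiner expression of the same name.

```python
def cut(text, start, length):
    """Stand-in for cut(text, start, length); start counted from 0 (assumption)."""
    return text[start:start + length]

# Assumed value of %{folder_name} for the first file in this article.
folder_name = r"C:\Users\KonstantinosBonikos\Documents\delete\apple"
start = 46  # characters before the folder name; different on your machine

# Same shape as cut(%{folder_name}, 46, length(%{folder_name})-46)
result = cut(folder_name, start, len(folder_name) - start)
print(result)  # apple
```

Because the length is computed from the value itself, the same expression works for folder names of any length, for example orange.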

5. Run the process and we get the following results:

Loop Folder Names Article 8.png

Loop Folder Names Article 9.png

Loop Folder Names Article 10.png

Which, evidently, gives us the folder names as data.

Feel free to download the attached process as an .rmp file. It can be imported via File > Import Process.
