Loading Folder Names for Text Processing
So we have a lot of files in different folders, and when we bring them into RapidMiner for analysis, we believe the folder name to be important as an input or simply as an identifying piece of information, so we want to read it.
I admit this is probably one of the things one never expects to have to do, and yet I had to do this for a customer; learning to do it deepens one's skills and showcases the flexibility of RapidMiner.
If you follow this article carefully, you can use the attached process and repeat what we have done here. There will be no data files attached because in every case these will be different and at different places in your file system. Let me explain what I am using and why:
- The Text Processing extension installed.
- 2 empty text files called applezz.txt and orangezz.txt
- 1 folder in my Documents folder containing two folders named apple and orange, which in turn contain their respective text files mentioned above. And to make things clear:
C:\Users\KonstantinosBonikos\Documents\delete\apple\applezz.txt
C:\Users\KonstantinosBonikos\Documents\delete\orange\orangezz.txt
- The number 46. This number will be different for you; it is derived by counting the number of characters in the full path that come before the name of the folder we want.
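The counting described above can also be done programmatically instead of by hand. The following is a minimal Python sketch, not part of the RapidMiner process; the helper name `folder_name_start` and the example path are hypothetical and chosen only for illustration.

```python
def folder_name_start(base_dir: str) -> int:
    """Return the number of characters that precede the folder name.

    base_dir is the part of the path up to and including the separator
    that sits right before the folder we want (hypothetical helper).
    """
    return len(base_dir)


# Hypothetical example path, not the one used in this article.
base_dir = "C:\\data\\fruit\\"
full_path = base_dir + "apple\\applezz.txt"

start = folder_name_start(base_dir)
# Everything from `start` onward begins with the folder name.
remainder = full_path[start:]
```

Whatever value you get for your own base directory plays the role that 46 plays in this article.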
1. First we place a Loop Files subprocess operator onto the Process area. Make sure to tick both the recursive and enable macros tickboxes as below.
Don't worry if enable parallel execution is not an option in the version of RapidMiner you are using; it is not important here.
Make sure to point the directory parameter to where you have your folders saved.
2. Double-click the Loop Files subprocess operator and place a Read Document and a Process Documents operator with default values, connecting them as normal (like in the screenshot below):
The inside of the Process Documents operator is empty, with a through connection:
What we are doing here is reading the applezz.txt and orangezz.txt files as documents and, by processing them, importing their path name as metadata.
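For readers who think in code, the idea behind steps 1 and 2 (walk the folders recursively and keep each file's path as metadata) can be sketched in Python. This is only an analogy to the Loop Files setup, not what RapidMiner runs internally; the function name `list_documents` and the record keys other than metadata_path are assumptions made for this sketch.

```python
from pathlib import Path


def list_documents(root: str) -> list[dict[str, str]]:
    """Collect one record per .txt file found under root, recursively."""
    records = []
    for file_path in sorted(Path(root).rglob("*.txt")):
        records.append({
            "file_name": file_path.name,            # e.g. applezz.txt
            "metadata_path": str(file_path),        # full path to the file
            "parent_folder": file_path.parent.name, # immediate folder name
        })
    return records
```

Calling `list_documents` on the directory that holds the apple and orange folders would return one record per text file, each carrying its full path.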
3. We now take the data that is produced, which looks like this:
This is where the counting becomes important. We are going to create a couple of attributes next based on the metadata_path.
4. Connect the data output to a Generate Attributes operator and create the following attributes using formulas.
- The ClassName attribute is set to whatever the folder_name value is, using the expression %{folder_name}
Remember that folder_name was set as a macro by the Loop Files operator when we selected enable macros in step 1.
- The FolderName attribute is set by using cut(Nominal text, Numeric start, Numeric length).
  - Nominal text is the folder name as represented by %{folder_name}
  - Numeric start: this means we need to know where the folder name starts in the path name, and in my case, it was at position 46.
  - Numeric length: this represents how many characters we take. Because this varies with the folder name and has to be a number, we take the length of the total folder name and subtract the position at which the name we want starts: length(%{folder_name})-46.
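The arithmetic behind the cut expression can be checked outside RapidMiner with a small Python sketch. The `cut` function below is a stand-in written for this illustration, and it assumes the start index is zero-based (as in Python slicing); check the function description in your RapidMiner version if you are unsure. The folder path and the offset 14 are hypothetical and unrelated to the 46 used in this article.

```python
def cut(text: str, start: int, length: int) -> str:
    """Return `length` characters of `text` beginning at zero-based `start`."""
    return text[start:start + length]


# Hypothetical folder path; the offset 14 is specific to this example.
folder_path = "C:\\data\\fruit\\apple"
start = 14

# Same shape as cut(%{folder_name}, start, length(%{folder_name}) - start).
folder_name = cut(folder_path, start, len(folder_path) - start)
```

Subtracting the start position from the total length is what lets the same expression work for folder names of different lengths.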
5. Run the process and we get the following results:
This, evidently, gives us the folder names as data.
Feel free to download the attached process as an .rmp file. It can be imported via File > Import Process.