Issue With Loop Files Operator
Hi all,
I'm new to the forum and RapidMiner so excuse any redundancies or lack of details.
I am working with the process from Chapter 14 (Robust Language Identification) of RapidMiner: Data Mining Use Cases and Business Analytics Applications, published by CRC Press. The process was downloaded from here: http://rapidminerbook.com/index.php/chapter-downloads-13-24/chapter-14/
Attachment 1 shows a screenshot of the process and attachment 2 shows the Loop Files sub-process.
I successfully loaded the process and downloaded the language corpora from http://corpora.informatik.uni-leipzig.de/download.html
I changed the directory for the Loop Files operator to read from the folder where the corpora are stored. There are five files in the directory (German, English, French, Portuguese and Spanish). The Loop Files operator seems to be successfully reading all of them, but gives a 6th output which seems nonsensical. Attachment 3 shows the expected output for any language file (English in this case). Attachment 4 shows the nonsensical output. Attachment 5 shows the error thrown, presumably by the nonsense output. Could someone tell me why it's happening and how to fix it? Thanks!
Best Answer
thapli_64 (Member, Posts: 18, Maven)
So, I was able to solve the issue (with some debugging help from a colleague- always good to have someone to talk things through with)!
I set up regex filtering (.*\.txt$) in the Loop Files operator so that it only reads in the desired files, in this case the 5 language files ending in .txt.
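For anyone who wants to check what that filter keeps and what it drops, here is a minimal Python sketch. The file names are hypothetical stand-ins for the five corpus files plus the stray hidden file, and Python's re module is used only for illustration. RapidMiner evaluates the expression itself, but a pattern this simple should behave the same way.

```python
import re

# The filter expression from the post above.
FILTER = re.compile(r".*\.txt$")

# Hypothetical file names, assumed for illustration only.
candidates = [
    "german.txt",
    "english.txt",
    "french.txt",
    "portuguese.txt",
    "spanish.txt",
    ".DS_Store",
]

# Keep only names that match the filter in full.
kept = [name for name in candidates if FILTER.fullmatch(name)]
skipped = [name for name in candidates if not FILTER.fullmatch(name)]

print("kept:", kept)
print("skipped:", skipped)
```

Only the names ending in .txt survive, so the loop body never sees the hidden file.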
There was, however, another error that cropped up after this was fixed: a duplicate attribute error for the 'text' attribute. This was due to the 'select attributes and weights' parameter in the Data to Documents operator being checked with no value provided for it. It seems this was already the case in the process as downloaded and was not introduced through human error (or so I'm telling myself :P ).
Answers
Hello @thapli_64 - welcome to the community. First, it would be much easier if you could share your XML process in this thread (see "Helpful Reminders" on the right when you reply), as then we can truly replicate what you are doing. Second, I just looked at that process from RapidMinerBook, and the Loop Files operator it uses has been deprecated since the last release:
[Screenshots: the deprecated Loop Files operator (left); the new Loop Files operator]
So I would suggest moving the operators from inside the old "Loop Files" into a new "Loop Files", rewiring them there, and trying again. Then paste your XML here and we will see what you have.
Scott
Scott,
Thanks for the welcome, reply and advice! I had already tried what you mentioned and it threw the same error. I have attached screenshots. See the XML below:
Thanks!
Racchit.
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="242">
Thomas,
It is actually reading in data from the 5 expected data files (see the German corpus screenshot attached). The error is being thrown by the 6th, unexpected input. I need to figure out how to get rid of that. It shouldn't be there.
Hello @thapli_64 - ok, thanks for that (next time, please use the </> tool to insert your XML). The best way to debug this is to use "breakpoints" and see what the data looks like right before the operator that is causing the trouble:
[Screenshots: adding a breakpoint; breakpoint added]
My guess is the same as @Thomas_Ott's - you will see that on the sixth iteration the attribute "language" is not there.
Scott
Scott,
Thanks again for the advice- I'm learning as I go!
Yes, I have been using breakpoints to figure out what was happening, and that's how I discovered the extra data being read in. You are right, the error is indeed being thrown because the language attribute is missing at that point. I had discovered this earlier but didn't share it because I felt the root cause was the 6th dataset being read in, which shouldn't be there in the first place. Am I wrong to assume that? Is it supposed to be there?
See the screenshots attached. As you can see, I added a breakpoint to see what's being fed into the Set Role (label) operator. The results screenshot shows that the text and language attributes are there for the 5 expected documents but missing for the 6th one. Obviously the Loop Files sub-process is set up to deal with the actual language corpus files, but not with this error case, so it seems the error case shouldn't be there at all. Please correct me if I'm wrong.
So @sgenzer, the culprit seems to be a '.DS_Store' file being read in from the directory by the Read CSV operator, which in turn spits out an erroneous result (attachment 1).
Attachment 2 shows the expected results for the German (deutsch) file.
Attachment 3 shows that there are only 5 files in the folder. I couldn't find any hidden files.
How do I get around this? Why is the Loop Files / Read CSV operator reading this file?
Ah, it's funny, I was almost going to ask if you were on a Mac, because if you are and you don't use a regular expression to filter out the .DS_Store file, you're going to have issues. It's a hidden file that causes all sorts of challenges. Glad you sorted it out yourself.
Scott
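Since a normal folder view does not show dot-files, one way to confirm that something like .DS_Store is sitting in a corpus folder is to list the directory programmatically. The sketch below is a minimal Python illustration, not part of the RapidMiner process. The helper name hidden_entries and the temporary demo folder are assumptions made up for the example.

```python
import os
import shutil
import tempfile


def hidden_entries(directory):
    """Return the sorted names in `directory` that start with a dot."""
    return sorted(name for name in os.listdir(directory) if name.startswith("."))


# Demo on a throwaway folder with hypothetical file names.
corpus_dir = tempfile.mkdtemp()
try:
    for name in ("german.txt", "english.txt", ".DS_Store"):
        # Create empty placeholder files.
        open(os.path.join(corpus_dir, name), "w").close()
    found = hidden_entries(corpus_dir)
    print(found)
finally:
    # Remove the demo folder again.
    shutil.rmtree(corpus_dir)
```

Pointing hidden_entries at the real corpus directory would show whether any dot-files are present before the loop runs.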
Thanks Scott! This was particularly vexing but the process was educational. I learned a lot about each of the operators, and developed more confidence with RapidMiner, through this debugging process- my first serious one with RM.
My pleasure, @thapli_64. Enjoy the RapidMiner ride. It's a blast.
Scott