Issue With Loop Files Operator

thapli_64thapli_64 MemberPosts:18Maven
edited November 2018 in Help

Hi all,

I'm new to the forum and RapidMiner so excuse any redundancies or lack of details.

I am working with the process from Chapter 14 (Robust Language Identification) of RapidMiner: Data Mining Use Cases and Business Analytics Applications, published by CRC Press. The process was downloaded from here: http://rapidminerbook.com/index.php/chapter-downloads-13-24/chapter-14/

Attachment 1 shows a screenshot of the process, and attachment 2 shows the Loop Files sub-process.

I successfully loaded the process and downloaded the language corpora from http://corpora.informatik.uni-leipzig.de/download.html

I changed the directory for the Loop Files operator to read from the folder where the corpora are stored. There are five files in the directory (German, English, French, Portuguese and Spanish). The Loop Files operator seems to read all of them successfully, but it also produces a 6th output which seems nonsensical. Attachment 3 shows the expected output for any language file (English in this case). Attachment 4 shows the nonsensical output. Attachment 5 shows the error thrown, presumably because of the nonsense output. Could someone tell me why this is happening and how to fix it? Thanks!

1.png 264.5K
2.png 292.5K
3.png 339.8K
4.png 237.5K
5.png 287K

Best Answer

  • thapli_64thapli_64 MemberPosts:18Maven
    Solution Accepted

    So, I was able to solve the issue (with some debugging help from a colleague; it's always good to have someone to talk things through with)! :D

    I set up regex filtering (.*\.txt$) in the Loop Files operator so that it only reads the desired files, in this case the 5 language files ending in .txt.

    There was, however, another error that cropped up after this was fixed: a duplicate attribute error for the 'text' attribute. This was caused by the 'select attributes and weights' parameter in the Data to Documents operator being checked with no value provided for it. It seems this was already the case in the process as downloaded and was not introduced through human error (or so I'm telling myself :P).
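For anyone who wants to see what that filter does outside RapidMiner, here is a minimal Python sketch. It only illustrates how the expression .*\.txt$ behaves on a directory listing; the file names are hypothetical stand-ins for the five corpus files, and this is not how the Loop Files operator is implemented internally.

```python
import re

# Hypothetical directory listing: five corpus files plus the hidden macOS file.
# The real corpus file names are not given in the thread.
filenames = [
    "german.txt",
    "english.txt",
    "french.txt",
    "portuguese.txt",
    "spanish.txt",
    ".DS_Store",
]

# The same expression that was entered in the filter parameter.
TXT_ONLY = re.compile(r".*\.txt$")


def filter_files(names, regex=TXT_ONLY):
    """Keep only the names that match the regular expression in full."""
    return [name for name in names if regex.fullmatch(name)]


selected = filter_files(filenames)
print(selected)
```

Anything that does not end in .txt, including .DS_Store, is dropped before it can reach the Read CSV step.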


Answers

  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    Hello @thapli_64, welcome to the community. First, it would be much easier if you could share your XML process in this thread (see "Helpful Reminders" on the right when you reply), as then we can truly replicate what you are doing. Second, I just looked at that process from RapidMinerBook, and the Loop Files operator it uses has been deprecated since the last release:

    [Screenshot: the deprecated Loop Files operator, on the left]
    [Screenshot: the new Loop Files operator]

    So I would suggest moving the operators from inside the old "Loop Files" into a new "Loop Files", rewiring them there, and trying again. Then paste your XML here and we will see what you have.


    Scott

  • thapli_64thapli_64 MemberPosts:18Maven

    Scott,

    Thanks for the welcome, reply and advice! :) I had already tried what you mentioned, and it threw the same error. I have attached screenshots. See the XML below:

    Thanks!

    Racchit.







    <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">

    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">

    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="242">

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn
    The error points to the fact that the attribute called "language" is not in the data set after it has been processed by the Process Documents from Data operator. Double-check that the "language" attribute is in the output before it reaches the Set Role operator.
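As a rough illustration of that check, here is a small Python sketch. The dictionaries are hypothetical stand-ins for the per-file results (attribute name mapped to values), not RapidMiner objects, and the attribute name "att1" is made up for the unexpected input.

```python
# Attributes that must be present before a label role can be assigned.
REQUIRED = {"text", "language"}


def missing_attributes(example_set, required=REQUIRED):
    """Return the required attribute names that are absent, sorted."""
    return sorted(required - set(example_set))


# Hypothetical per-file results: one well-formed corpus and one stray input.
results = {
    "german.txt": {"text": ["Guten Tag"], "language": ["german"]},
    ".DS_Store": {"att1": ["?"]},
}

# Collect only the inputs that lack at least one required attribute.
problems = {
    name: missing_attributes(columns)
    for name, columns in results.items()
    if missing_attributes(columns)
}
print(problems)
```

Running a check like this per iteration points straight at the input that lacks the attribute, which is the same thing a breakpoint before Set Role shows.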
  • thapli_64thapli_64 MemberPosts:18Maven

    Thomas,

    It is actually in the data from the 5 expected data files (see the German corpus screenshot attached). The error is being thrown by the 6th, unexpected input. I need to figure out how to get rid of that. It shouldn't be there.

  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    Hello @thapli_64, OK, thanks for that (next time, please use the tool to insert your XML :)). The best way to debug this is to use "breakpoints" and see what the data looks like right before the operator that is causing the trouble:

    [Screenshot: adding a breakpoint]
    [Screenshot: breakpoint added]

    My guess is the same as @Thomas_Ott's: you will see that on the sixth iteration the attribute "language" is not there.


    Scott

  • thapli_64thapli_64 MemberPosts:18Maven

    Scott,

    Thanks again for the advice. I'm learning as I go! :)

    Yes, I have been using breakpoints to figure out what was happening, and that's how I discovered the extra data being read in. You are right, the error is indeed being thrown because the language attribute is missing at that point. I had discovered this earlier but didn't share it because I felt the root cause was the 6th dataset being read in, which shouldn't be there in the first place. Am I wrong to assume that? Is it supposed to be there?

    See the screenshots attached. As you can see, I added a breakpoint to see what's being fed into the Set Role (label) operator. The results screenshot shows that the text and language attributes are there for the 5 expected documents but missing for the 6th one. The Loop Files sub-process is clearly set up to handle the actual language corpus files, but not this error case, so it seems the error case shouldn't be there at all. Please correct me if I'm wrong.

  • thapli_64thapli_64 MemberPosts:18Maven

    So @sgenzer, the culprit seems to be a '.DS_Store' file being read in from the directory by the Read CSV operator, which in turn spits out an erroneous result (attachment 1).

    Attachment 2 shows the expected results for the German (deutsch) file.

    Attachment 3 shows that there are only 5 files visible in the folder. I couldn't find any hidden files.

    How do I get around this? Why is the Loop Files/Read CSV operator reading this file?

    1.png 235.6K
    2.png 269.3K
    3.png 159.2K
  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    Ah, it's funny, I was almost going to ask if you were on a Mac, because if you are and don't use a regular expression to filter out the .DS_Store file, you're going to have issues. It's a hidden file that causes all sorts of challenges. Glad you sorted it out yourself. :)


    Scott
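If a folder looks like it only holds the expected files, a quick way to confirm whether hidden entries such as .DS_Store are present is to list everything, dotfiles included. A minimal Python sketch, using a temporary folder with made-up file names in place of the real corpus directory:

```python
import os
import tempfile


def list_hidden_files(directory):
    """Return the names of entries in `directory` that start with a dot."""
    return sorted(name for name in os.listdir(directory) if name.startswith("."))


# Demo on a throwaway folder; swap in the real corpus directory to check it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("german.txt", ".DS_Store"):
        # Create empty placeholder files.
        open(os.path.join(tmp, name), "w").close()
    hidden = list_hidden_files(tmp)
    print(hidden)
```

os.listdir returns dotfiles along with everything else, so this surfaces files that a file browser hides by default.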

  • thapli_64thapli_64 MemberPosts:18Maven

    Thanks, Scott! This was particularly vexing, but the process was educational. Through this debugging process, my first serious one with RM, I learned a lot about each of the operators and became more confident with RapidMiner.

  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM ModeratorPosts:2,959Community Manager

    My pleasure, @thapli_64. Enjoy the RapidMiner ride. It's a blast. :)


    Scott
