"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"

wotsiznamiz (Member, Contributor II, Posts: 9)
edited May 2019 in Help
I am using the following code to text-mine a ~10,000-row Excel record set. The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.

I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different. For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID. Should I always be sorting on the label column? I would have thought that RapidMiner would produce the same results on inputs sorted in any manner. Is this a bug? Can I rely on my results after seeing this behavior?



<parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>



























<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>













































<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>

















Answers

  • IngoRM (Administrator, Moderator, Employee, RM Founder, Posts: 1,751)
    Hi,

    Your process looks good to me in general (at least from looking at the XML code alone ;))

    "I would have thought that RapidMiner would produce the same results on inputs sorted in any manner."
    Not necessarily. This depends entirely on the learning scheme. However, with a 2-fold cross-validation alone you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10 times 10-fold cross-validation, I would be more worried ;)

    Cheers,
    Ingo
  • wotsiznamiz (Member, Contributor II, Posts: 9)
    THX!
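
To make the point about 2-fold versus repeated 10-fold cross-validation concrete, here is a minimal Python sketch. It is not the RapidMiner process from the question, and it does not use LibSVMLearner. It assumes a toy nearest-centroid classifier on synthetic one-dimensional data, and it assumes that the unshuffled validation cuts the rows into contiguous blocks; the thread does not say which sampling type the original process uses. All function names (`make_data`, `cross_validate`, `repeated_cv`) are made up for this illustration.

```python
import random
import statistics


def make_data(n=200, seed=0):
    """Synthetic (feature, label) rows; labels alternate, standing in for 'ID order'."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        # Class 0 is centred at 0.0, class 1 at 2.0, so the classes overlap.
        rows.append((rng.gauss(2.0 * label, 1.0), label))
    return rows


def nearest_centroid_accuracy(train, test):
    """Fit a 1-D nearest-centroid classifier on train, return accuracy on test."""
    centroids = {}
    for label in {y for _, y in train}:
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    correct = 0
    for x, y in test:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == y
    return correct / len(test)


def cross_validate(rows, k, rng=None):
    """k-fold CV; contiguous folds in row order if rng is None, shuffled otherwise."""
    rows = list(rows)
    if rng is not None:
        rng.shuffle(rows)
    n = len(rows)
    scores = []
    for fold in range(k):
        start, end = fold * n // k, (fold + 1) * n // k
        test = rows[start:end]
        train = rows[:start] + rows[end:]
        scores.append(nearest_centroid_accuracy(train, test))
    return statistics.mean(scores)


def repeated_cv(rows, k, repeats, seed):
    """Average of `repeats` independent shuffled k-fold runs."""
    rng = random.Random(seed)
    return statistics.mean([cross_validate(rows, k, rng) for _ in range(repeats)])


by_id = make_data()
by_label = sorted(by_id, key=lambda row: row[1])

print("2-fold, contiguous, ID order:   ", cross_validate(by_id, 2))
print("2-fold, contiguous, label order:", cross_validate(by_label, 2))
print("10x10-fold, shuffled, ID order:   ", repeated_cv(by_id, 10, 10, seed=1))
print("10x10-fold, shuffled, label order:", repeated_cv(by_label, 10, 10, seed=1))
```

In this toy setup the label-sorted, contiguous 2-fold run scores 0.0, because each training half contains only one class. That is the opposite direction from what the original poster observed, so it does not explain those numbers; it only shows that a single 2-fold estimate can depend heavily on row order, while shuffled and repeated 10-fold runs give nearly the same estimate for both orderings.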