"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"
wotsiznamiz
Member, Posts: 9, Contributor II
I am using the following code to text-mine a ~10,000 row Excel Record Set. The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.
I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different. For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID. Should I always be sorting on the label column? I would have thought that RapidMiner would produce the same results on inputs sorted in any manner. Is this a bug? Can I rely on my results after seeing this behavior?
<parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>
<parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>
Answers
Your process in general looks good to me (at least judging from the XML code alone).
Not necessarily. This depends entirely on the learning scheme. However, from a 2-fold cross-validation alone you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10 times 10-fold cross-validation, I would be more worried.
Cheers,
Ingo
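For anyone who wants to try the "10 times 10-fold" check outside of RapidMiner, here is a minimal Python sketch of the idea. It is not the RapidMiner process from the question: the helper names (`repeated_kfold`, `evaluate`, `nearest_mean_accuracy`), the toy data, and the "paid"/"unpaid" labels are all made up for illustration, and the toy learner just stands in for LibSVMLearner. Each repetition reshuffles the rows before cutting the folds, so the estimate should not hinge on how the input file happened to be sorted.

```python
import random
import statistics


def repeated_kfold(n_rows, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # reshuffle so folds do not follow the file's sort order
        folds = [order[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


def nearest_mean_accuracy(train, test):
    """Hypothetical stand-in learner: predict the label whose mean x is closest."""
    means = {}
    for label in {lab for lab, _ in train}:
        xs = [x for lab, x in train if lab == label]
        means[label] = sum(xs) / len(xs)
    hits = sum(1 for lab, x in test
               if min(means, key=lambda m: abs(means[m] - x)) == lab)
    return hits / len(test)


def evaluate(rows, train_and_score, k=10, repeats=10, seed=0):
    """Return mean and standard deviation of the score over all k * repeats folds."""
    scores = [train_and_score([rows[i] for i in tr], [rows[i] for i in te])
              for tr, te in repeated_kfold(len(rows), k, repeats, seed)]
    return statistics.mean(scores), statistics.stdev(scores)


# Toy data (assumption): two labels with overlapping one-dimensional features.
data_rng = random.Random(42)
rows = [("paid", data_rng.gauss(0.0, 1.0)) for _ in range(100)]
rows += [("unpaid", data_rng.gauss(2.0, 1.0)) for _ in range(100)]

sorted_by_label = sorted(rows, key=lambda r: r[0])
shuffled = rows[:]
data_rng.shuffle(shuffled)

for name, dataset in [("sorted by label", sorted_by_label), ("shuffled", shuffled)]:
    mean, sd = evaluate(dataset, nearest_mean_accuracy)
    print(f"{name}: accuracy {mean:.3f} +/- {sd:.3f} over 100 folds")
```

If the two orderings give averages that differ by much more than the reported spread, that would be a reason to look more closely at the learner or the sampling settings. If they land close together, the gap seen with a single 2-fold run was probably just noise from the split.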