"Different results for X-Validation (libSVM) in version 4.6

alexx Member Posts: 12 Contributor II
edited June 2019 in Help
Dear community,

I am upgrading from rapidminer version 4.6 to 5 and I'm having some difficulties that I hope maybe someone can help me with.

I am using a data set of 40 examples (rows) with 73 attributes (72 numerical + 1 numerical label). If anyone wants to reproduce the steps, here is the data in Excel format: http://jump.fm/PFMGS.

In rapidminer 4.6 I start the wizard, open x-validation with svm, import my data, and start the process. The result is 100% accuracy. Here are some screenshots: http://img696.imageshack.us/img696/2939/rapidminer4results.png

I tried to reconstruct this in rapidminer 5:
- I imported the data into my repository and created a new process.
- Since the imported data was marked nominal by RM, I use the Nominal to Numerical converter on the complete data set.
- The output goes into the X-Validation module (default parameters as in RM 4.6). From there the ave output goes to the results.
- Inside the Validation module it looks like this:
-- in the training module there is the libSVM module (C-SVC, rbf kernel, gamma=0, C=32, epsilon=0.0010, same as in RM 4.6)
-- in the testing module I use Apply Model and then the Performance module (same default values as in RM 4.6)

Executing the process results in 90% accuracy. Screenshots: http://img42.imageshack.us/img42/9720/rapidminer5results.png

Did I make a mistake? Thanks for your help.
Alex

Answers

  • harri678 Member Posts: 34 Maven
    Hello Alex,

    did you ever try to set gamma != 0? If I understand correctly, gamma=0 means that it will effectively be set to 1 / num_attributes. I would recommend fixing it to the same value in both versions (1/72) to get comparable results. I also noticed a difference in the random_seed parameter of the X-Validation operator, which could affect the process.
    I'm curious if this changes anything!

    Just my two cents ;)

    Greetings, Harald
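A minimal Python sketch of the gamma point above, assuming (as Harald describes) that gamma=0 is treated as 1 / num_attributes. The function name `effective_gamma` is made up for illustration and is not part of RapidMiner or libSVM.

```python
# Sketch only: shows which gamma value is effectively used when the
# parameter is left at 0, under the assumption stated in the post above.

def effective_gamma(gamma: float, num_attributes: int) -> float:
    """Return the gamma actually used; 0 is treated as 1 / num_attributes."""
    if num_attributes <= 0:
        raise ValueError("num_attributes must be positive")
    return gamma if gamma != 0 else 1.0 / num_attributes


NUM_ATTRIBUTES = 72  # regular attributes in the data set from the first post
default_gamma = effective_gamma(0, NUM_ATTRIBUTES)
print(f"gamma=0 behaves like gamma={default_gamma:.6f}")
```

Setting this value explicitly in both versions removes one possible source of difference between the two runs.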
  • alexx Member Posts: 12 Contributor II
    Thanks for your reply, Harald.

    I tried different values for gamma and played around with the random seed settings. The accuracy results from versions 4.6 and 5 still differ a lot for the same input. Does anyone know why?
  • Stefan_E Member Posts: 53 Guru
    Alex,

    if you skip the XValidation and just build a single model, does it differ? If so, that would implicate the learner (as opposed to the applier).

    Stefan
  • dragoljub Member Posts: 241 Maven
    Cross validation results will always be slightly different since you are randomly splitting the training set into subsets for training and validation. Unless you can ensure that the cross validation splitting is performed exactly the same way in each run, you should expect slightly different results. If you notice a huge difference, there may be something wrong.
  • alexx Member Posts: 12 Contributor II
    dragoljub,

    thanks for your input. As I understand it, if I use the same random seed parameters in both versions, I should get the same results. Anyway, the results differ not just slightly (100% in RM 4 vs 90% in RM 5).
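The seed discussion above can be illustrated with a small Python sketch. It is not RapidMiner's splitting code; the helper `kfold_indices` and the seed values are hypothetical. It only shows that a seeded shuffle reproduces the same folds within one implementation. Two different implementations (or versions) may still split differently even with the same seed.

```python
import random


def kfold_indices(n_examples, k, seed):
    """Shuffle example indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


# 40 examples and 10 folds, as in the data set discussed in this thread.
folds_a = kfold_indices(40, 10, seed=1)
folds_b = kfold_indices(40, 10, seed=1)
folds_c = kfold_indices(40, 10, seed=2)

print("same seed gives same folds:", folds_a == folds_b)
print("different seed gives same folds:", folds_a == folds_c)
```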
  • haddock Member Posts: 849 Maven
    Hi Folks,

    If you import the xls and run the following you'll see what the problem is ....

    <connect from_op="Retrieve" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="XValidation" to_port="training"/>

    The operator "Nominal to Numerical" has replaced each attribute column with the values 0-39 :( The fact that it still produces 90% satisfies our gullibility.

    PS Rather ironically, if you replace the offending operator with a "Guess Types" operator all is well, like this....

    <connect from_op="Retrieve" from_port="output" to_op="Guess Types" to_port="example set input"/>
  • alexx Member Posts: 12 Contributor II
    thanks Haddock for finding the problem.

    Is there any way I can fix the "nominal to numerical" operator in rm5? Or any other workaround?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    what exactly is the problem with the Nominal to Numerical operator? Its behavior is exactly as it was in 4.x if you don't change the default parameter settings. Please remember that in 4.x you had to wrap the Nominal to Numerical operator in an AttributeSubsetPreprocessing operator to restrict the attributes it was working on. You can now either use the equivalent Select Subset operator or simply use the built-in filter.

    Greetings,
    Sebastian
  • alexx Member Posts: 12 Contributor II
    Sebastian,

    thank you for your answer. I imported values from a csv file that looked like this.
    2.3647619e+000,9.5738476e-001,9.6855298e-001,...
    Unfortunately the real values were recognized as nominal so I wanted to use the nominal to numerical operator to mark them as numerical. But that operator simply converted the values to numerical 1, 2, 3 and so on. So I guess I just misunderstood the intention of the operator. I needed a 'real' converter.

    My problem still remains. I cannot import the data as numerical, but at least I could figure out why. My data is in scientific notation (Matlab standard). A value with the exp != 000 is correctly imported as numerical (real), whereas a value with the exponent == 000 is imported as nominal.

    so
    2.6855298e-001
    is correctly imported as numerical

    and
    2.3647619e+000
    is incorrectly imported as nominal.

    I would really appreciate it if anyone has a solution for me. Again, RM4 correctly imports those values as numerical :(
    Thanks!
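To make the misunderstanding above concrete, here is a hedged Python sketch contrasting the two operations. `index_encode` mimics what the posts describe for Nominal to Numerical (each distinct string is replaced by an integer code), and `parse_numbers` is the "real" conversion that was wanted. Both helper names and the zero-based coding are assumptions for illustration, not RapidMiner code.

```python
# Sample values in the Matlab-style scientific notation quoted in the post above.
raw_column = ["2.3647619e+000", "9.5738476e-001", "2.3647619e+000"]


def index_encode(values):
    """Replace each distinct string by an integer code (order of first appearance)."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def parse_numbers(values):
    """Interpret each string as a real number; three-digit exponents parse fine."""
    return [float(v) for v in values]


encoded = index_encode(raw_column)   # integer codes, numeric content is lost
parsed = parse_numbers(raw_column)   # the actual real values

print("index encoded:", encoded)
print("parsed:", parsed)
```

The encoded column keeps only the identity of each string, which would explain why a learner trained on it behaves differently from one trained on the parsed values.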
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    please replace the Nominal to Numerical operator with the Parse Numbers operator. That will help you solve your problem.

    Greetings,
    Sebastian
  • alexx Member Posts: 12 Contributor II
    Sebastian,

    thanks for your help. Unfortunately that did not solve the problem. The Parse Numbers operator still labels numbers like 2.3647619e+000 as nominal, but I want them to be numerical/real.

    See screenshot: http://img684.imageshack.us/img684/7505/nominalnumericalproblem.png

    Any idea how I can achieve that?
  • haddock Member Posts: 849 Maven
    Hi Folks,


    http://rapid-i.com/rapidforum/index.php/topic,1791.msg7012.html#msg7012

    Using the solution so darkly hidden therein on this csv data..

    2.6855298e-001,2.3647619e+000
    2.3647619e+000,2.6855298e-001

    I find that the numbers are read as reals by the following code...
  • alexx Member Posts: 12 Contributor II
    Haddock,

    thank you for your help. Your solution works partially... I'm getting weird behavior here:

    In your example, the values are labeled as real in the results workspace (screenshot: http://img140.imageshack.us/img140/6470/88436391.png)

    but I need to work with the values in the process. THERE the same values in that example are labeled nominal (screenshot: http://img179.imageshack.us/img179/6517/18861165.png)

    So in the process I cannot use the values as input for libSVM etc. I really don't understand this, maybe someone can explain/post a solution?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Alexx,

    the reason is quite simple: everything is fine and this is just the way "Guess Types" behaves. It guesses the types from the actual data (which is not available during the meta data transformation) and not from the meta data. That means that the meta data cannot be correctly updated during process design. I would recommend running Haddock's process and storing the data in the RM repository. There, you will easily see that the type is correct. Then just use the data from the repository and feed it into the learner, and everything will be fine.

    Alternatively, you could simply feed the data into the LibSVM after the transformation process. It will complain, but you can disable those complaints in the preferences: simply activate "general.capabilities.warn". However, the best way is to use the repository here.

    Cheers,
    Ingo
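The point that types are guessed from the data itself can be sketched in Python. This is only an illustration of the idea, not the "Guess Types" implementation. The helper `guess_column_type`, the column names, and the nominal example column are hypothetical.

```python
def guess_column_type(values):
    """Return 'real' if every value parses as a float, otherwise 'nominal'."""
    try:
        for v in values:
            float(v)
    except ValueError:
        return "nominal"
    return "real"


# The two numeric columns use the values from Haddock's csv sample;
# the third column is an invented nominal column for contrast.
columns = {
    "att1": ["2.6855298e-001", "2.3647619e+000"],
    "att2": ["2.3647619e+000", "2.6855298e-001"],
    "example_nominal": ["yes", "no"],
}

guessed = {name: guess_column_type(vals) for name, vals in columns.items()}
print(guessed)
```

The guess needs the actual values, so it can only be made once the data has been read. That matches the explanation above of why the design-time meta data still shows the old type.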
  • alexx Member Posts: 12 Contributor II
    Thank you for your help. By disabling the complaints I could get it to work the way I wanted to.

    An import wizard like the one in RM4 would make this a lot easier. I hope something like that will find its way into the new release. I'm very much looking forward to that ;)
  • dragoljub Member Posts: 241 Maven
    If you want to avoid the headache you can just have MATLAB generate CSV files in decimal, without using the scientific notation. RM should be able to handle the scientific notation, but I think that you should have no problem reading your results as decimal.

    -Gagi
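If changing the MATLAB export is not convenient, the same effect could be had by post-processing the CSV. The following Python sketch is an alternative to Gagi's suggestion, not something from the thread. It rewrites scientific notation as plain decimals and uses only the standard library. The function names are made up for illustration.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


def to_plain_decimal(text):
    """Rewrite e.g. '2.3647619e+000' as '2.3647619'; leave non-numbers untouched."""
    try:
        return format(Decimal(text), "f")
    except InvalidOperation:
        return text


def convert_csv(source_text):
    """Return the CSV text with every numeric cell written in plain decimal form."""
    reader = csv.reader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([to_plain_decimal(cell) for cell in row])
    return out.getvalue()


# The two-line csv sample from Haddock's post.
sample = "2.6855298e-001,2.3647619e+000\n2.3647619e+000,2.6855298e-001\n"
converted = convert_csv(sample)
print(converted)
```

Using `Decimal` instead of `float` keeps the digits exactly as written in the file, so no rounding is introduced by the conversion.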
  • alexx Member Posts: 12 Contributor II
    Gagi,
    sure I could do that. But IMO rapidminer should have no problem reading scientific notation. I'll stick with the headache solution until there is an improved import utility in RM.
    Thanks to everyone for helping me out with that one.