"Unexpected Regression Performance Using Cross-Validation"
Hi all,
I tried to do SVM regression using LibSVM. When I measure the performance of this learner without cross-validation (i.e. using the whole data set as the training set), it gives the following results:
absolute_error: 8618.717 +/- 19520.661
relative_error: 102.25% +/- 631.07%
correlation: 0.873
prediction_average: 35706.987 +/- 42654.440
However, when I add 10-fold cross-validation to the workflow, I get really different results:
absolute_error: 28596.955 +/- 3938.106 (mikro: 28591.849 +/- 30064.573)
relative_error: 395.80% +/- 192.38% (mikro: 395.36% +/- 1,329.27%)
correlation: 0.320 +/- 0.126 (mikro: 0.303)
prediction_average: 35707.687 +/- 5282.379 (mikro: 35706.987 +/- 42654.440)
Is it normal to face this kind of situation, especially when using SVM regression?
Is there any way to improve this performance?
FYI, the dataset consists of around 500 instances with 80 attributes. Originally it had only 6 attributes: two of them are textual, which I converted to word vectors (TF-IDF), and the rest are nominal, which I converted into binary attributes.
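Roughly, the preprocessing corresponds to something like the sketch below, written in Python/scikit-learn rather than as the actual RapidMiner workflow; the file name and column names ("title", "description", "category", "region", "price") are made up for illustration.

# Rough preprocessing sketch (illustrative only, hypothetical column names).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("data.csv")  # ~500 rows, 6 original attributes

preprocess = ColumnTransformer([
    # two textual attributes -> TF-IDF word vectors
    ("tfidf_title", TfidfVectorizer(), "title"),
    ("tfidf_desc", TfidfVectorizer(), "description"),
    # remaining nominal attributes -> binary (one-hot) columns
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category", "region"]),
])

X = preprocess.fit_transform(df.drop(columns="price"))
y = df["price"]  # hypothetical numeric label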
For the learner, I use epsilon-SVR with gamma = 1.0 and C = 100000.0. Those parameters are the result of an optimization process.
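For reference, the two evaluations above amount to something like the following sketch, again in Python/scikit-learn (which wraps LibSVM) rather than the RapidMiner operators; epsilon = 0.1 is an assumption, and X and y are the preprocessed data from the sketch above.

# Training-set ("resubstitution") error vs. 10-fold cross-validation error.
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error

svr = SVR(kernel="rbf", gamma=1.0, C=100000.0, epsilon=0.1)

# "No cross-validation": train and evaluate on the same data (optimistic).
svr.fit(X, y)
train_mae = mean_absolute_error(y, svr.predict(X))

# 10-fold cross-validation: evaluate on held-out folds (realistic).
cv_mae = -cross_val_score(svr, X, y, cv=10, scoring="neg_mean_absolute_error")

print(f"training-set MAE: {train_mae:.1f}")
print(f"10-fold CV MAE:   {cv_mae.mean():.1f} +/- {cv_mae.std():.1f}")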
Thanks in advance.
Cheers,
Ikhwan
This is the XML file for the cross-validation:
Answers
A question: when you say... how did you do the optimisation, and on what data? The reason I ask is that overtraining with SVMs is a well-known pitfall, and this issue keeps popping up.
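If the parameter optimisation was run on the same data that the cross-validation later evaluates, the estimate will be optimistic. One way around that is to nest the parameter search inside the cross-validation, so the search never sees the held-out test fold. Here is a rough sketch in Python/scikit-learn (not a RapidMiner workflow); the parameter grid is only an example, and X and y are assumed to be the preprocessed features and numeric label.

# Nested cross-validation: inner grid search for C and gamma,
# outer 10-fold CV for an unbiased performance estimate.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {
    "C": [1, 10, 100, 1000, 10000],
    "gamma": [0.001, 0.01, 0.1, 1.0],
}
inner_search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                            scoring="neg_mean_absolute_error", cv=5)

outer_mae = -cross_val_score(inner_search, X, y, cv=10,
                             scoring="neg_mean_absolute_error")
print(f"nested 10-fold CV MAE: {outer_mae.mean():.1f} +/- {outer_mae.std():.1f}")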
Do you have any suggestions for this situation? Should I split my data, and if so, how much should I set aside for the optimization?
For the optimization, I just followed a workflow discussed previously in the forum. This is the XML file: