"Why Matlab and Rapidminer give different results for SVM optimization"
Hi,
I'm using both Matlab and Rapidminer to do SVM classification with parameter optimization. The data set has 5000 observations, 36 integer attributes and one binominal label. I expected similar results, yet they turned out to be different. The C statistic from Matlab is 0.672 while that from Rapidminer is 0.598. Also, they give different choices of optimal parameters for C and gamma. Rapidminer gives 0.25 and 0.25 respectively, and Matlab gives 4 and 0.25. I would greatly appreciate your help!
Below is the process code:
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve donation_sarah5" width="90" x="45" y="30">
<operator activated="true" class="normalize" compatibility="5.3.008" expanded="true" height="94" name="Normalize" width="90" x="45" y="120"/>
<operator activated="true" class="split_data" compatibility="5.3.008" expanded="true" height="94" name="Split Data" width="90" x="179" y="255">
<operator activated="true" class="optimize_parameters_grid" compatibility="5.3.008" expanded="true" height="112" name="Optimize Parameters (Grid)" width="90" x="246" y="30">
<operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
A cross-validation evaluating a decision tree model.
<operator activated="true" class="support_vector_machine" compatibility="5.3.008" expanded="true" height="112" name="SVM" width="90" x="112" y="30">
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
<operator activated="true" class="performance_binominal_classification" compatibility="5.3.008" expanded="true" height="76" name="Performance" width="90" x="226" y="30">
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="380" y="165">
<operator activated="true" class="performance_binominal_classification" compatibility="5.3.008" expanded="true" height="76" name="Performance (2)" width="90" x="581" y="255">
<connect from_op="Performance (2)" from_port="example set" to_port="result 3"/>
And here is my Matlab code:
clear all;
load donation;
Y = donation(:,40);
X = donation(:,2:36);
B = donation(:,40)> prctile(Y,80);
a_logical = logical( B );
B1 = a_logical + 0;
%check what percentage of donation are from the 20% people
ptg = sum(Y.*B1)/sum(Y)*100;
disp('percentage of donation that are from the top 20% people');
disp(ptg);
%randomly split the data into 80% and 20%
A = [X B1];
numA = size(A, 1);
trainsize = floor(0.8 * numA);
testsize = numA - trainsize;
ridx = randperm(numA);
traindata = A(ridx(1:trainsize),:);
testdata = A(ridx(trainsize + 1 : end),:);
Xtestdata = testdata(:,1:35);
B1testdata = testdata(:,36);
Xtraindata = traindata(:,1:35);
B1traindata = traindata(:,36);
n = size(B1traindata,1);
%ten-fold cross-validation over a 3x3 grid of parameter values
%Gaussian Radial Basis Function kernel
L = [1/4 1 4];
AUCtrain = [];
for j = L        %j: RBF_Sigma (kernel width)
    for m = L    %m: BoxConstraint (C)
        %note: new random folds are drawn for every (j,m) pair
        indices = crossvalind('Kfold', n, 10);
        Bp = [];    %pooled predicted class labels
        Br = [];    %pooled true class labels
        for i = 1:10
            test = (indices == i); train = ~test;
            xtst = Xtraindata(test,:);
            ytst = B1traindata(test,:);
            xtr = Xtraindata(train,:);
            ytr = B1traindata(train,:);
            SVMStruct = svmtrain(xtr,ytr,'kernel_function','rbf','RBF_Sigma', j ,'BoxConstraint', m);
            Group = svmclassify(SVMStruct,xtst);
            Bp = [Bp; Group];
            Br = [Br; ytst];
        end
        %AUC is computed from the predicted class labels, not from scores
        [X1,Y1,T,AUCij] = perfcurve(Br,Bp,1);
        AUCtrain = [AUCtrain;AUCij];
    end
end
disp ('SVM_C statistics on the training data with ten-fold cross validation');
disp (AUCtrain');
%use the optimal parameter for the testdata
SVMStruct = svmtrain(Xtraindata,B1traindata,'kernel_function','rbf','RBF_Sigma', 4 ,'BoxConstraint', 1/4);
Group = svmclassify(SVMStruct,Xtestdata);
[X1,Y1,T,AUCtest] = perfcurve(B1testdata,Group,1);
disp ('SVM_C statistics on the test data after ten-fold cross validation');
disp (AUCtest);
Best,
Sarah
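One thing that may be worth checking in the setup above: the Matlab code tunes 'RBF_Sigma', while the RapidMiner SVM operator tunes a kernel gamma. Assuming svmtrain's RBF kernel is exp(-||x-z||^2 / (2*sigma^2)) and RapidMiner's radial kernel is exp(-gamma*||x-z||^2) (this is an assumption based on the two tools' documentation, so please verify it for your versions), the same grid values [1/4 1 4] describe different kernels in the two tools. Below is a minimal sketch of the conversion; the points x and z are made up for illustration.

```matlab
% Grid of RBF_Sigma values used in the Matlab code above
sigma = [1/4 1 4];

% Equivalent gamma values under the assumed kernel definitions:
%   exp(-d^2 / (2*sigma^2))  ==  exp(-gamma * d^2)   =>   gamma = 1 / (2*sigma^2)
gamma = 1 ./ (2 * sigma.^2);
disp([sigma; gamma]);

% Sanity check on two hypothetical points: both forms give the same kernel value
x = [1 2 3];
z = [2 0 3];
d2 = sum((x - z).^2);                  % squared Euclidean distance
k_sigma = exp(-d2 / (2 * sigma(3)^2)); % Matlab-style parameterisation
k_gamma = exp(-gamma(3) * d2);         % gamma-style parameterisation
```

Under these assumptions a sigma grid of [1/4 1 4] corresponds to a gamma grid of [8 0.5 0.03125], so "gamma = 0.25" in one tool and "sigma = 0.25" in the other would not be the same model.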
Answers
I like your question, and I will probably be asking a ton of such questions myself soon.
However, I program in Mathematica, and I think that for this class of algorithms you CANNOT expect the same results from different implementations.
CORRECT ME IF I AM WRONG: At the heart of the simplest SVM there is a quadratic problem with linear inequality constraints, which might have multiple solutions! One implementation hits one and another implementation hits a different one.
If you have very symmetric, separable data, i.e. data that can be separated by linear forms, there will be several such separations. Take any crystal-like data set that is symmetric and looks the same from multiple views.
Dara
However, I still can't convince myself that the classification performance, measured here by the C statistic, can differ that much between these two software packages. I'm wondering if anyone knows how I can modify the process in Rapidminer to make the results/algorithm comparable to those in Matlab.
It concerns me a lot because I'm using SVM for a research project, and I would like to know whether the different results are due to a wrong set-up in Rapidminer.
Thanks,
Sarah
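One possible contributor to a gap in the C statistic (an observation about the Matlab code in the question, not a confirmed diagnosis): svmclassify returns hard class labels, and perfcurve is then fed those labels instead of continuous scores. An AUC computed from 0/1 predictions reduces to (sensitivity + specificity) / 2, which can differ noticeably from an AUC computed from confidences or decision values. The sketch below illustrates this with hypothetical toy data and a hand-written pairwise AUC (pairAUC is a made-up helper, not a toolbox function).

```matlab
% Hypothetical toy data: true labels and continuous decision values
y      = [1 1 1 0 0 0 0 0]';                   % 1 = positive class
scores = [0.9 0.6 0.3 0.7 0.4 0.2 0.1 0.05]';  % made-up scores
labels = double(scores > 0.5);                 % hard 0/1 predictions

% AUC as the probability that a random positive outranks a random negative,
% counting ties as one half (equals the area under the ROC curve)
pairAUC = @(p, q) mean(mean(bsxfun(@gt, p, q') + 0.5 * bsxfun(@eq, p, q')));

auc_scores = pairAUC(scores(y == 1), scores(y == 0));  % uses the full ranking
auc_labels = pairAUC(labels(y == 1), labels(y == 0));  % uses only 0/1 predictions

% With hard labels the AUC collapses to (sensitivity + specificity) / 2
sens = mean(labels(y == 1) == 1);
spec = mean(labels(y == 0) == 0);
fprintf('AUC from scores: %.4f, AUC from labels: %.4f\n', auc_scores, auc_labels);
```

If the RapidMiner performance operator computes AUC from the model's confidences (worth verifying in its documentation), the two numbers in the question would not be measuring the same thing.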
I will be watching this thread carefully, because this will also be an important issue for us in the future. For example, I use Mathematica for the research end of our work and do not like to see mismatched results like these.
But I have fooled myself a couple of times thinking the results should be the same, and time and again there was a subtle difference between the two usages of the same algorithm that made a huge difference.
D
In Mathematica, some of the solvers for equations and inequalities use a random seed to start iterating towards a solution.
The problem is that if you change the seed, the solutions vary!
I just ran into that while solving a large system of equations. I am sure Matlab is similar.
So if the algorithm contains a SOLVE procedure of some kind, different software might use exactly the same code but a different random seed, and thus arrive at another solution.
D
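On the random-seed point, the Matlab code in the question does draw random numbers: randperm for the 80/20 split and crossvalind for the folds, the latter redrawn for every parameter pair. A minimal sketch of fixing the seed and reusing one set of folds follows; the sizes n and k and the seed 42 are arbitrary choices for illustration, and the folds are built with randperm rather than crossvalind so the snippet needs no toolbox.

```matlab
n = 20;   % hypothetical number of training observations
k = 5;    % hypothetical number of folds

% Fix the seed so the fold assignment is reproducible across runs
rng(42);
folds = mod(randperm(n), k) + 1;        % fold index 1..k for each observation

% Re-seeding with the same value reproduces exactly the same assignment
rng(42);
folds_again = mod(randperm(n), k) + 1;
same = isequal(folds, folds_again);

% Every fold receives n/k observations
counts = accumarray(folds', 1)';
disp(counts);
```

Drawing the folds once, before the grid loops, means every parameter combination is scored on the same partitions, so differences in AUC reflect the parameters and not the luck of the split.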
Best regards,
Marius
I later found out that it could be due to the way Matlab defines the parameter C. If C is a scalar, it is automatically rescaled by N/(2*N1) for the data points of group one and by N/(2*N2) for the data points of group two, where N1 is the number of elements in group one, N2 is the number of elements in group two, and N = N1 + N2. According to Matlab, this rescaling is done to take unbalanced groups into account.
I'm wondering if I could rescale C the same way in Rapidminer. It appears to me that there's an EqualLabelWeighting process that may work. By any chance, do you know how it can be applied to this situation, if it's the right process to look into?
Thanks,
Sarah
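To make the rescaling described above concrete, here is a small worked example. The group sizes are hypothetical (an 80/20 class balance in a training set of 4000), and which class counts as "group one" is an assumption.

```matlab
C  = 4;       % scalar box constraint passed by the user
N1 = 3200;    % hypothetical size of group one (majority class)
N2 = 800;     % hypothetical size of group two (minority class)
N  = N1 + N2;

% Per-group box constraints after the rescaling described above
C1 = C * N / (2 * N1);   % applied to the points of group one
C2 = C * N / (2 * N2);   % applied to the points of group two

% The total penalty weight is the same for both groups
w1 = N1 * C1;
w2 = N2 * C2;
fprintf('C1 = %.2f, C2 = %.2f, N1*C1 = %g, N2*C2 = %g\n', C1, C2, w1, w2);
```

With these numbers the effective C is 2.5 for the majority class and 10 for the minority class, so a single "C = 4" in Matlab does not correspond to a single C = 4 applied to all points in another tool.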
Best regards,
Marius
Thanks,
Sarah