"Averaging cross-validation results"
Hi,
I have a general and a RapidMiner-specific question concerning cross-validation.
In the meta sample "07_EvolutionaryParameterOptimization" you are performing an evolutionary
parameter optimization for LibSVMLearner based on the performance results from a cross-validation.
Between the EvolutionaryParameterOptimization and XValidation operators you are using the operator
"IteratingPerformanceAverage". Is it recommended to always use it in order to get less biased results?
If so, what is a typical value for the parameter "number_of_validations"?
I would expect that the "IteratingPerformanceAverage" operator modifies the random seed. In the sample mentioned
above it's not clear to me how this happens. The operator "Process" uses the fixed value of "2001" for the parameter
"random_seed". The operator "XValidation" uses "-1" for "local_random_seed", i.e. the global settings. So, it looks to
me as if the same seed, namely 2001, is used for all iterations of the cross-validation. Wouldn't it make more sense to
use "-1" for "random_seed" in "Process" so that each validation run uses a different seed?
Regards,
Paul
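[For illustration, here is a minimal Python/scikit-learn sketch of what the operator chain described above does conceptually: an outer parameter search whose fitness is the average of several repeated cross-validations. This is not RapidMiner; the dataset, SVM parameters, the tiny grid search standing in for the evolutionary optimization, and the value 5 for number_of_validations are all assumptions made up for the example.]

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def averaged_cv_accuracy(C, gamma, n_validations=5):
    # Analogue of IteratingPerformanceAverage wrapped around XValidation:
    # run the 10-fold cross-validation n_validations times with different
    # shuffles and return the mean accuracy.
    scores = []
    for i in range(n_validations):
        cv = KFold(n_splits=10, shuffle=True, random_state=i)  # different split per repetition
        scores.append(cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=cv).mean())
    return float(np.mean(scores))

# crude stand-in for the evolutionary parameter optimization: a tiny grid search
best = max(((C, g) for C in (0.1, 1, 10) for g in (0.01, 0.1, 1)),
           key=lambda p: averaged_cv_accuracy(*p))
print("best (C, gamma):", best)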
Answers
Steffen
PS: I guess I have found the first topic for the wiki ;D
Maybe I got it wrong, but I think you meant here "-1" and not "2001", right? To my understanding, you would
always get the same pseudo-random numbers when you use a fixed value != -1. Using -1, on the other hand,
might be a problem when you want reproducible results, since a different seed is generated every time.
I think that the most suitable approach combined with the IteratingPerformanceAverage operator would be a mix
of both seed specifications: RapidMiner should perform the cross-validation 6-10 times with different seeds
which are, however, specified statically. The results would thus be reproducible each time you run your process,
while at the same time you would get an average over multiple seeds as the validation result, which is not
completely biased towards one specific seed.
Is there a way to tell RapidMiner to perform a cross-validation with a set of pre-defined seeds which have to
be defined manually?
Regards,
Paul
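[A minimal Python/scikit-learn sketch of the idea Paul proposes, i.e. repeating the cross-validation over a fixed, manually chosen list of seeds and averaging the results; the seed list, model, and parameters are illustrative assumptions, not anything RapidMiner provides out of the box.]

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
SEEDS = [11, 23, 42, 101, 2001, 31337]          # pre-defined seeds, fixed by hand

scores = []
for seed in SEEDS:
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)   # one cross-validation per seed
    scores.append(cross_val_score(SVC(C=1.0, gamma=0.1), X, y, cv=cv).mean())

# reproducible (the seeds never change) yet not tied to a single split
print("mean accuracy over %d seeds: %.4f +/- %.4f"
      % (len(SEEDS), np.mean(scores), np.std(scores)))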
First of all: -1 means that you use the global random generator, which (as specified in the preferences) is initialized with 2001.
Then:
The global random generator is initialized with 2001 every time a process is executed (by clicking the arrow button). The local generators, on the other hand, are initialized with the specified seed (!= -1) every time the operator where that seed has been specified is executed. Hence the results are always reproducible.
To use self-specified seeds for IteratingPerformanceAverage, you can pass a macro as the argument (RapidMiner macros, powerful thing, see the tutorial.pdf for more details), which replaces the seed with the number of the current iteration (1, 2, 3, ...).
I suggest continuing to play with the RapidMiner example processes to see what I mean. I hope I didn't increase your confusion.
Regarding Kohavi: here is the link to his Ph.D. thesis (http://ai.stanford.edu/~ronnyk/teza.pdf), where you can find a detailed discussion of the issue of validation. A long text, but fun to read.
hope this was helpful
Steffen
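[A plain-Python sketch that only mirrors the seeding behaviour Steffen describes above; the Operator class and all names are illustrative assumptions, not RapidMiner internals.]

import random

GLOBAL_SEED = 2001                      # the process-level random_seed

class Operator:
    def __init__(self, local_seed=-1):
        self.local_seed = local_seed    # -1 means "use the global generator"

    def execute(self, global_rng):
        if self.local_seed == -1:
            rng = global_rng            # shared state, advances across operator executions
        else:
            rng = random.Random(self.local_seed)  # re-initialized on every execution
        return rng.random()

def run_process(operators):
    global_rng = random.Random(GLOBAL_SEED)   # re-initialized on every process run
    return [op.execute(global_rng) for op in operators]

ops = [Operator(-1), Operator(-1), Operator(7)]
print(run_process(ops))   # two different draws from the global stream, plus one fixed draw
print(run_process(ops))   # identical output: everything is re-seeded per run, so results are reproducible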
Thank you for your help.
What I meant by "not reproducible results" was that using "-1" as the global and local seed would always
yield different random numbers due to the system time, which usually changes when a process is
executed multiple times.
Paul