"Some bugs connected with data sampling"
When trying to sample data from a large data set, I stumbled over several errors connected with data sampling.
- 1. Changing rapidminer.general.randomseed = -1 in Tools – Preferences does not have the desired effect: reopening the Preferences window shows that rapidminer.general.randomseed is set to 1 instead of -1. When running ExampleSource with sample_ratio=0.5, you always get the same sequence.
- 2. Changing rapidminer.general.randomseed = -1 in .rapidminer/ 4_2_0_rapidminerrc.Windows XP works; now we get different samples in each run. However, a warning message appears when opening the Preferences dialog box: “Illegal value '-1' for parameter 'rapidminer.general.randomseed' has been corrected to '1'.” (???) Nevertheless, the system still behaves as if -1 were in effect.
- 3. Now changing rapidminer.general.randomseed in the Preferences dialog box to any positive value, e.g. 42, and clicking "Apply" & "Save" leaves the random behaviour untouched (different samples in each run). Only after restarting RapidMiner does the new setting "42" take effect >> the same sample A is produced in every run.
- 4. Changing rapidminer.general.randomseed in the Preferences dialog box to any other positive value, e.g. 84, and clicking "Apply" & "Save" leaves the random behaviour untouched (same sample A in each run). Only after restarting RapidMiner does the new setting "84" take effect >> a new, but again always identical, sample B is produced in every run.
- 5. With rapidminer.general.randomseed = -1, only sample_ratio<1.0 generates different samples in each run. When sample_ratio=1.0 and sample_size=1000 (in a 50000-record dataset), each run produces the same sequence of 1000 records, not 1000 different records. So there seems to be no randomness in sample_size.
- 6. Most disturbing: if I use the operator Sampling and set its parameter local_random_seed to any value other than -1, then any incoming dataset is reduced to 0 records on output, irrespective of how large the sample_ratio is!! This leaves me rather puzzled ???
Am I really the first one to notice this somewhat strange behaviour, or am I doing something in an unexpected way? Isn't it strange that there is no way to achieve a "random random seed" by any means from the GUI, although the tooltip says that -1 would do it?
Any clarifications are greatly appreciated
Best regards
Wolfgang
[attachment deleted by admin]
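For readers who want to reproduce the expected seed behaviour outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner code: the helper names `make_rng` and `sample_by_ratio` are made up for illustration, and the only assumption taken from the thread is the convention that -1 means "pick a fresh seed on every run" while any positive value fixes the sequence.

```python
import random


def make_rng(seed: int) -> random.Random:
    """Hypothetical helper: -1 means 'seed from system entropy', anything else is a fixed seed."""
    if seed == -1:
        return random.Random()  # no argument: seeded from OS entropy or the clock
    return random.Random(seed)


def sample_by_ratio(records, ratio, rng):
    """Keep each record independently with probability `ratio` (size is only approximate)."""
    return [r for r in records if rng.random() < ratio]


data = list(range(50000))

# Fixed seed 42: "sample A" is identical in every run.
sample_a1 = sample_by_ratio(data, 0.5, make_rng(42))
sample_a2 = sample_by_ratio(data, 0.5, make_rng(42))

# Fixed seed 84: a different, but again reproducible, "sample B".
sample_b = sample_by_ratio(data, 0.5, make_rng(84))

# Seed -1: a different sample on each run.
free_1 = sample_by_ratio(data, 0.5, make_rng(-1))
free_2 = sample_by_ratio(data, 0.5, make_rng(-1))

print(sample_a1 == sample_a2, sample_a1 == sample_b, free_1 == free_2)
```

This is the behaviour items 2 to 4 describe once the setting has actually taken effect: 42 and 84 each give a stable sample, and -1 gives a new one per run.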
Answers
First of all: Is there a specific reason you refuse to work with the latest version of RapidMiner?
Second:
I can confirm that changing the global random seed in the Preferences dialog does not work.
Third:
If you used 4.4, you could set the parameters like this to get different samples:
kind regards,
Steffen
PS: A lot of bugs have been fixed since 4.2, so please consider the latest version.
PPS: You deserve an award for your error descriptions. Clearly above average!
Thanks again for your fast response. Following your hint, I have now switched to RapidMiner 4.4 (the reason for not using it in the first place was that I had read some posts here in the forum about things which used to work in former versions but had problems in 4.4), and so far it works very well on my platform.
I have to admit that most of the data sampling bugs described above are gone in RapidMiner 4.4. In particular, with 4.4 it is entirely possible to change rapidminer.general.randomseed to -1. The only remaining items can be considered not as bugs, but as features:
a) The setting of rapidminer.general.randomseed does not take effect immediately but only after a restart of RapidMiner (okay, this is not exactly the behaviour you expect from an "Apply" button...). The reason for this might be that the operator Root has its own parameter random_seed which is only filled at startup (very probably from rapidminer.general.randomseed). If you change Root's random_seed to -1, 42 or 84, you immediately get the desired effects.
(Perhaps something to work out in a further and future appendix of the documentation ...)
b) The only remaining fact is that the parameter sample_size in operator ExampleSource always produces the same sequence, irrespective of what the global or local random seed actually is. But as I said, it can be considered a feature, not a bug...
So I apologize for bothering you with bugs mostly from earlier versions.
And thanks for the PPS
Best regards
Wolfgang
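The guess in item a) can be pictured with a small Python sketch. It only models the suspected mechanism (a value copied once at startup) and is not taken from the RapidMiner sources; the classes `Preferences` and `RootProcess` are hypothetical stand-ins.

```python
import random


class Preferences:
    """Hypothetical stand-in for the global setting rapidminer.general.randomseed."""

    def __init__(self, random_seed: int):
        self.random_seed = random_seed


class RootProcess:
    """Hypothetical stand-in for the Root operator with its own random_seed parameter."""

    def __init__(self, prefs: Preferences):
        # Assumption: the global value is copied exactly once, when the process is created.
        self.random_seed = prefs.random_seed

    def run(self, records, ratio):
        # -1 is taken to mean "fresh seed", any other value fixes the sequence.
        rng = random.Random() if self.random_seed == -1 else random.Random(self.random_seed)
        return [r for r in records if rng.random() < ratio]


data = list(range(1000))
prefs = Preferences(42)
root = RootProcess(prefs)

before = root.run(data, 0.5)

prefs.random_seed = 84             # "Apply" & "Save" in the Preferences dialog
after_apply = root.run(data, 0.5)  # Root still holds 42, so nothing changes

root.random_seed = 84              # changing Root's own parameter works at once
after_root_change = root.run(data, 0.5)

restarted = RootProcess(prefs).run(data, 0.5)  # a "restart" picks up the saved 84

print(before == after_apply, after_root_change == restarted)
```

Under this assumption the "Apply" button does update the stored preference, but the running process never looks at it again, which matches what was observed.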
A clear feature
The reason for this behavior is that we try to avoid loading all examples only to skip most of them again, which of course is necessary for large files. If you do not need an exact but only a rough number of examples, you could use the parameter "sample_ratio" instead. Or just use one of the sampling operators after loading.
Thanks for the hints and cheers,
Ingo
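The trade-off described in this reply can be sketched in Python as follows. This is an illustration under assumptions, not the actual ExampleSource implementation: `first_n` models a sample_size that stops reading early, `by_ratio` models sample_ratio, and `reservoir` is plain reservoir sampling (Algorithm R), added here only to show what an exact-size random sample costs. The `counted` wrapper is a made-up helper that counts how many records were read.

```python
import random
from itertools import islice


def counted(records, counter):
    """Yield records while counting how many were actually read from the source."""
    for r in records:
        counter[0] += 1
        yield r


def first_n(stream, n):
    """Exact size, no randomness: stop reading after the first n records."""
    return list(islice(stream, n))


def by_ratio(stream, ratio, rng):
    """Random but only roughly sized: keep each record with probability `ratio`."""
    return [r for r in stream if rng.random() < ratio]


def reservoir(stream, n, rng):
    """Exact size and random (Algorithm R), but every record has to be read."""
    sample = []
    for i, rec in enumerate(stream):
        if i < n:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = rec
    return sample


rng = random.Random(42)

reads_first = [0]
head = first_n(counted(range(50000), reads_first), 1000)

reads_ratio = [0]
rough = by_ratio(counted(range(50000), reads_ratio), 0.02, rng)

reads_res = [0]
exact = reservoir(counted(range(50000), reads_res), 1000, rng)

print(reads_first[0], reads_ratio[0], reads_res[0], len(head), len(exact))
```

Taking the first n records reads only n records and is therefore always the same sequence, whatever the seed. Getting exactly n random records means touching every record, which is what a separate sampling step after loading amounts to.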