"Minimal use-case: YAGGA2, YAGGA"
dromiceiomimus
Member, Posts: 4, Contributor I
Hi all,
First let me thank the developers for this wonderful tool. I've already had great success with some models.
Now, I'm trying to get YAGGA2 to work. My actual application is more complex than what's presented here, but I'd like to figure out a minimal setup that results in YAGGA2 functioning correctly before trying to apply it there.
So, here's some example data:
a and b will be our attributes, c will be our label. c is log10(max(abs(b-a),50)*a), presumably a good candidate for YAGGA2.
a,b,c
1, 1, 1.698970004
2, 13, 2
4, 26, 2.301029996
8, 40, 2.602059991
16, 55, 2.903089987
32, 71, 3.204119983
64, 88, 3.505149978
128,106,3.806179974
256,125,4.525511261
512,235,5.15174973
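As a quick sanity check, the label column can be recomputed from the stated formula. This is a minimal Python sketch and not part of the RapidMiner process; the names `rows`, `label`, and `max_error` are illustrative only.

```python
import math

# Rows copied from the example data above: (a, b, c)
rows = [
    (1, 1, 1.698970004),
    (2, 13, 2.0),
    (4, 26, 2.301029996),
    (8, 40, 2.602059991),
    (16, 55, 2.903089987),
    (32, 71, 3.204119983),
    (64, 88, 3.505149978),
    (128, 106, 3.806179974),
    (256, 125, 4.525511261),
    (512, 235, 5.15174973),
]


def label(a, b):
    """c = log10(max(abs(b - a), 50) * a), as stated in the post."""
    return math.log10(max(abs(b - a), 50) * a)


# Largest absolute difference between the recomputed and the posted labels
max_error = max(abs(label(a, b) - c) for a, b, c in rows)
print(f"largest deviation from the posted c values: {max_error:.2e}")
```

Note that the max(..., 50) term only stops being constant in the last two rows, so most of the data carries no information about b.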
Here's my process:
[CSV] -> [YAGGA2 (NN -> Apply Model -> Performance)]
This consistently errors with "Process failed: Generation exception: 'java.lang.IllegalArgumentException: Duplicate attribute name: prediction(c)'". Attempting to remove this attribute anywhere in the above chain does no good.
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<operator activated="true" class="optimize_by_generation_yagga2" compatibility="5.1.017" expanded="true" height="94" name="Generate" width="90" x="246" y="30">
<operator activated="true" class="neural_net" compatibility="5.1.017" expanded="true" height="76" name="Neural Net" width="90" x="112" y="30">
<operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model" width="90" x="246" y="30">
<operator activated="true" class="performance" compatibility="5.1.017" expanded="true" height="76" name="Performance" width="90" x="380" y="30"/>
Using YAGGA (not 2) this process will run, but no new attributes will be generated.
What am I doing wrong?
Answers
The "Apply Model" operator adds new attributes to the example set, and these are passed up to the outer level of the YAGGA operator. On the second pass the same attributes are added again, which produces the duplicates.
One way to fix it is to use a cross validation operator inside the YAGGA operator. This leaves the example set alone and produces an averaged estimate of what the performance could be on unseen data.
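The mechanism can be illustrated with a toy model in Python. This is only a sketch of the idea and not RapidMiner's actual implementation: `ExampleSet`, `apply_model`, `evaluate_in_place`, and `evaluate_on_copy` are hypothetical names, and the predictions and performance values are placeholders.

```python
class ExampleSet:
    """Toy stand-in for an example set: unique attribute names mapped to columns."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def add_attribute(self, name, values):
        # Attribute names must be unique, as in the error message from the post
        if name in self.columns:
            raise ValueError(f"Duplicate attribute name: {name}")
        self.columns[name] = list(values)

    def copy(self):
        return ExampleSet(self.columns)


def apply_model(example_set):
    """Mimics applying a model: writes a prediction column into the given set."""
    placeholder_predictions = [0.0] * len(example_set.columns["c"])
    example_set.add_attribute("prediction(c)", placeholder_predictions)
    return example_set


def evaluate_in_place(example_set):
    """Inner process that modifies the set handed down by the outer loop."""
    apply_model(example_set)
    return 0.0  # placeholder performance value


def evaluate_on_copy(example_set):
    """Inner process that, like a validation step, leaves the input untouched."""
    apply_model(example_set.copy())
    return 0.0  # placeholder performance value


raw = {"a": [1, 2, 4], "b": [1, 13, 26], "c": [1.70, 2.00, 2.30]}

# Two rounds of the outer loop on the same set: the second round fails
shared = ExampleSet(raw)
in_place_error = None
try:
    for _ in range(2):
        evaluate_in_place(shared)
except ValueError as exc:
    in_place_error = str(exc)
print(in_place_error)

# Two rounds where the inner process works on a copy: no leftovers, no error
fresh = ExampleSet(raw)
for _ in range(2):
    evaluate_on_copy(fresh)
print(sorted(fresh.columns))
```

The point of the second variant is that the outer loop only ever receives a performance value back, never an example set with a leftover prediction column.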
regards
Andrew
Though, I must admit, I don't quite understand why. The explanation makes enough sense to me; I wasn't thinking about the YAGGA operator's internal state as the place where the duplicates must not occur. Still, it has me confused.
Why does cross validation work but not cross validation (parallel)? Are there other operators I could use there aside from normal cross validation? Is there some other (I suppose, better) way to employ the YAGGA operator, or am I going about this wrong from the beginning?
Any hints on those?
Cheers.
By using an internal cross validation (or split validation) you get a better and more robust performance estimate anyway, and you don't have to clean up yourself: the validation operator does that automatically. So I also highly recommend using either a cross validation or a single split validation inside the YAGGA operators. The same is true for basically all wrapper approaches for feature selection, generation, weighting... Hope that clarifies things a bit.
In principle, the parallel cross validation should also be possible. You should, however, not nest different parallel algorithms; for example, you should not nest a parallel cross validation inside a parallel feature selection / generation.
Yes, you could use X-Validation, Split Validation, Bootstrapping Validation, or Batch-X-Validation. If you know what you are doing you could also create specialized subprocesses, but in that case you have to make sure you clean up the predictions yourself.
No, in principle you should be fine. The rest is more about parameter tuning. One tip though: I would try YAGGA2 on slightly bigger data sets, since otherwise either no new and interesting attributes will be created or it will directly result in overfitting. In your case, log(a) is already highly correlated with the label c, so any additional attribute does not really help...
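The remark about log(a) can be checked directly on the ten posted rows. A minimal Python sketch, assuming the Pearson coefficient is the correlation measure meant here (the post does not say which one); `pearson`, `a_values`, and `c_values` are illustrative names.

```python
import math

# Columns a and c from the example data posted at the top of the thread
a_values = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
c_values = [
    1.698970004, 2.0, 2.301029996, 2.602059991, 2.903089987,
    3.204119983, 3.505149978, 3.806179974, 4.525511261, 5.15174973,
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


log_a = [math.log10(a) for a in a_values]
r = pearson(log_a, c_values)
print(f"Pearson r between log10(a) and c: {r:.3f}")
```

For the first eight rows c differs from log10(a) only by the constant log10(50), which is why a single existing attribute already explains almost all of the label.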
By the way, there is also a sample process for YAGGA in the Sample repository delivered with RapidMiner, in case you have not seen it yet: Sample/processes/04_attributes/19_YAGGA
Cheers,
Ingo