Model Applier Output Misassigning Internal Mappings for Nominal Values
Following up on the earlier Model Applier problems with internal nominal mappings, I am still having trouble! It seems that RapidMiner's model applier mishandles nominal values that are not first in the value list in the .aml files.
Following the workaround, in the first step I load my training data (attached) from an Excel file, write it out with ExampleSetWriter, load it back in with ExampleSource, create a model and then write the model:
Next, I read in a test set consisting of a single example from an excel file (temp.xls) and write it out with the example set writer. I guess this step isn't strictly necessary but it is helpful in what is to come:
THIS PART IS THE WORKAROUND: I now manually open train.aml and temp.aml. I copy all of the attribute information from train.aml over the attribute information in temp.aml so that all of the attribute information in both files is exactly the same.
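For anyone who wants to avoid the copy and paste, the manual step above could be scripted roughly as follows. This is only a sketch in Python, not part of my RapidMiner process. The function name copy_attribute_info is made up for illustration, and it assumes that both .aml files are well-formed XML with one root element whose children are the attribute declarations, and that those children contain nothing specific to the training data file.

```python
import copy
import xml.etree.ElementTree as ET


def copy_attribute_info(train_aml_path, temp_aml_path):
    """Overwrite the child elements of temp.aml with those of train.aml.

    The root element of temp.aml and its own XML attributes (for example
    the reference to the data file) are kept; only the children change.
    """
    train_root = ET.parse(train_aml_path).getroot()
    temp_tree = ET.parse(temp_aml_path)
    temp_root = temp_tree.getroot()

    # Drop the declarations that were generated from the small test set ...
    for child in list(temp_root):
        temp_root.remove(child)
    # ... and insert copies of the training declarations in the same order.
    for child in train_root:
        temp_root.append(copy.deepcopy(child))

    temp_tree.write(temp_aml_path, encoding="utf-8", xml_declaration=True)
```

It would be called as copy_attribute_info("train.aml", "temp.aml") after both files have been written by ExampleSetWriter.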
In the third part I apply the model to a new instance of test data; for this run I have used the same temp.xls. This is the process I call for my real-world predictions. I load temp.xls and then use ExampleSetWriter to write out only the temp.dat file, so as to preserve the correct attribute information copied in the workaround above. I have stuck in an IOConsumer just as a control for testing.
I then load the test example using ExampleSource to load temp.aml. I have a FeatureIterator to scrub out any missing data, which in our set is represented with 999. I then load the model, apply it and write out the prediction.
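To be clear about what I mean by scrubbing, here is the idea written out in plain Python. It is only a sketch of the intended behaviour, not how the FeatureIterator does it internally; the function name scrub_missing and the dictionary layout of the rows are made up for illustration.

```python
import math

MISSING_SENTINEL = 999.0  # the value used for "missing" in our data set


def scrub_missing(rows, numeric_columns, sentinel=MISSING_SENTINEL):
    """Return copies of the rows with the sentinel replaced by NaN.

    Only the listed numeric columns are touched, so an id that happens
    to equal the sentinel is left alone.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for column in numeric_columns:
            if new_row.get(column) == sentinel:
                new_row[column] = math.nan
        cleaned.append(new_row)
    return cleaned


rows = [
    {"UR": 191.0, "SF36PHY1": 999.0},
    {"UR": 360.0, "SF36PHY1": 100.0},
]
print(scrub_missing(rows, ["SF36PHY1"]))
```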
Now here is the problem. The output file has only ? where there should be data!
For example,
SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP UR SUCCESS
Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only 191.00
becomes this in the output:
UR SUCCESS SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS)
191.0 ? Current Long-Term ? ? ? CBT only Unsuccessful
confidence(Unsuccessful) confidence(Successful)
.7 .3
Now, let's take a look at the .aml files. You will notice below that the only nominal value being written out is Current Long-Term for MARSTAT. It is the only nominal value in this example which appears FIRST in its value list in the .aml files. So, at least for the output written after the model applier, only the first nominal values are working.
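<attribute name="SEX" sourcecol="1" valuetype="nominal">
<value>Male</value>
<value>Female</value>
</attribute>
<attribute name="MARSTAT" sourcecol="2" valuetype="nominal">
<value>Current Long-Term</value>
<value>Previous Long-Term</value>
<value>Single</value>
</attribute>
<attribute name="EDUC" sourcecol="3" valuetype="nominal">
<value>Uni</value>
<value>Senior (Yr 12)</value>
<value>Junior (Yr 10)</value>
<value>Primary</value>
<value>Tertiary (Non-Uni)</value>
</attribute>
<attribute name="EMPLOY" sourcecol="4" valuetype="nominal">
<value>Employed</value>
<value>Unemployed</value>
<value>Student</value>
</attribute>
<attribute name="ACCOM" sourcecol="5" valuetype="nominal">
<value>Rent</value>
<value>Own Home</value>
<value>Other</value>
</attribute>
<attribute name="SF36PHY1" sourcecol="6" valuetype="real"/>
<attribute name="GROUP" sourcecol="7" valuetype="nominal">
<value>CBT only</value>
<value>Combination</value>
<value>Refuseniks</value>
<value>Acamprosate</value>
<value>St Judes</value>
<value>Naltrexone</value>
</attribute>
<id name="UR" sourcecol="8" valuetype="integer"/>
Now, let's use a test set which only consists of first nominal values (attached as tempfirst; you will have to rename it to temp to use my code above).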
It works! Confirming my theory.
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 Male Current Long-Term Uni Employed Rent CBT only Unsuccessful .7 .3
Now with a file where the first nominal value is never present (attached as tempallnotfirst, rename to temp to use) and as expected we have
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 ? ? ? ? ? ? Successful .4 .6
Now, going back to our original temp file, we can take a look at the DataTable tab at the end of the experiment. It's a bit messy, but the EDUC and EMPLOY rows below are examples of data that goes missing. In both cases the statistics column says the mode is unknown, but the value is still listed in the range column!
id UR integer avg = 191 +/- 0 [191.000 ; 191.000] 0.0
prediction prediction(SUCCESS) nominal mode = Unsuccessful (1), least = Successful (0) Unsuccessful (1), Successful (0) 0.0
confidence_Unsuccessful confidence(Unsuccessful) real avg = 0.666 +/- 0 [0.666 ; 0.666] 0.0
confidence_Successful confidence(Successful) real avg = 0.334 +/- 0 [0.334 ; 0.334] 0.0
regular SEX nominal mode = unknown Female (0) 0.0
regular MARSTAT nominal mode = Current Long-Term (1), least = Current Long-Term (1) Current Long-Term (1) 0.0
regular EDUC nominal mode = unknown Senior (Yr 12) (0) 0.0
regular EMPLOY nominal mode = unknown Unemployed (0) 0.0
regular ACCOM nominal mode = unknown Own Home (0) 0.0
regular SF36PHY1 real avg = ? +/- ? [∞ ; -∞] 1.0
regular GROUP nominal mode = CBT only (1), least = CBT only (1) CBT only (1) 0.0
Now, the problem could be in producing the output from the model or in the actual model applier itself.
To try and test if the data is going missing in the model applier I ran the model applier process a few times, each time changing one of the suspect variables to a missing value and found the following predictions:
Original: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
Female Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EDUC Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EMPLOY Missing: 1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
Well, I think you get the picture there. The data for these variables seems to be treated by the model applier as if it is missing.
Am I going mad? Have I missed something obvious?
How do I attach my data files?
Answers
If I enter the data for two instances into the temp file I get the following results:
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only Unsuccessful .7 .3
Works beautifully!
Actually, I think I have tracked this bug a little further now.
If I enter the first 10 instances all together at once I get the following
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only Unsuccessful .7 .3
360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only Unsuccessful .7 .3
1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home 90.0 Combination Unsuccessful .6 .4
1191.0 Female Current Long-Term Junior (Yr 10) Unemployed Own Home 50.0 Combination Successful .3 .7
1193.0 Female Previous Long-Term Junior (Yr 10) Unemployed Rent 85.0 Combination Successful .3 .7
13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent 95.0 Refuseniks Unsuccessful .7 .3
14562.0 Female Previous Long-Term Junior (Yr 10) Unemployed Rent 100.0 CBT only Unsuccessful .7 .3
15655.0 Male Single Senior (Yr 12) Employed Rent 55.0 Combination Successful .3 .7
16126.0 Male Single Junior (Yr 10) Employed Own Home 90.0 Combination Unsuccessful .6 .4
They all work! But if I enter them individually, one at a time, we see the same behaviour as above: values are only displayed if they are first in the nominal list. Individual results:
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
191.0 ? Current Long-Term ? ? ? CBT only Unsuccessful .7 .3
360.0 ? Current Long-Term ? Employed ? 100.0 CBT only Unsuccessful .7 .3
1173.0 ? Current Long-Term ? Employed ? 90.0 ? Unsuccessful .6 .4
1191.0 ? Current Long-Term ? ? ? 50.0 ? Successful .3 .7
1193.0 ? ? ? ? Rent 85.0 ? Successful .3 .7
13879.0 Male Current Long-Term ? Employed Rent 95.0 ? Unsuccessful .7 .3
14562.0 ? ? ? ? Rent 100.0 CBT only Unsuccessful .7 .3
15655.0 Male ? ? Employed Rent 55.0 ? Successful .3 .7
16126.0 Male ? ? Employed ? 90.0 ? Unsuccessful .6 .4
The predictions made are all the same as above, so I hope that that is an indication that the predictions are being made correctly and using all of the data.
It gets interesting when you start looking at two instances together.
If we combine the first two instances then all of the second prints out correctly!
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only Unsuccessful .7 .3
If we combine the second and third we get something quite horribly wrong. Now both Employed and Unemployed display, BUT each is shown for the wrong person!
191.0 ? Current Long-Term ? Employed ? CBT only Unsuccessful .7 .3
360.0 ? Current Long-Term ? Unemployed ? 100.0 CBT only Unsuccessful .7 .3
For these two, the Excel example source loader shows
1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN CBT only
2 360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only
Then the examplesetwriter
1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN CBT only
2 360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only
At the examplesource breakpoint all is still well
1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN CBT only
2 360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only
However, something goes horribly wrong when we reach the model applier breakpoint!
Looking at the data view tab (I truncated the first probability)
1 191.0 Unsuccessful 0.66578947 0.33421052631578946 ? Current Long-Term ? Employed ? NaN CBT only
2 360.0 Unsuccessful 0.66578947 0.33421052631578946 ? Current Long-Term ? Unemployed ? 100.0 CBT only
EMPLOYED HAS SWITCHED INSTANCES!
The log states the following which seems ok:
May 24, 2009 2:16:00 PM: [NOTE] ExcelExampleSource: Breakpoint reached
May 24, 2009 2:16:58 PM: [NOTE] ExampleSetWriter: Breakpoint reached
May 24, 2009 2:17:37 PM: [NOTE] ExampleSource: Breakpoint reached
May 24, 2009 2:18:17 PM: [NOTE] ModelLoader: Breakpoint reached
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'SEX', training: 2, application: 1
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'MARSTAT', training: 3, application: 1
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EDUC', training: 5, application: 1
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EMPLOY', training: 3, application: 2
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'ACCOM', training: 3, application: 1
May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'GROUP', training: 6, application: 1
May 24, 2009 2:18:31 PM: [NOTE] ModelApplier: Breakpoint reached
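The warnings suggest one possible explanation, which I can only sketch as a toy model in Python. This is a guess, not RapidMiner's actual code. The assumption is that the model applier re-encodes each nominal value with the index it had in the training mapping, while the display still decodes those indices with the smaller mapping built from the application data in order of first appearance. The function names are made up, and the training order for EMPLOY is taken from the .aml listing in my first post.

```python
def build_mapping(values):
    """Assign indices to nominal values in order of first appearance."""
    mapping = []
    for value in values:
        if value not in mapping:
            mapping.append(value)
    return mapping


def displayed_values(column, training_mapping):
    """Toy model: encode with the training mapping, decode with the
    mapping that was built from the application data."""
    application_mapping = build_mapping(column)
    shown = []
    for value in column:
        index = training_mapping.index(value)  # index the model expects
        if index < len(application_mapping):
            shown.append(application_mapping[index])
        else:
            shown.append("?")  # no such index in the application mapping
    return shown


EMPLOY_TRAINING = ["Employed", "Unemployed", "Student"]

# Instances 191 (Unemployed) and 360 (Employed), in that order.
print(displayed_values(["Unemployed", "Employed"], EMPLOY_TRAINING))
# prints ['Employed', 'Unemployed']
```

Under the same assumption a single Female instance comes out as ?, because Female has training index 1 and the application mapping only has index 0, while a single Male instance displays correctly. That matches what I see, but it remains a guess until someone checks the source.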
If we combine the first and third instances everything seems ok again
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only Unsuccessful .7 .3
If I send the last three then the output goes badly wrong: the values come out in some sort of order, but it is not the order in which they went in!
input:
Female Previous Long-Term Junior (Yr 10) Unemployed Rent 100 CBT only 14562.00
Male Single Senior (Yr 12) Employed Rent 55 Combination 15655.00
Male Single Junior (Yr 10) Employed Own Home 90 Combination 16126.00
output:
14562.0 Male Single ? Employed Rent 100.0 CBT only Unsuccessful .7 .3
15655.0 Female ? Senior (Yr 12) Unemployed Rent 55.0 Combination Successful .3 .7
16126.0 Female ? ? Unemployed Own Home 90.0 Combination Unsuccessful .6 .4
Finally if we add the first case back on top
132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
14562.0 Female Previous Long-Term Senior (Yr 12) Unemployed Rent 100.0 CBT only Unsuccessful .7 .3
15655.0 Male Single Junior (Yr 10) Employed Rent 55.0 Combination Successful .3 .7
16126.0 Male Single Senior (Yr 12) Employed Own Home 90.0 Combination Unsuccessful .6 .4
Most of them are correct except that EDUC has been flipped.
So, in summary:
the model applier itself seems to be working, as the predictions are consistent;
numerical values are fine;
nominal values are being assigned to the wrong category at the model output stage if instances are entered one by one or in small groups!
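Checking the rows by eye gets tedious, so a small comparison script along these lines can list every cell that changed between input and output. It is only a sketch in Python; the function name find_mismatches is made up, rows are matched by position, and the sample data are the SEX and EMPLOY values of the last three instances from above.

```python
def find_mismatches(columns, input_rows, output_rows):
    """List (row id, column, expected, actual) for every cell that changed."""
    mismatches = []
    for expected_row, actual_row in zip(input_rows, output_rows):
        for column in columns:
            if expected_row[column] != actual_row[column]:
                mismatches.append(
                    (expected_row["UR"], column,
                     expected_row[column], actual_row[column])
                )
    return mismatches


input_rows = [
    {"UR": 14562.0, "SEX": "Female", "EMPLOY": "Unemployed"},
    {"UR": 15655.0, "SEX": "Male", "EMPLOY": "Employed"},
    {"UR": 16126.0, "SEX": "Male", "EMPLOY": "Employed"},
]
output_rows = [
    {"UR": 14562.0, "SEX": "Male", "EMPLOY": "Employed"},
    {"UR": 15655.0, "SEX": "Female", "EMPLOY": "Unemployed"},
    {"UR": 16126.0, "SEX": "Female", "EMPLOY": "Unemployed"},
]

for mismatch in find_mismatches(["SEX", "EMPLOY"], input_rows, output_rows):
    print(mismatch)
```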
It is a pain that attachments have been disabled, even on personal messages, so it will be difficult to replicate your problem unless you email me the data.
One thought which may have relevance is this. Sticking a question mark in to indicate missing data doesn't indicate missing data to RM, like this.... Whereas putting nothing in the "replace_by" slot does. I realise that this may be completely irrelevant, but it is difficult to tell without the data. Still the point to take is that "?" is a nominal value as far as RM sees it.
Ooops, time for Sunday lunch and copious grog, better zoom off.
I am happy to email the data I used for the test to anyone who is interested in taking a look.
I tried to put nothing in the replace with box but it didn't appear to be replacing anything, just passing it through.
To make a test of this I took the same 10 cases as I have been using for prediction above but replaced all of the numeric values for the only numeric variable with 999 which is being used for missing values.
For the first run I used question mark in the replace with box and for the second run I left the replace with box empty. Only 2 predictions changed under these 2 conditions.
For ? we have:
1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home Combination Successful .5 .5
13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent Refuseniks Successful .5 .5
(It should be noted here that no values were outputted for the numeric variable)
These are the same values that I get when I leave the numeric values empty:
1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home Combination Successful .5 .5
13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent Refuseniks Successful .5 .5
Using an empty replace with
1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home 999.0 Combination Unsuccessful .6 .4
13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent 999.0 Refuseniks Successful .0 1.0
(So you can see that it spits out 999s here, seemingly not replaced and the predictions are different)
When I use a value of 998 for the numeric variables the predictions are the same as in the 999 case.
1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home 998.0 Combination Unsuccessful .6 .4
13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent 998.0 Refuseniks Successful .0 1.0
So in this case it really looks like ? is acting as a missing value, but with nothing in the replace with box it is not replacing anything and is therefore interpreting the 999 as the number 999.
Are you sure that an empty replace with box really works?
Does it do different things when applied to numeric and nominal values?