Apply Model: Testing & Training Sets Differ
Hi
I am using Sentiment 140 as my training and testing data. They have already split the data into two sets. I am performing training, cross validation and testing separately: training and CV on the training set, and testing on the test set. The problem is that after text preprocessing, the features in the test set don't align with those of the training set, so I can't apply the trained model. The end product of my text preprocessing is a matrix where the texts are the examples and the features are the terms, valued by term frequency, so the feature sets differ between the training and test sets.
Do I somehow merge both sets so that the features are aligned, with TF = 0 for terms that don't occur in a given set?
Thanks
Best Answers
Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,635; Unicorn)
The word list elements will be constrained but the TF-IDF values will be recalculated on the new sample in Process Documents.
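To illustrate the idea outside RapidMiner, here is a minimal plain-Python sketch of constraining the test features to a word list built from the training texts. The function names (`build_word_list`, `vectorize`) and the toy texts are made up for illustration, and it uses raw term counts rather than TF-IDF; it is not how Process Documents is implemented.

```python
from collections import Counter

def build_word_list(train_texts):
    """Collect the sorted vocabulary seen in the training texts."""
    vocab = set()
    for text in train_texts:
        vocab.update(text.lower().split())
    return sorted(vocab)

def vectorize(texts, word_list):
    """Build term-count rows whose columns follow word_list exactly."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        # Counter returns 0 for words absent from this text;
        # words not in word_list are simply never looked up.
        rows.append([counts[w] for w in word_list])
    return rows

# Hypothetical toy data
train_texts = ["good movie", "bad movie", "good good day"]
test_texts = ["good film", "bad day today"]

word_list = build_word_list(train_texts)
train_matrix = vectorize(train_texts, word_list)
test_matrix = vectorize(test_texts, word_list)
```

Both matrices end up with the same columns in the same order. Test-only words such as "film" are dropped, and training words that a test text lacks get a count of 0, so there is no need to merge the two sets.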
jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
Be careful here. If your text processing in training uses pruning, then in testing you must not only use your saved word list to constrain the terms in the TF-IDF vector, as suggested by @Telcontar120, but also switch off pruning. Otherwise your word list may be shrunk by the pruning step, making the two sets incompatible when you apply the model to the test data.
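The pruning pitfall can be shown with a small plain-Python sketch. It assumes pruning means "drop terms that occur in fewer than `min_docs` documents"; the helper names and toy texts are hypothetical and only mimic the effect, not RapidMiner's actual pruning options.

```python
def doc_frequency(texts):
    """Count in how many texts each word occurs."""
    df = {}
    for text in texts:
        for word in set(text.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df

def prune(word_list, texts, min_docs):
    """Keep only the words that occur in at least min_docs of the texts."""
    df = doc_frequency(texts)
    return [w for w in word_list if df.get(w, 0) >= min_docs]

# Hypothetical toy data
train_texts = ["good movie", "bad movie", "good day", "bad day"]
test_texts = ["good film", "good day"]

train_vocab = sorted({w for t in train_texts for w in t.lower().split()})
train_word_list = prune(train_vocab, train_texts, min_docs=2)

# Mistake: pruning again on the test set shrinks the saved word list.
test_word_list_pruned = prune(train_word_list, test_texts, min_docs=2)

# Pruning switched off in testing: reuse the saved word list unchanged.
test_word_list = list(train_word_list)
```

With pruning left on, the test branch keeps fewer terms than the model was trained on, so the columns no longer match. Reusing the saved word list without pruning keeps them identical.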
jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
I have now noticed that you reduce dimensionality with a select-by-weights method. In that case, pass the weights from training to your testing branch. There you do not need the weighting operator; just apply the selection using the weights from training.
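As a rough plain-Python analogy, the weights are computed once on the training matrix and the resulting column selection is reused on the test matrix. The weighting scheme here (absolute difference of per-class mean term count) and all names are assumptions chosen for illustration; any weighting method would be passed along the same way.

```python
def feature_weights(matrix, labels):
    """Toy weight: absolute difference of per-class mean term count."""
    weights = []
    for j in range(len(matrix[0])):
        pos = [row[j] for row, y in zip(matrix, labels) if y == 1]
        neg = [row[j] for row, y in zip(matrix, labels) if y == 0]
        weights.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return weights

def top_k_columns(weights, k):
    """Indices of the k highest-weighted columns, in original column order."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:k])

def select_columns(matrix, columns):
    """Keep only the given columns of every row."""
    return [[row[j] for j in columns] for row in matrix]

# Hypothetical toy data, already aligned to the same word list
word_list = ["bad", "day", "good", "movie"]
train_matrix = [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 2, 0], [1, 1, 0, 0]]
train_labels = [1, 0, 1, 0]
test_matrix = [[0, 0, 1, 0], [1, 1, 0, 0]]

# Weights are computed on the training data only.
weights = feature_weights(train_matrix, train_labels)
columns = top_k_columns(weights, k=2)

# The same selection is applied to both branches; no re-weighting on test.
train_selected = select_columns(train_matrix, columns)
test_selected = select_columns(test_matrix, columns)
selected_words = [word_list[j] for j in columns]
```

Because the test branch never recomputes weights, the selected features cannot drift away from the ones the model was trained on.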
Answers
This works, using the word list output of the training leg. But what if I process the data further after the Process Documents operator and reduce the features with a select by weight operator?