Binary text classification - Help in process needed.

thiemo · December 2016

Hey guys,

We want to do a binary classification on a text data set with the distribution 80% negative class, 20% positive class. In order to reach maximum statistical meaningfulness, we want to do so by using 10-fold cross validation.

When we model this in RapidMiner, we are unsuccessful, since it doesn't output any statistical metrics (precision, recall, etc.):

Bildschirmfoto 2016-12-01 um 12.14.37.png

We found a workaround that works, but it doesn't make sense from an ML perspective: if we first split the data into training and test sets and then use 10-fold cross-validation, it works. But the training/test split should already be part of the cross-validation (9 training folds, 1 test fold, 10 iterations). So right now the only way to get this working is to FIRST divide into test and training and THEN use X-Validation. Did we model it the right way, or did we miss anything?

Bildschirmfoto 2016-12-01 um 12.14.37.png
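To make the "9 training folds, 1 test fold, 10 iterations" idea concrete, here is a minimal sketch in plain Python (not RapidMiner) of how a stratified 10-fold split could be built for an 80/20 data set. The function name `stratified_folds` and the synthetic labels are assumptions for illustration only. Every example lands in a test fold exactly once, which is why no separate outer train/test split should be needed.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each example index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical data set: 80% negative (0), 20% positive (1).
labels = [0] * 800 + [1] * 200
folds = stratified_folds(labels, k=10)

for i, test_idx in enumerate(folds):
    # Everything outside the current fold is training data for this iteration.
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    positives = sum(labels[idx] for idx in test_idx)
    print(f"iteration {i + 1}: train={len(train_idx)} "
          f"test={len(test_idx)} positives in test={positives}")
```

Stratifying by class is a design choice here: with only 20% positives, a purely random split could leave some test folds with very few positive examples.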

If you need any more information for helping us, just comment.

Thank you very much in advance.

Best regards!

Thomas_Ott · December 2016

OK, silly question, but did you set a label role in your data set?

Telcontar120 · December 2016

This sounds like a strange problem, but it's very hard to troubleshoot from a screenshot of a process--can you post the process itself for review? You can export it from the file menu and attach it as a file.

Thanks,

thiemo · December 2016

Hey T-Bone,

Yes, I set a label role.

Regards,

thiemo · December 2016

Hey Brian,

thank you for your answer.

Here is the process that gives me results but makes no sense.

It would be great if you could help me. If you need any more information, I am happy to provide it.

Best regards,
Thiemo

Thomas_Ott · December 2016

I would double-check your process; something doesn't appear to be correct, because I can easily extract P/Rs and the confusion matrix.

See the sample XML below. This process takes Tweets, does a bit of processing up front and generates a random label. The Process Documents from Data operator then processes them to TF-IDF (you can select Binary Occurrences) and spits out the confusion matrix.

(Process XML not shown; the annotations from the process are:)

In the training phase, a model is built on the current training data set (90% of the data by default, 10 times).

The model created in the Training step is applied to the current test set (10%). The performance is evaluated and sent to the operator results.

A cross-validation evaluating a decision tree model.
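For readers unfamiliar with the two vector options mentioned above, the following is a rough plain-Python sketch of binary term occurrences versus TF-IDF. It assumes one common TF-IDF variant (relative term frequency times log(N / document frequency)); RapidMiner's exact weighting and normalization may differ, and the tokenized tweets are made up.

```python
import math
from collections import Counter


def binary_occurrences(docs):
    """1 if the term occurs in the document, 0 otherwise."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        present = set(doc)
        rows.append([1 if tok in present else 0 for tok in vocab])
    return vocab, rows


def tf_idf(docs):
    """Relative term frequency times log(N / document frequency); one common variant."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([(counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
                     for tok in vocab])
    return vocab, rows


# Hypothetical, already tokenized tweets.
docs = [["great", "product", "great"], ["bad", "product"], ["great", "service"]]
vocab, binary = binary_occurrences(docs)
_, weighted = tf_idf(docs)
print(vocab)
print(binary)
print(weighted)
```

Binary occurrences only record presence, while TF-IDF gives more weight to terms that are frequent in a document but rare across the collection.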

Telcontar120 · December 2016

Hi @thiemo,

I took your original process, and modified it only by inputting a simple toy example set using the identical Excel format (since I don't have your original dataset). Then I removed your outer split validation, and ran it again only using the cross-validation that you had as an inner operator. And it works fine! Here's the modified process. So if you are having problems, I suspect it must be something strange related to your original dataset. There's nothing that appears to be wrong with the process or with the cross-validation operator. Sorry I couldn't be more definitive.

Telcontar120 · December 2016

And here's the Excel file I used as input in case you are interested.

thiemo · December 2016

Hey Brian,

Thank you very much for your solution. I downloaded the process and the Excel file and tried it, and it works perfectly, but I do not get the performance parameters such as accuracy, recall, precision and the AUC.

How can I use this process and receive those 4 parameters?

Regards,

Thiemo

Telcontar120 · December 2016

Hi @thiemo,

I'm not sure what you mean--those performance metrics are all available in the performance tab output from the process when it runs. See the attached screenshot. This is part of the output for the process I supplied with no changes. Of course, the values are useless with my test examples since there are only 10 of them, but you can see that AUC, accuracy, precision, and recall are all available. If you run it on a larger dataset then they should all be there.

performance output.PNG
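As a side note on why such metrics carry little weight with only a handful of examples, here is a minimal plain-Python sketch of AUC computed as the fraction of (positive, negative) pairs that the model ranks correctly. The scores and labels are invented for illustration; they are not taken from the attached process.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores for the positive class and the true labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
print(auc(scores, labels))  # 7 of 8 pairs ranked correctly -> 0.875
```

With 2 positives and 4 negatives there are only 8 pairs, so the AUC can only move in steps of 0.125, which is one reason the values on a tiny toy set are not informative.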

thiemo · December 2016

Hi Brian,

thanks again for the quick answer.

However, if I take the process you uploaded and use your Excel file, I get a result but not the statistical parameters such as precision and recall.

Bildschirmfoto 2016-12-03 um 15.33.04.png

Did you do anything special while importing the data? I just set the type of need data to binominal. What can I do to get the precision and recall for the data?

Thank you and best regards,

Thiemo

Thomas_Ott · December 2016

What you see in the Statistics tab is just some basic descriptive statistics of your data set. There will be no P/R or confusion matrix because you haven't done any modeling yet. This view is similar to a summary or head command in Python/R.

You need to attach a Cross Validation operator with a machine learning algorithm embedded, plus a Performance operator, to generate the P/Rs and confusion matrix.
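To illustrate why these numbers cannot come from descriptive statistics alone, here is a small plain-Python sketch: the confusion matrix, precision and recall all need pairs of true label and model prediction. The two lists below are made-up values, and the function names are only illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def performance(tp, fp, fn, tn):
    """Accuracy, precision and recall; a ratio with a zero denominator is reported as 0.0."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical true labels and model predictions (1 = positive class).
actual = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
print(performance(*counts))
```

Without a `predicted` column, which only exists after a model has been applied, none of these values can be computed.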

thiemo · December 2016

Hi T-Bone,

thank you for the answer.

Exactly this was my initial problem. If I add another Cross Validation operator with a Performance operator around the actual process, then it makes no sense anymore, right?

Regards,

Thiemo

Thomas_Ott · December 2016

From that point in your process (where you show the Statistics tab), connect a Cross Validation operator (insert your algorithm on the Training side and an Apply Model and a Performance operator on the Testing side), THEN connect the "Per" port on the Cross Validation operator to the Results port. This will output the P/Rs etc. for you.
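The wiring described above can be sketched outside RapidMiner as well. The following plain-Python example is only an analogy under stated assumptions: `train` stands in for the learner on the Training side (a tiny Naive Bayes, chosen for brevity), `apply_model` for Apply Model, and the pooled counts for the Performance operator. The toy documents are invented.

```python
import math
from collections import Counter


def train(docs, labels):
    """Training side: fit a tiny multinomial Naive Bayes model (stand-in learner)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def apply_model(model, doc):
    """Testing side: predict the label with the higher smoothed log score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in doc:
            if tok in vocab:  # words unseen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10):
    """Contiguous folds; assumes len(docs) is divisible by k."""
    size = len(docs) // k
    tp = fp = fn = tn = 0
    for fold in range(k):
        test_idx = set(range(fold * size, (fold + 1) * size))
        train_idx = [i for i in range(len(docs)) if i not in test_idx]
        model = train([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in sorted(test_idx):
            predicted = apply_model(model, docs[i])
            if predicted == 1 and labels[i] == 1:
                tp += 1
            elif predicted == 1:
                fp += 1
            elif labels[i] == 1:
                fn += 1
            else:
                tn += 1
    return {
        "confusion": (tp, fp, fn, tn),
        "accuracy": (tp + tn) / len(docs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy corpus: every fifth document is positive (20% positives).
negatives = [["nice", "weather", "today"], ["lunch", "was", "good"],
             ["see", "you", "later"], ["nice", "lunch", "today"]]
positives = [["need", "help", "now"], ["please", "help", "urgent"]]
docs, labels = [], []
for i in range(50):
    if i % 5 == 0:
        docs.append(positives[(i // 5) % 2])
        labels.append(1)
    else:
        docs.append(negatives[i % 4])
        labels.append(0)

result = cross_validate(docs, labels, k=10)
print(result)
```

The performance figures are pooled over all ten test folds, so there is a single confusion matrix covering every example exactly once, with no outer split around the cross-validation.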
