"Training a Text classification model with more than 1 training data inputs/outputs"
Hi All!
I have a process that takes a text column and a topic column, builds a model, and trains it. I can then use this model to assign a topic to new text.
However, I now need to add a 2nd topic column, and maybe a 3rd and 4th, to tell the model that these topics are also valid options. For example, a text document can contain both the Innovation and Teamwork topics. So I want the model to recognize those 2 topics in a piece of text and then provide the output accordingly. Any idea how to implement this in RapidMiner?
Thanks very much for your support!
Answers
Hello @svtorykh, welcome to the Community. Can you please post the XML from your process using the > button?
Thanks.
Scott
Hi Scott,
Here you go!
Hmm, I'm not 100% sure I understand, but basically the data you use to train your model has to be robust enough to handle any kind of data that comes from your test set. Any unforeseen kind of input from your test set will naturally be classified poorly. Perhaps this helps?
Thanks for the reply, Scott.
Let me rephrase the problem in business language.
I have a set of text comments with 1 topic assigned for every comment in my data set.
For example, "I like doing my job" is tagged as Meaningful Work, and "We have great innovative products" is tagged as Innovation. The comment is located in column A and the topic in column B. This data set is used as the training data set to train the model. I can then take this model and assign the same topics to comments that are not tagged, i.e. classify/categorize the text.
Sometimes a comment may contain multiple topics, e.g. "I like my job as I'm continuously innovating every day". Ideally, I would tag this comment as both Meaningful Work and Innovation, but the second topic would need to go in column C. So my question is how to build the model using a training set with multiple topics assigned to the same comment. Each comment must stay as 1 row in Excel, with topics being added as columns. Is it clearer now?
Regards,
Is this what you're talking about at the meetup next week, @IngoRM? Or am I missing something obvious?
Nope. What we are looking for here is the operator "Generate Prediction Ranking". This operator from the Scoring group can be used after the operator "Apply Model" to identify the k most likely classes. Below is a simple example using the Iris data set which assigns the 2 most likely classes to each test example. This should do the trick.
Of course you could follow this with an operator "Generate Attributes" so that you only keep one single class if the model is very confident (for example whenever "confidence_1" is higher than 90%, replace "class_2" by missing or something like that).
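For readers who want to see the logic outside RapidMiner, here is a rough sketch in plain Python of the same idea: rank the classes by confidence, keep the top k, and drop the second class when the first one is very confident. This is not RapidMiner code and not how the operators are implemented; the function names `rank_classes` and `assign_topics`, the topic names, and the confidence values are all made up for illustration, and the 0.9 threshold simply mirrors the 90% mentioned above.

```python
def rank_classes(confidences, k=2):
    """Return the k (label, confidence) pairs with the highest confidence."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def assign_topics(confidences, k=2, threshold=0.9):
    """Keep the top-k labels, or only the first if its confidence exceeds threshold."""
    ranked = rank_classes(confidences, k)
    if ranked and ranked[0][1] > threshold:
        # Very confident in the top class: treat the remaining classes as missing.
        return [ranked[0][0]]
    return [label for label, _ in ranked]


# Hypothetical per-class confidences for two comments (illustrative values only).
mixed = {"Meaningful Work": 0.48, "Innovation": 0.41, "Teamwork": 0.11}
clear = {"Meaningful Work": 0.95, "Innovation": 0.03, "Teamwork": 0.02}

print(assign_topics(mixed))
print(assign_topics(clear))
```

With these made-up numbers, the first comment keeps two topics while the second keeps only one, because its top confidence is above the threshold.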
Hope this helps,
Ingo
Thanks folks! Let me check it out!