"Training a Text classification model with more than 1 training data inputs/outputs"

svtorykh Member Posts: 35 Guru
edited June 2019 in Help

Hi All!

I have a process that takes a text column and a topic column, builds a model, and trains it. I can then use this model to assign a topic to new text.

However, I now need to add a second topic column, and maybe a third and fourth, to tell the model that these topics can also apply. For example, a text document can contain both the Innovation and Teamwork topics. I want the model to recognize both topics in a piece of text and provide the output accordingly. Any idea how to implement this in RapidMiner?

Thanks very much for your support!

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @svtorykh, welcome to the Community. Can you please post the XML from your process using the button?

    Thanks.

    Scott

  • svtorykh Member Posts: 35 Guru

    Hi Scott,

    Here you go!









    <parameter key="random_seed" value="2001"/>






























































    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
    <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
















































    <parameter key="distribution" value="AUTO"/>







    <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
    <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>












    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>



    <portSpacing port="sink_test set results" spacing="0"/>
































































    <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
    <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
    <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
    <connect from_op="Transform Cases (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
    <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>












    <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Process Documents from Data" from_port="word list" to_op="Store" to_port="input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_op="Store (2)" to_port="input"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="test result set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
    <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Retrieve Model" from_port="output" to_op="Apply Model (2)" to_port="model"/>







  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hmm, I'm not 100% sure I understand, but basically the data you use to train your model has to be robust enough to handle any kind of data that comes from your test set. Any unforeseen input types from your test set will naturally be classified poorly. Perhaps this helps?






































    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
    <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>






















    <connect from_port="training set" to_op="Gradient Boosted Trees" to_port="training set"/>
    <connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>









    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>



    <portSpacing port="sink_test set results" spacing="0"/>


































    <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
    <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
    <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
    <connect from_op="Transform Cases (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
    <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>








    <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Process Documents from Data" from_port="word list" to_op="Store" to_port="input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
    <connect from_op="Retrieve Wordlist" from_port="output" to_op="Process Documents from Data (2)" to_port="word list"/>
    <connect from_op="Read Excel (2)" from_port="output" to_op="Process Documents from Data (2)" to_port="example set"/>
    <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
    <connect from_op="Apply Model (2)" from_port="model" to_port="result 2"/>








  • svtorykh Member Posts: 35 Guru

    Thanks for the reply, Scott.

    Let me rephrase the problem in business language.

    I have a set of text comments, with one topic assigned to every comment in my data set.

    For example, "I like doing my job" is tagged as Meaningful Work, and "We have great innovative products" is tagged as Innovation. The comment is in column A and the topic in column B. This data set is used to train the model. I can then take this model and assign the same topics to comments that are not tagged, i.e. classify/categorize the text.

    Sometimes a comment may contain multiple topics, e.g. "I like my job as I'm continuously innovating every day". Ideally, I would tag this comment as both Meaningful Work and Innovation, but the second topic would need to go in column C. So my question is how to build the model using a training set with multiple topics assigned to the same comment. Each comment must stay as one row in Excel, with topics added as columns. Is it clearer now?
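
    To make the layout concrete, here is a minimal Python sketch of the spreadsheet described above (not RapidMiner itself). The rows reuse the example comments from this post; the tuple layout, the `collect_topics` helper, and the variable names are assumptions made for illustration only.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical rows mirroring the spreadsheet: comment in column A,
# first topic in column B, optional second topic in column C.
Row = Tuple[str, Optional[str], Optional[str]]

rows: List[Row] = [
    ("I like doing my job", "Meaningful Work", None),
    ("We have great innovative products", "Innovation", None),
    ("I like my job as I'm continuously innovating every day",
     "Meaningful Work", "Innovation"),
]


def collect_topics(row: Row) -> Tuple[str, Set[str]]:
    """Keep the comment as one record and gather its non-empty topic cells."""
    comment, *topics = row
    return comment, {topic for topic in topics if topic}


# One entry per comment, each carrying the set of topics assigned to it.
labelled = [collect_topics(row) for row in rows]

# Every distinct topic that appears anywhere in the topic columns.
all_topics = sorted({topic for _, topics in labelled for topic in topics})

for comment, topics in labelled:
    print(comment, "->", sorted(topics))
```

    Each comment stays a single record, and adding a column D or E would only mean extending the tuple.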

    Regards,

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Is this what you're talking about at the meetup next week, @IngoRM? Or am I missing something obvious?

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Nope. What we are looking for here is the operator "Generate Prediction Ranking". This operator from the Scoring group can be used after the operator "Apply Model" to identify the k most likely classes. Below is a simple example using the Iris data set which assigns the 2 most likely classes to each test example. This should do the trick.

    Of course, you could follow this with a "Generate Attributes" operator so that you keep only a single class when the model is very confident (for example, whenever "confidence_1" is higher than 90%, replace "class_2" with a missing value or something like that).
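
    The ranking-plus-threshold idea can also be sketched outside RapidMiner. The Python below is only an illustration of the logic, not the operator's implementation: the function name `top_k_classes`, its parameters, and the confidence values are all made up for the example, and it drops the second class instead of setting it to missing.

```python
from typing import Dict, List, Tuple


def top_k_classes(
    confidences: Dict[str, float],
    k: int = 2,
    single_class_threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    """Return the k most likely classes, or only the top one if it is very confident."""
    # Sort classes from highest to lowest confidence and keep the first k.
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)[:k]
    # Mirror the "confidence_1 higher than 90%" rule: keep a single class.
    if ranked and ranked[0][1] > single_class_threshold:
        return ranked[:1]
    return ranked


# Illustrative confidences for two comments (values are invented).
mixed = {"Innovation": 0.48, "Meaningful Work": 0.41, "Teamwork": 0.11}
confident = {"Innovation": 0.95, "Teamwork": 0.03, "Meaningful Work": 0.02}

print(top_k_classes(mixed))      # two topics are kept
print(top_k_classes(confident))  # only the top topic is kept
```

    The threshold of 0.9 matches the 90% figure above; in practice it would need tuning against real predictions.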

    Hope this helps,

    Ingo

























    <connect from_op="Retrieve Iris" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Naive Bayes" to_port="training set"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Generate Prediction Ranking" to_port="example set input"/>
    <connect from_op="Generate Prediction Ranking" from_port="example set output" to_port="result 1"/>





  • svtorykh Member Posts: 35 Guru

    Thanks folks! Let me check it out!
