"Text Processing: Select Attribute and Weights in Process Documents From Data"

Frank Member Posts: 4 Contributor I
edited June 2019 in Help
Hello,

I am working on a text classification problem where my input consists of news articles.
There are two text attributes: the title of a news article (Title) and the full text (ArticleText).
As part of my research I am investigating the effect of assigning different weights to the title and full-text attributes. This is done in the "Process Documents from Data" operator with the property "Select Attribute and Weights". (I use TF-IDF weighting.)

I tested five cases of different weights: (4.0 Title, 1.0 ArticleText), (2.0 Title, 1.0 ArticleText), (1.0 Title, 1.0 ArticleText), (1.0 Title, 2.0 ArticleText), (1.0 Title, 4.0 ArticleText). Each weighting configuration runs in a separate process.

I was hoping that the TF-IDF score of words originating from the title would be multiplied by the corresponding weight, and the same for the full text. However, no matter how I set the weights, the output document-term matrices are all identical. Am I doing something wrong, or is there another way to achieve my goal?



I have attached a simplified process, where the Title attribute has weight 4.0 and ArticleText has weight 1.0.




cheers,

Frank

<connect from_op="Filter Stopwords (4)" from_port="document" to_op="Stem (4)" to_port="document"/>

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    For me, specifying weights works flawlessly. However, when you use TF-IDF, the weights do not result in an operation as simple as multiplying the final values by the weight. What happens is that the words are counted according to their weight, i.e. if you specify a weight of 4.0 for an attribute, each word/token that appears in that attribute is counted four times. You'll see that if you switch from TF-IDF to Term Occurrences for the vector_creation parameter of Process Documents.

    If your results are still always the same, maybe an operator that you use inside Process Documents dismisses those weights. To test that, please start with a very simple process, and iteratively add subsequent operators. That way you'll find out which operator breaks the weighting. If you can find out anything useful we would be very grateful if you posted your findings here.

    Best regards,
    Marius
  • Frank Member Posts: 4 Contributor I
    Hey Marius,

    Thanks for your quick response.
    I followed your advice and tried eliminating operators one by one inside the Process Documents operator, and quickly found the problem.
    When I remove the Stem (WordNet) operator, the resulting term matrices are different, so I suspect the weighting works.
    Even when I use Stem (Snowball), the resulting term matrices are different, so I will probably use that as a workaround.
    Btw, I have installed WordNet version 2.1.

    regards,

    Frank
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Ok, I'll forward it to the developers, then!

    Best regards,
    Marius
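
To make the counting behaviour described in the first answer concrete, here is a minimal Python sketch. It is not RapidMiner's implementation: the helper names `weighted_counts` and `tf_idf`, the two toy documents, and the textbook TF-IDF formula (tf = count / document total, idf = log(N / df)) are all assumptions made for illustration. It only shows why counting a token once per unit of attribute weight is not the same as multiplying the final TF-IDF value by that weight.

```python
import math
from collections import Counter

def weighted_counts(doc, weights):
    """Count every token once per unit of its attribute's weight (hypothetical helper)."""
    counts = Counter()
    for attr, text in doc.items():
        for token in text.lower().split():
            counts[token] += weights[attr]
    return counts

def tf_idf(count_rows):
    """Textbook TF-IDF (an assumption, not RapidMiner's exact formula)."""
    n_docs = len(count_rows)
    df = Counter()
    for row in count_rows:
        df.update(row.keys())  # document frequency: one increment per document
    result = []
    for row in count_rows:
        total = sum(row.values())
        result.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in row.items()})
    return result

# Toy documents with the two attributes from the question.
docs = [
    {"Title": "storm warning", "ArticleText": "heavy rain and storm expected"},
    {"Title": "market rally", "ArticleText": "shares rise as market opens"},
]

heavy_title = {"Title": 4.0, "ArticleText": 1.0}
equal = {"Title": 1.0, "ArticleText": 1.0}

counts_heavy = [weighted_counts(d, heavy_title) for d in docs]
counts_equal = [weighted_counts(d, equal) for d in docs]
tfidf_heavy = tf_idf(counts_heavy)
tfidf_equal = tf_idf(counts_equal)

# "storm" occurs once in the title and once in the text of the first document.
print(counts_equal[0]["storm"], counts_heavy[0]["storm"])
# "warning" occurs only in the title; its score changes, but not by a factor of 4.
print(tfidf_equal[0]["warning"], tfidf_heavy[0]["warning"])
```

In this sketch the term-occurrence counts change directly with the weights, while the TF-IDF values change by a different factor because the document total grows too. If two weight configurations produce identical matrices even for term occurrences, that points to something inside the subprocess dropping the weights, as the thread goes on to find.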