"Text Processing: Select Attribute and Weights in Process Documents From Data"

Frank · April 2013

Hello,

I am working on a text classification problem where my input consists of news articles.
There are two text attributes: the title of a news article (Title) and the fulltext (ArticleText).
As part of my research I am investigating the effect of assigning different weights to the title and fulltext attributes. This is done in the "Process Documents from Data" operator with the parameter "Select Attribute and Weights". (I use TF-IDF weighting.)

I tested five weight configurations: (4.0 Title, 1.0 ArticleText), (2.0 Title, 1.0 ArticleText), (1.0 Title, 1.0 ArticleText), (1.0 Title, 2.0 ArticleText), (1.0 Title, 4.0 ArticleText). Each configuration runs in a separate process.

I was hoping that the TF-IDF score of words originating from the title would be multiplied by the corresponding weight, and likewise for the fulltext. However, no matter how I set the weights, the resulting document-term matrices are all identical. Am I doing something wrong, or is there another way to achieve my goal?

I have attached a simplified process, where the Title attribute has weight 4.0 and ArticleText has weight 1.0.

cheers,

Frank

<connect from_op="Filter Stopwords (4)" from_port="document" to_op="Stem (4)" to_port="document"/>

MariusHelf · April 2013

For me, specifying weights works flawlessly. However, when you use TF-IDF, the weights do not amount to something as simple as multiplying the final values by the weight. Instead, the words are counted according to their weight, i.e. if you specify a weight of 4.0 for an attribute, each word/token that appears in that attribute is counted four times. You'll see that if you switch from TF-IDF to Term Occurrences for the vector_creation parameter of Process Documents.
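
To make the counting behaviour described above concrete, here is a minimal Python sketch. It is not RapidMiner code: the function names, the sample documents, and the textbook TF-IDF formula (tf = count / total, idf = log(N / df), no vector normalization) are all assumptions chosen for illustration, and the exact formula RapidMiner uses may differ.

```python
import math
from collections import Counter

def weighted_counts(doc, weights):
    """Count each token once per unit of its attribute's weight."""
    counts = Counter()
    for attr, text in doc.items():
        for token in text.lower().split():
            counts[token] += weights[attr]
    return counts

def tf_idf(docs, weights):
    """Textbook TF-IDF over weighted counts (an assumed formula, not RapidMiner's)."""
    all_counts = [weighted_counts(d, weights) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for c in all_counts for t in c)
    vectors = []
    for c in all_counts:
        total = sum(c.values())
        vectors.append({t: (v / total) * math.log(n / df[t]) for t, v in c.items()})
    return vectors

# Hypothetical sample data using the attribute names from the question.
docs = [
    {"Title": "markets rally", "ArticleText": "stocks rose as markets opened"},
    {"Title": "storm warning", "ArticleText": "heavy rain expected as storm nears"},
]
equal = {"Title": 1.0, "ArticleText": 1.0}
title4 = {"Title": 4.0, "ArticleText": 1.0}

occ_equal = weighted_counts(docs[0], equal)
occ_title4 = weighted_counts(docs[0], title4)
print(occ_equal["rally"], occ_title4["rally"])

tfidf_equal = tf_idf(docs, equal)[0]["rally"]
tfidf_title4 = tf_idf(docs, title4)[0]["rally"]
print(tfidf_title4 / tfidf_equal)
```

With term occurrences, the title-only word "rally" goes from 1 to 4, so the weight is directly visible. Under this TF-IDF variant the ratio is smaller than 4, because the weighted counts also increase the document's total token count used in the term frequency.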

If your results are still always the same, maybe an operator that you use inside Process Documents discards those weights. To test that, please start with a very simple process and add the remaining operators back one at a time. That way you'll find out which operator breaks the weighting. If you find out anything useful, we would be very grateful if you posted your findings here.
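
The isolation procedure above can be sketched in Python as follows. Everything here is hypothetical: the operators are plain functions on (token, weight) pairs, and `faulty_stem` contains an invented bug that resets weights purely so the search has something to find. It does not model how any RapidMiner operator actually works.

```python
from collections import Counter

def tokenize(doc, weights):
    """Turn a document into (token, weight) pairs, one per word."""
    return [(tok, weights[attr])
            for attr, text in doc.items()
            for tok in text.lower().split()]

def filter_stopwords(tokens):
    stop = {"as", "the", "a"}
    return [(t, w) for t, w in tokens if t not in stop]

def faulty_stem(tokens):
    # Invented bug for illustration: rebuilds tokens and drops their weight.
    return [(t.rstrip("s"), 1.0) for t, _ in tokens]

def term_matrix(docs, weights, operators):
    """Weighted term occurrences per document after applying the operators."""
    rows = []
    for doc in docs:
        tokens = tokenize(doc, weights)
        for op in operators:
            tokens = op(tokens)
        counts = Counter()
        for t, w in tokens:
            counts[t] += w
        rows.append(dict(counts))
    return rows

def first_operator_breaking_weights(docs, weights_a, weights_b, operators):
    """Add operators one at a time and return the name of the first one after
    which two different weightings give identical matrices, else None.
    Assumes the matrices already differ when no operators are applied."""
    for i in range(1, len(operators) + 1):
        active = operators[:i]
        if term_matrix(docs, weights_a, active) == term_matrix(docs, weights_b, active):
            return active[-1].__name__
    return None

docs = [
    {"Title": "markets rally", "ArticleText": "stocks rose as markets opened"},
    {"Title": "storm warning", "ArticleText": "heavy rain expected as storm nears"},
]
weights_a = {"Title": 4.0, "ArticleText": 1.0}
weights_b = {"Title": 1.0, "ArticleText": 4.0}

culprit = first_operator_breaking_weights(
    docs, weights_a, weights_b, [filter_stopwords, faulty_stem])
print(culprit)
```

Comparing two clearly different weightings after each added step keeps the check simple: as long as the matrices differ, the weights are still being honoured.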

Best regards,
Marius

Frank · April 2013

Hey Marius,

Thanks for your quick response.
I followed your advice and tried eliminating operators one by one inside the Process Documents operator, and quickly found the problem.
When I remove the Stem (WordNet) operator, the resulting term matrices are different, so I suspect the weighting works.
Even when I use Stem (Snowball), the resulting term matrices are different, so I will probably use that as a workaround.
Btw, I have installed WordNet version 2.1.

regards,

Frank

MariusHelf · April 2013

Ok, I'll forward it to the developers, then!

Best regards,
Marius
