"Text Processing: Select Attribute and Weights in Process Documents From Data"
Hello,
I am working on a text classification problem where my input consists of news articles.
There are two text attributes: the title of a news article (Title) and the full text (ArticleText).
As part of my research I am investigating the effect of assigning different weights to the title and full-text attributes. This is done in the "Process Documents from Data" operator with the parameter "Select Attribute and Weights". (I use TF-IDF weighting.)
I tested five weight configurations, each in a separate process: (4.0 Title, 1.0 ArticleText), (2.0 Title, 1.0 ArticleText), (1.0 Title, 1.0 ArticleText), (1.0 Title, 2.0 ArticleText), (1.0 Title, 4.0 ArticleText).
I was hoping that the TF-IDF score of words originating from the title would be multiplied by the corresponding weight, and likewise for the full text. However, no matter how I set the weights, the resulting document-term matrices are all identical. Am I doing something wrong, or is there another way to achieve my goal?
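To make the expected behaviour concrete, here is a minimal Python sketch of the weighting I have in mind. It is only an illustration under my own assumptions, not a description of how RapidMiner implements "Select Attribute and Weights": the names `tokenize` and `weighted_tfidf` are hypothetical, the tokenizer is a naive whitespace split, and the TF-IDF variant is the plain count times log(N/df) without length normalization.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real process would also filter stopwords and stem.
    return text.lower().split()


def weighted_tfidf(docs, weights):
    """docs: list of {attribute: text}; weights: {attribute: float}.

    Each occurrence of a term counts `weight` times, depending on the
    attribute it came from, before the IDF factor is applied.
    """
    weighted_counts = []
    for doc in docs:
        counts = Counter()
        for attr, text in doc.items():
            for tok in tokenize(text):
                counts[tok] += weights.get(attr, 1.0)
        weighted_counts.append(counts)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in weighted_counts:
        df.update(counts.keys())

    n = len(docs)
    return [
        {term: c * math.log(n / df[term]) for term, c in counts.items()}
        for counts in weighted_counts
    ]


# Made-up example data, not taken from my real corpus.
docs = [
    {"Title": "markets rally", "ArticleText": "stocks rose as markets opened"},
    {"Title": "storm warning", "ArticleText": "heavy rain and storm expected"},
]
title_heavy = weighted_tfidf(docs, {"Title": 4.0, "ArticleText": 1.0})
text_heavy = weighted_tfidf(docs, {"Title": 1.0, "ArticleText": 4.0})
```

With this scheme the two configurations give different vectors: a word that appears only in the title (such as "rally") gets four times the score under the title-heavy weights that it gets under the text-heavy weights. That is the kind of difference I expected to see between my five processes.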
I have attached a simplified process in which the Title attribute has weight 4.0 and ArticleText has weight 1.0.
cheers,
Frank
<connect from_op="Filter Stopwords (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
Answers
If your results are still always the same, one of the operators you use inside Process Documents may be discarding those weights. To test that, please start with a very simple process and add the remaining operators one at a time, checking the output after each step. That way you'll find out which operator breaks the weighting. If you find out anything useful, we would be very grateful if you posted your findings here.
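The same idea can be sketched in Python. This is only an illustration of the debugging strategy, not RapidMiner code: the stages `lowercase`, `drop_short` and `lossy_stem` are hypothetical stand-ins for operators, tokens are modelled as (text, weight) pairs, and `lossy_stem` is deliberately written to reset the weight so that there is something to find.

```python
def lowercase(tokens):
    # Well-behaved stage: changes the text, keeps the weight.
    return [(tok.lower(), w) for tok, w in tokens]


def drop_short(tokens):
    # Well-behaved stage: filters tokens, keeps the weight of the survivors.
    return [(tok, w) for tok, w in tokens if len(tok) > 2]


def lossy_stem(tokens):
    # Hypothetical faulty stage: rebuilds each token and resets its weight to 1.0.
    return [(tok.rstrip("s"), 1.0) for tok, _ in tokens]


def first_stage_losing_weights(stages, input_a, input_b):
    """Apply the stages one by one to two differently weighted inputs.

    Returns the name of the first stage after which both outputs are
    identical, or None if the outputs stay different throughout.
    """
    a, b = input_a, input_b
    for stage in stages:
        a, b = stage(a), stage(b)
        if a == b:
            return stage.__name__
    return None


words = ["Markets", "rally", "as", "stocks", "rise"]
heavy = [(w, 4.0) for w in words]  # e.g. the 4.0 configuration
light = [(w, 1.0) for w in words]  # e.g. the 1.0 configuration

culprit = first_stage_losing_weights([lowercase, drop_short, lossy_stem], heavy, light)
```

In a real process the comparison would be between the document-term matrices produced by two weight configurations after each added operator, but the principle is the same: the first step at which the two outputs stop differing points to the operator that drops the weights.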
Best regards,
Marius
Thanks for your quick response.
I followed your advice and removed the operators inside the Process Documents operator one by one, and quickly found the problem.
When I remove the Stem (WordNet) operator, the resulting term matrices are different, so I suspect the weighting works in that case.
When I use Stem (Snowball), the resulting term matrices are also different, so I will probably use that as a workaround.
By the way, I have WordNet version 2.1 installed.
regards,
Frank