Defeated by a simple text analysis problem - user defined sentiment dictionary

nik_sargent · August 2018

Hello all, I'm new here and to rapidminer so please be gentle.

I have been defeated by what seems like a simple text analysis problem, pretty much a word-counting task, and have spent a whole day on it. I have googled and looked through the forums to find an answer, but am still stuck. I figured I could do this in Perl but that it would be quicker in RapidMiner; I guess I have been caught by the learning curve.

The task seems simple to me. I wish to load some text (for now one "document" will do, although I actually have about 10, each comprising around a hundred words). Within that text I wish to identify some words/phrases as a "+1" score and some others as a "-1" score. I want to count the occurrences of these words in the document and generate two numbers: A, the number of unique occurrences, and B, the sum of their scores. (What I ultimately want to calculate is a clarity score for the text, so these dictionary scores indicate whether a word aids or hinders clarity.)
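For readers who want to see the task spelled out, here is a minimal sketch in plain Python (not a RapidMiner process). It assumes that "A" means the number of distinct dictionary terms found and "B" means the sum of score times occurrences; the function name, the dictionary, and the sample text are all hypothetical. Dictionary keys are assumed to be lowercase, with single spaces inside phrases.

```python
import re
from collections import Counter

def score_text(text, dictionary, max_ngram=4):
    """Return (distinct dictionary terms found, sum of score * occurrences)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    # Check every n-gram up to max_ngram tokens so multi-word phrases can match.
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                counts[phrase] += 1
    unique_hits = len(counts)
    total = sum(dictionary[p] * c for p, c in counts.items())
    return unique_hits, total

# Hypothetical dictionary and text, for illustration only.
clarity = {"clear": 1, "simple": 1, "sort of": -1, "maybe": -1}
text = "The message is clear and simple. Maybe it is sort of long, but clear."
print(score_text(text, clarity))  # (4, 1)
```

Dictionary terms that never appear in the text simply contribute nothing here, so the dictionary can be a superset of the document vocabulary.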

This seemed very simple, so I turned to the Dictionary Based Sentiment operator. I built my process, but there is a problem using the "Apply" operator for DBS: it is "whited out" and says "deprecated" in the notes. It will not run. It seems like this would have been the perfect solution, but I can't use it.

So, I turned to the next solution as listed in the forum here: RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/

I got the example working, so started to modify the inputs to use my source data and dictionary:

The first problem I hit was the same as the user at the bottom of the above page: if you add to the dictionary and your dictionary term is not actually in the source text, the Apply Model operator will not run, as it expects the dictionary to be a subset. (Which seems odd, since in the general sense a dictionary is a superset of all possible definitions.) Anyway, I fixed that by laboriously editing down my dictionary against each source document. (So I have already negated the purpose of trying to automate this!)

The process now runs, but the next problem is that the Vector Linear Regression model does not calculate any values! In fact I just see this in its output (a bunch of "?" instead of the scores from the dictionary):

and, of course, in the results the "prediction" (which I simply want to be the sum of the "sentiment" values) is a bunch of "?"s.

I've tried for hours to figure out what is going wrong and why this won't work... is there a way to get this to work, or a simpler method, that would be as simple as using the "Dictionary Based Sentiment" operator?

Many thanks

Nik










<parameter key="repository_entry" value="../data/deflection message semantic dictionary"/>




<parameter key="Weight" value="1/Weight"/>

Invert all Weights for the Linear Regression



<parameter key="group_attribute" value="id"/>
<parameter key="index_attribute" value="phrase_ngram"/>
<parameter key="skip_constant_attributes" value="false"/>


<parameter key="replace_what" value="Weight_(.+)"/>
<parameter key="replace_by" value="$1"/>


<parameter key="default" value="zero"/>




<parameter key="label" value="1"/>



<parameter key="attribute_name" value="label"/>
<parameter key="target_role" value="label"/>



<parameter key="use_bias" value="false"/>


<parameter key="repository_entry" value="../data/deflection message content OR10 ABCD"/>



<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>



<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=".: ,-"/>



<parameter key="max_length" value="4"/>


































Build a table like:

good    bad
1/1     0
0       1/-1.5
Generate a constant label of 1
Build and process test data
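As far as can be read from the process notes above (1/Weight on the diagonal, a constant label of 1, use_bias set to false), the trick appears to be this: with one training row per term, the only coefficient that reproduces the label is the weight itself, because coefficient * (1/weight) = 1. The sketch below solves that directly in plain Python. It is an illustration of the idea under those assumptions, not the operator's actual implementation, and the function names are hypothetical.

```python
def fit_diagonal_regression(weights):
    """Solve coef[t] * (1 / weight[t]) = 1 for each term t (no bias term)."""
    coefficients = {}
    for term, weight in weights.items():
        if weight == 0:
            # 1/weight is undefined, so a zero-weight term cannot be encoded this way.
            raise ValueError(f"zero weight for {term!r}")
        x = 1.0 / weight               # diagonal entry of the training table
        coefficients[term] = 1.0 / x   # label (1) divided by the diagonal entry
    return coefficients

def predict(coefficients, term_counts):
    """Weighted sum of term occurrences; unknown terms contribute nothing."""
    return sum(coefficients.get(t, 0.0) * c for t, c in term_counts.items())

weights = {"good": 1.0, "bad": -1.5}   # the two terms from the note above
coef = fit_diagonal_regression(weights)
print(coef)
print(predict(coef, {"good": 3, "bad": 2}))
```

Note that a dictionary entry with a weight of zero has no 1/Weight value, so under this encoding such entries would have to be dropped or handled separately.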

Telcontar120 · August 2018

This definitely looks like one for @mschmitz since he is the creator of the latest DBSM.

But possibly your first problem (words present in the document but not in the dictionary) could be addressed via automation by first processing all your documents into a single corpus and then creating a unified wordlist from that. You can then use that wordlist to create your dictionary (and just leave the score of zero for words that you don't care about). You should then be able to use that same wordlist when processing all future texts (there is a wordlist import on the Process Documents operator specifically for this purpose), so you would never end up with a mismatch between the wordlist for the documents you want to score vs what was in your dictionary (until you decided to update that yourself).

Having said that, I tend to agree with your point here: in principle, it would be expected that the dictionary could be a superset. Actually it would be very nice if terms could be missing from either the dictionary or the document list (so neither needs to be an exact subset of the other).

I will leave it to @mschmitz to answer your second question about the Vector Linear Regression output, but to me it looks like the calculation is not performing correctly at all, maybe due to a data type mismatch?

MartinLiebig · August 2018

Hi,

quick answer: DBS is to be applied with the Apply Model (Documents) operator. I changed this a few versions ago. Where did we still reference the old Apply DBS operator?

BR,

Martin

nik_sargent · August 2018

I would have found the DBS Apply operator by an extensive Google search. No doubt there is a legacy post or blog article somewhere showing some examples; I can't find which one now, as I looked at so many...

nik_sargent · August 2018

Many thanks for your reply and suggestion. I will have a think about that and whether I can manage to craft something that does it. The thing is, I can't predict future documents (although they are domain-specific), and I want a generic solution that will work on an as-yet unwritten and unseen future document, i.e. I would like to define my dictionary now, independently of any future documents.

So, perhaps some kind of join between my master dictionary and the word list from the document being analysed would be a way to strip out the dictionary words that are not needed for this document. I suppose I should have thought about that myself.

I will explore the typing of my weighting as you suggest, although the source is just a regular numeric field in Excel, so, I dunno, maybe I should make 1 into 1.0 or something?

nik_sargent · August 2018

Just to add: I have now got the calculations working. For some reason the "1/Weight" calculation was the culprit (I can't say I understand why that is there). For now, I changed it to "1*weight" and the calculation now seems to work. I haven't checked it for accuracy, but it is at least now producing numerical values.

MartinLiebig · August 2018

Hi @nik_sargent,

did you get DBS working using the Apply Model operator?

BR,

Martin

nik_sargent · August 2018

@mschmitz Hi Martin, yes. Apologies, I implied it but didn't state it in the above reply.

I'm still irked (not in an angry way, just bemused) that the dictionary cannot be arbitrary. Now that the algorithm is basically working, I'm hoping I can build a process to reduce the dictionary on the fly to only those terms in the documents, to prevent the "subset" errors with the "Apply Model" operator.

I wonder, is there any chance of you giving (or point me to) a brief explanation of why the linear vector regression model works for this problem? One thing that has confused me is that it is giving non-integer results. (e.g. 17.99999800007)
So, although my sentiment weights are only ever 1 or -1, it can't just be calculating: prediction = termAweight * termApresence + termBweight * termBpresence.... etc. ????

thanks

Nik
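One possible explanation for results such as 17.99999800007, and it is only a guess since the thread does not confirm how the operator fits the model, is that the regression is solved numerically or with a small regularisation penalty rather than exactly. The sketch below uses the closed-form ridge solution for a single training row x = 1/weight with label 1, which is coefficient = x / (x^2 + ridge). The penalty value is hypothetical and chosen only to show the effect.

```python
def ridge_coefficient(weight, ridge):
    """Closed-form ridge solution for one training row x = 1/weight with label 1."""
    x = 1.0 / weight
    return x / (x * x + ridge)

ridge = 1e-7  # hypothetical penalty, chosen only for illustration
coef_plus = ridge_coefficient(1.0, ridge)    # slightly below +1
coef_minus = ridge_coefficient(-1.0, ridge)  # slightly above -1
prediction = 18 * coef_plus                  # eighteen hits on +1 terms
print(coef_plus, coef_minus, prediction)
```

With any small positive penalty the coefficients land just short of +1 and -1, so the weighted sum is still weight times presence, only with weights that are not exact integers. Rounding the prediction would recover the integer score in that case.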

Telcontar120 · August 2018

You don't need a join to handle the issue of the master dictionary and future documents. The Process Documents operator already has a wordlist input port. When you connect a wordlist to that port, it processes the new documents only with respect to that existing wordlist. So once you define the wordlist for your dictionary (as described in my previous post) you can use that wordlist on any future documents. Words that they might contain that are not in the existing wordlist will simply be ignored.
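To illustrate the idea of processing new documents against a fixed wordlist, here is a small sketch in plain Python rather than RapidMiner. It assumes single-word terms only; the wordlist, weights, and function name are hypothetical. Every wordlist entry always gets a column (zero if absent), and tokens outside the wordlist are ignored, so neither side needs to be a subset of the other.

```python
import re

def term_occurrences(text, wordlist):
    """Count occurrences of each wordlist entry; tokens outside the wordlist are ignored."""
    counts = {term: 0 for term in wordlist}
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        if token in counts:
            counts[token] += 1
    return counts

# Hypothetical master wordlist and weights, defined before any document is seen.
wordlist = ["clear", "simple", "maybe", "jargon"]
weights = {"clear": 1, "simple": 1, "maybe": -1, "jargon": -1}

vector = term_occurrences("A clear, simple note. Clear enough, maybe.", wordlist)
score = sum(weights[t] * n for t, n in vector.items())
print(vector, score)  # {'clear': 2, 'simple': 1, 'maybe': 1, 'jargon': 0} 2
```

Because the vector always has the same columns as the dictionary, a model built on the dictionary can be applied to any future document without editing either one.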
