How best to analyse tweets? (Also help with rule association problem)
A colleague and I are currently carrying out clustering (K-Means and DBSCAN) as well as rule association on about 30,000 tweets for a project. Unfortunately, after many attempts the results are still incoherent, and despite our best efforts we have been able to draw few conclusions about the data.
Other than sentiment analysis, which I would like to carry out if I have time but which is rather difficult (so I have been told), what else could I do?
I am having particular difficulty with rule association. I managed to carry it out on the text, but I would also like to include the time the tweet was sent. Unfortunately, when I run the process, the rules include the word "Time_sent" without the actual time stated in the rules. How can I fix this?
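One possible cause, offered as a guess rather than a diagnosis: association rule learners generally work on yes/no items, so a raw time column tends to become a single item named after the column. A minimal Python sketch of a workaround is to bin the time and encode each bin as its own item, such as "Time_sent=morning". The bucket boundaries, the timestamp format, and the example tweets below are all made up for illustration.

```python
from datetime import datetime

def time_bucket(hour):
    """Map an hour (0-23) to a coarse time-of-day label (hypothetical bins)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def to_transaction(words, time_sent):
    """Build one transaction: the tweet's words plus a single time item."""
    # Assumed timestamp format; adjust to whatever the data actually uses.
    hour = datetime.strptime(time_sent, "%Y-%m-%d %H:%M:%S").hour
    return set(words) | {"Time_sent=" + time_bucket(hour)}

# Made-up example tweets: (tokens, time sent)
tweets = [
    (["traffic", "delay"], "2015-03-02 08:15:00"),
    (["traffic", "delay"], "2015-03-02 08:40:00"),
    (["concert", "tonight"], "2015-03-02 19:05:00"),
]
transactions = [to_transaction(words, sent) for words, sent in tweets]
```

With the time encoded this way, a rule can name the bin itself (for example, words that co-occur with "Time_sent=morning") instead of only the attribute name.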
Answers
I do a lot of Twitter analysis with the Text Mining extension, and I use clustering and association rules quite a bit. A large row count shouldn't scare you away; it's all the tokens you generate that will slow the process down. Do you do a lot of pruning when you process? I spend a lot of time on data prep, and I selectively tokenize hashtags, links, and Twitter handles.
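To illustrate what selective tokenizing can look like, here is a rough Python sketch, not the Text Mining extension's own operators. It drops links and handles, and either keeps hashtags as whole tokens or drops them. The regular expressions and the `keep_hashtags` flag are assumptions for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[a-z']+")

def tokenize(tweet, keep_hashtags=True):
    """Lowercase a tweet, strip links and handles, and return word tokens.

    Hashtags are removed from the running text and, if keep_hashtags is
    True, appended as whole tokens (e.g. "#monday") after the plain words.
    """
    text = tweet.lower()
    text = URL_RE.sub(" ", text)      # drop links
    text = HANDLE_RE.sub(" ", text)   # drop @handles
    hashtags = HASHTAG_RE.findall(text) if keep_hashtags else []
    text = HASHTAG_RE.sub(" ", text)  # remove hashtags from the word stream
    return WORD_RE.findall(text) + hashtags

tokens = tokenize("Stuck in traffic again @bob http://t.co/abc #Monday")
```

Handling these three token types separately keeps links and handles from inflating the attribute count, which is where most of the slowdown comes from.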
Hi Tom, we did spend a lot of time preparing the data. I am not sure how well we did, but we managed to reduce the number of word-attribute columns from about 7000 to about 900 to 1000 for every document we processed.
I managed to make some sense of the association rules I generated; unfortunately, it seems there is not much to say about the data.
The hashtag is not a problem: the data contained only one distinct hashtag, so we just removed the attribute. Luckily, the tweets were all related already. I use a percentual pruning method in the document process (below percent = 0.09/0.1, above percent = 100).
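For reference, this is roughly what percentual pruning does, as I understand it, written as a small Python sketch: a token is kept only if the share of documents containing it lies between the two thresholds. Whether the real operator treats the boundaries as inclusive is an assumption here, and the function name is made up.

```python
from collections import Counter

def prune_by_document_frequency(docs, below_percent=0.1, above_percent=100.0):
    """Drop tokens whose document frequency (in percent) falls outside
    [below_percent, above_percent]. Boundaries are assumed inclusive."""
    n_docs = len(docs)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keep = {
        tok for tok, count in doc_freq.items()
        if below_percent <= 100.0 * count / n_docs <= above_percent
    }
    return [[tok for tok in doc if tok in keep] for doc in docs]

# Made-up toy corpus; a 50 percent floor is used so the effect is visible.
docs = [["a", "b"], ["a", "c"], ["a"], ["a", "b"]]
pruned = prune_by_document_frequency(docs, below_percent=50.0)
```

With about 30,000 tweets, a floor of 0.1 percent means a word has to appear in roughly 30 tweets to survive, and an upper limit of 100 percent removes nothing at the top.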
I feel that although I have made more progress, my colleague still cannot make sense of the cluster data. I have tried helping him, but the data is quite strange, and I am not sure how to help him.
Should I conduct sentiment analysis? Or is it not necessary?
I guess the question is, what's the ultimate goal of this analysis? That will help determine which direction to take.