[solved] Processing txt create sentiment analysis from mined twitter msgs
Hi,
I am working on a project to create a (simple) sentiment analysis from several large text files (between 2 and 20 GB each) of mined Twitter messages.
I have no computer science background and only found RapidMiner the other day. Now I am curious whether it can be used for my purpose.
All tweets are stored in a simple text file in the following format:
T 2009-06-07 02:07:41
U http://twitter.com/cyberplumber
W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE..http://tinyurl.com/5th9sw
I would like to create a sentiment index (positive / negative) for each single day.
The source for the sentiment index should be as simple as possible, so I thought I would just define a few adjectives for each end of the spectrum. Additionally, as a separate index, I would like to count the positive and negative smileys in each tweet.
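To make the idea concrete, here is a minimal Python sketch of such an index, not a finished method. The word and smiley lists are hypothetical placeholders, and the function names (`word_score`, `smiley_score`, `daily_index`) are made up for illustration. It assumes each tweet is available as a (timestamp, text) pair with the timestamp in the `2009-06-07 02:07:41` form shown above.

```python
from collections import defaultdict

# Hypothetical word and smiley lists; replace them with your own.
POSITIVE_WORDS = {"good", "great", "happy", "love"}
NEGATIVE_WORDS = {"bad", "sad", "awful", "hate"}
POSITIVE_SMILEYS = (":)", ":-)", ":D")
NEGATIVE_SMILEYS = (":(", ":-(")


def word_score(text):
    """Return (positive word count) minus (negative word count) for one tweet."""
    score = 0
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'()")  # drop surrounding punctuation
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


def smiley_score(text):
    """Return (positive smiley count) minus (negative smiley count)."""
    positive = sum(text.count(s) for s in POSITIVE_SMILEYS)
    negative = sum(text.count(s) for s in NEGATIVE_SMILEYS)
    return positive - negative


def daily_index(tweets):
    """Sum both scores per day from an iterable of (timestamp, text) pairs."""
    totals = defaultdict(lambda: {"words": 0, "smileys": 0, "tweets": 0})
    for timestamp, text in tweets:
        day = timestamp[:10]  # "YYYY-MM-DD" part of the timestamp
        totals[day]["words"] += word_score(text)
        totals[day]["smileys"] += smiley_score(text)
        totals[day]["tweets"] += 1
    return dict(totals)
```

Dividing the `words` or `smileys` total by `tweets` would give a per-day average, which may be easier to compare between days with very different tweet volumes.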
As the dataset is 70 GB in total, I will probably have to create a (PostgreSQL) database first? I am currently trying to find a way to get the text files into a proper SQL table (a first for me!). As my source is not a CSV, and the fields are separated by a letter (T/U/W) followed by a tab instead of by commas, I am also not quite sure how to do this.
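For reference, the T/U/W layout can be read record by record with a few lines of Python. This is only a sketch under assumptions: every record consists of a T line, a U line and a W line in that order, each letter is followed by a tab, and the function name `read_tweets` is made up for illustration.

```python
import io

USER_PREFIX = "http://twitter.com/"


def read_tweets(lines):
    """Yield (timestamp, user, text) tuples from an iterable of lines."""
    timestamp = user = None
    for line in lines:
        line = line.rstrip("\r\n")
        # Split "T<tab>value" into the letter and the rest of the line.
        tag, _, value = line.partition("\t")
        if tag == "T":
            timestamp = value
        elif tag == "U":
            # Keep only the user name if the usual URL prefix is present.
            user = value[len(USER_PREFIX):] if value.startswith(USER_PREFIX) else value
        elif tag == "W":
            # W closes a record, so emit it and start over.
            yield timestamp, user, value
            timestamp = user = None


# Small in-memory sample standing in for a real file.
sample = io.StringIO(
    "T\t2009-06-07 02:07:41\n"
    "U\thttp://twitter.com/cyberplumber\n"
    "W\tSPC Severe Thunderstorm Watch 339\n"
    "\n"
)
for record in read_tweets(sample):
    print(record)
```

An open file object can be passed in place of the sample, because Python reads a file lazily when it is iterated line by line, so even a multi-gigabyte file is not loaded into memory at once. The resulting tuples could be written to a database or aggregated directly.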
So my general questions:
Is it possible to use RapidMiner to perform this kind of sentiment analysis?
Is there a viable way to use RapidMiner on those large text files directly and avoid creating an SQL table (which first requires parsing the text files)?
Which tutorials or articles can you recommend? (I found the Vancouver Data ones and they seem good.)
If somebody here is willing to "coach" me for a couple of hours to get me on track with my project in return for a small compensation ($20/hr), I would very much appreciate it. Just send me a message so we can exchange Skype details.
Thank you for reading!
Edit:
OK, I used the following Python script to import the tweets into a PostgreSQL database:
#!/usr/bin/env python3
"""Import tweets in T/U/W record format into a PostgreSQL table."""
import sys

import psycopg2

USER_PREFIX = "http://twitter.com/"

db = psycopg2.connect(host="localhost", port=12345, database="db",
                      user="postgres", password="pw")
# Isolation level 0 is autocommit: every INSERT is committed immediately.
db.set_isolation_level(0)


class Tweet:
    def __init__(self, date=None, user=None, text=None):
        self.date = date
        self.user = user
        self.text = text


def insert_into_db(cursor, tweet):
    print("insert", tweet.date, tweet.user, tweet.text)
    try:
        cursor.execute(
            "INSERT INTO tweets (timestamp, userid, tweet) VALUES (%s, %s, %s)",
            (tweet.date, tweet.user, tweet.text))
    except Exception as e:
        print("ERROR", e)


def process_file(file_object, cursor):
    # Iterate line by line: the file is still read lazily, but no line is
    # cut in half the way fixed-size chunks would cut it.
    current = Tweet()
    for line in file_object:
        line = line.rstrip("\n")
        if line.startswith("T\t"):
            current.date = line[2:]
        elif line.startswith("U\t"):
            current.user = line[2 + len(USER_PREFIX):]
        elif line.startswith("W\t"):
            # W is the last field of a record: store it and start a new one.
            current.text = line[2:]
            insert_into_db(cursor, current)
            current = Tweet()


# Assumption: the files are UTF-8; undecodable bytes are replaced.
with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    process_file(f, db.cursor())
db.close()
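The script above sends one INSERT per tweet, which may be slow for files of this size. One possible variation, sketched here under assumptions, is to group rows into batches and pass each batch to the cursor's `executemany` method with one commit per batch. The helper names `batched` and `insert_batches` and the batch size of 1000 are made up for illustration, the connection is assumed not to be in autocommit mode, and the table layout is the one from the INSERT statement above.

```python
from itertools import islice

INSERT_SQL = "INSERT INTO tweets (timestamp, userid, tweet) VALUES (%s, %s, %s)"


def batched(rows, size=1000):
    """Yield lists of up to `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_batches(db, rows, size=1000):
    """Insert (timestamp, userid, tweet) tuples, committing once per batch.

    `db` is an open database connection, for example from psycopg2.connect().
    """
    cursor = db.cursor()
    for batch in batched(rows, size):
        cursor.executemany(INSERT_SQL, batch)
        db.commit()
```

Whether this is actually faster on a given setup would need to be measured.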
And I use the following structure in RapidMiner (taken from the bi cortex example):
<operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="30">
<operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="45" y="75">
<operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="514" y="30">
<operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model (2)" width="90" x="648" y="345">
<connect from_op="Read Database" from_port="output" to_op="Set Role (3)" to_port="example set input"/>