Sentiment analysis is one of the things I'd like to try when analyzing Twitter data. There are various methods, but I'd like to start with the simplest example and (hopefully) work up to more advanced ones over time.
The data to be analyzed is Twitter again. Until now it was acquired through the Twitter REST APIs, but this time it comes from the Twitter Stream API (/streaming/overview). I'll import the Twitter data, quantify the sentiment of each tweet, and store the results in the database.
Please refer to the previous article for an explanation of loading Twitter data into MongoDB.
First of all, the preparation: import the various libraries, declare utility functions, and connect to the DB.
# -*- coding: utf-8 -*-
from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, time, exceptions, sys, datetime, pytz, re, unicodedata, pymongo
import oauth2 as oauth
import urllib2 as urllib
import MeCab as mc
from collections import defaultdict
from pymongo import MongoClient
from httplib import IncompleteRead
import numpy as np
import logging
from logging import FileHandler, Formatter
import logging.config
connect = MongoClient('localhost', 27017)
db = connect.word_info
posi_nega_dict = db.posi_nega_dict
db2 = connect.twitter
streamdata = db2.streamdata
def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date, '%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))
def mecab_analysis(sentence):
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    sentence = sentence.replace('\n', ' ')
    text = sentence.encode('utf-8')
    node = t.parseToNode(text)
    result_dict = defaultdict(list)
    for i in range(140):  # a tweet is at most 140 characters
        if node.surface != "":  # exclude the BOS/EOS header and footer nodes
            word_type = node.feature.split(",")[0]
            # MeCab emits Japanese POS tags: 形容詞=adjective, 動詞=verb, 名詞=noun, 副詞=adverb
            if word_type in ["形容詞", "動詞", "名詞", "副詞"]:
                plain_word = node.feature.split(",")[6]  # base (dictionary) form
                if plain_word != "*":
                    result_dict[word_type.decode('utf-8')].append(plain_word.decode('utf-8'))
        node = node.next
        if node is None:
            break
    return result_dict
def logger_setting():
    import logging
    from logging import FileHandler, Formatter
    import logging.config
    logging.config.fileConfig('logging_tw.conf')
    logger = logging.getLogger('filelogger')
    return logger
logger = logger_setting()
KEYS = { # list the keys for your account below
    'consumer_key': '**********',
    'consumer_secret': '**********',
    'access_token': '**********',
    'access_secret': '**********',
}
The logger reads the following configuration file, logging_tw.conf:
# logging_tw.conf
[loggers]
keys=root, filelogger
[handlers]
keys= fileHandler
[formatters]
keys=logFormatter
[logger_root]
level=DEBUG
handlers=fileHandler
[logger_filelogger]
level=DEBUG
handlers=fileHandler
qualname=filelogger
propagate=0
[handler_fileHandler]
class=handlers.RotatingFileHandler
level=DEBUG
formatter=logFormatter
args=('logging_tw.log',)
[formatter_logFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=
Sentiment is quantified using the Japanese evaluation polarity dictionaries created by the Inui-Okazaki Laboratory at Tohoku University, so first download them from here and store them in the same folder as the .py file.
For the Japanese evaluation polarity dictionary (words), positive terms are quantified as 1 and negative terms as -1, then imported into MongoDB. For the Japanese evaluation polarity dictionary (nouns), terms labeled p are quantified as 1, terms labeled e as 0, and terms labeled n as -1, then imported into MongoDB. The code is below.
# Import the positive/negative dictionaries into MongoDB
# Japanese Evaluation Polarity Dictionary (words) ver.1.0 (December 2008 version)
# Positive terms are quantified as 1, negative terms as -1
with open("wago.121808.pn.txt", 'r') as f:
    for l in f.readlines():
        l = l.split('\t')
        l[1] = l[1].replace(" ", "").replace('\n', '')
        # the label column is ポジ (positive) / ネガ (negative)
        value = 1 if l[0].split('（')[0] == "ポジ" else -1
        posi_nega_dict.insert({"word": l[1].decode('utf-8'), "value": value})

# Japanese Evaluation Polarity Dictionary (nouns) ver.1.0 (December 2008 version)
# Terms labeled p are 1, e are 0, n are -1
with open("pn.csv.m3.120408.trim", 'r') as f:
    for l in f.readlines():
        l = l.split('\t')
        if l[1] == "p":
            value = 1
        elif l[1] == "e":
            value = 0
        elif l[1] == "n":
            value = -1
        posi_nega_dict.insert({"word": l[0].decode('utf-8'), "value": value})
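To confirm the import worked, a quick spot check against the collection is enough. The word below is a made-up example; use any word you expect the dictionaries to contain.

# Spot check of the imported dictionary (the example word is hypothetical)
print posi_nega_dict.count()  # total number of imported entries
entry = posi_nega_dict.find_one({'word': u'楽しい'})
if entry is not None:
    print entry['word'].encode('utf-8'), entry['value']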
Since steps 1 and 2 gave us a database of sentiment values for individual words, we now add processing that converts a whole sentence into a sentiment value. This time, as the "elementary" version, we do nothing sophisticated and simply take the average:
{\rm sentiment\, value\, of\, the\, sentence} \, = \, \frac{1}{n}\sum_{i=1}^{n} x_i
where x_i is the sentiment value of the i-th scored word and n is the number of scored words. The sentence's sentiment value therefore falls between -1 and 1 regardless of the number of words, which makes sentences comparable.
# Sentiment value setup (load into a dict for hash lookup to speed up search)
pn_dict = {data['word']: data['value'] for data in posi_nega_dict.find({}, {'word': 1, 'value': 1})}

def isexist_and_get_data(data, key):
    return data[key] if key in data else None

# Returns the sentiment value of a given sentence (word list) in the range -1 to 1
# (1: most positive, -1: most negative)
def get_sentiment(word_list):
    val = 0
    score = 0
    word_count = 0
    val_list = []
    for word in word_list:
        val = isexist_and_get_data(pn_dict, word)
        val_list.append(val)
        if val is not None and val != 0:  # if found, add the score and count the word
            score += val
            word_count += 1
    logger.debug(','.join(word_list).encode('utf-8'))
    logger.debug(val_list)
    return score / float(word_count) if word_count != 0 else 0.
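As a quick usage check, you can call get_sentiment() on a hand-made word list. The words below are hypothetical; the actual values depend on what the dictionaries contain.

# Hypothetical check: if the three words score +1, +1, -1,
# the sentence value is (1 + 1 - 1) / 3 = 0.333...
print get_sentiment([u'楽しい', u'嬉しい', u'悲しい'])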
This is the code that downloads tweet data from the Twitter Stream API. While downloading tweets, morphological analysis is performed with MeCab, and the words are separated into lists of nouns, verbs, adjectives, and adverbs: a so-called Bag of Words.
The sentiment value is then derived for those words with the get_sentiment() function defined earlier, and stored in MongoDB together with the tweet.
# ----- Stream data import ----- #
consumer = oauth.Consumer(key=KEYS['consumer_key'], secret=KEYS['consumer_secret'])
token = oauth.Token(key=KEYS['access_token'], secret=KEYS['access_secret'])
url = 'https://stream.twitter.com/1.1/statuses/sample.json'
params = {}
request = oauth.Request.from_consumer_and_token(consumer, token, http_url=url, parameters=params)
request.sign_request(oauth.SignatureMethod_HMAC_SHA1(), consumer, token)
res = urllib.urlopen(request.to_url())

def get_list_from_dict(result, key):
    if key in result.keys():
        result_list = result[key]
    else:
        result_list = []
    return result_list

cnt = 1
try:
    for r in res:
        data = json.loads(r)
        if 'delete' in data.keys():
            pass
        else:
            if data['lang'] in ['ja']:  # ['ja','en','und']:
                result = mecab_analysis(data['text'].replace('\n', ''))
                # the keys are MeCab's Japanese POS tags
                noun_list = get_list_from_dict(result, u'名詞')        # nouns
                verb_list = get_list_from_dict(result, u'動詞')        # verbs
                adjective_list = get_list_from_dict(result, u'形容詞')  # adjectives
                adverb_list = get_list_from_dict(result, u'副詞')      # adverbs
                item = {'id': data['id'], 'screen_name': data['user']['screen_name'],
                        'text': data['text'].replace('\n', ''),
                        'created_datetime': str_to_date_jp(data['created_at']),
                        'verb': verb_list, 'adjective': adjective_list,
                        'noun': noun_list, 'adverb': adverb_list}
                if 'lang' in data.keys():
                    item['lang'] = data['lang']
                else:
                    item['lang'] = None
                # Add the sentiment analysis result ####################
                word_list = [word for k in result.keys() for word in result[k]]
                item['sentiment'] = get_sentiment(word_list)
                streamdata.insert(item)
                if cnt % 1000 == 0:
                    logger.info("%d, " % cnt)
                cnt += 1
except IncompleteRead as e:
    logger.error('=== error contents ===')
    logger.error('type:' + str(type(e)))
    logger.error('args:' + str(e.args))
    logger.error('message:' + str(e.message))
    logger.error('e self:' + str(e))
    try:
        if type(e) == exceptions.KeyError:
            logger.error(data.keys())
    except:
        pass
except Exception as e:
    logger.error('=== error contents ===')
    logger.error('type:' + str(type(e)))
    logger.error('args:' + str(e.args))
    logger.error('message:' + str(e.message))
    logger.error('e self:' + str(e))
    try:
        if type(e) == exceptions.KeyError:
            logger.error(data.keys())
    except:
        pass
except:
    logger.error("error.")

logger.info("finished.")
Up to this point, the analysis has been a simple method that just assigns a sentiment value to each word and averages them. As for future development, spam classification will be an issue on the preprocessing side, and handling the relationships between words is an issue in the scoring itself. In particular, a phrase like "not cute" is split into "cute" and "not"; since "not" negates it, the +1.0 for the positive expression "cute" should naturally be cancelled by "not" and become -1.0, but at present only "cute" is scored and the result is +1.0, the exact opposite.
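Before bringing in a dependency parser, a crude stopgap is at least conceivable: flip a word's polarity when the next token in the word list is the negation ない. The sketch below is only an illustration of the idea, not the method this article adopts, and it assumes ない survives as a token in the Bag of Words.

# Minimal sketch (not this article's method): flip polarity when the
# immediately following token is the negation ない
def get_sentiment_naive_negation(word_list):
    score, word_count = 0, 0
    for i, word in enumerate(word_list):
        val = pn_dict.get(word)
        if val is None or val == 0:
            continue
        if i + 1 < len(word_list) and word_list[i + 1] == u'ない':
            val = -val  # "cute" + "not" -> negative
        score += val
        word_count += 1
    return score / float(word_count) if word_count != 0 else 0.

This breaks as soon as the negation is not adjacent to the word it modifies, which is exactly why dependency analysis is needed.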
In order to handle this correctly, we need to work out which word "not" modifies, via a technique called dependency analysis. In the next section, I'd like to start by explaining how to install the dependency-analysis library CaboCha.
So here I'd like to cover installing the dependency-analysis library CaboCha http://taku910.github.io/cabocha/ on a Mac. The installation took quite a lot of time, so I hope this will be helpful.
** Download CaboCha ** https://drive.google.com/folderview?id=0B4y35FiV1wh7cGRCUUJHVTNJRnM&usp=sharing#list
A library called CRF++ is required to install CaboCha. ** CRF++ page ** http://taku910.github.io/crfpp/#install
** CRF++ download ** https://drive.google.com/folderview?id=0B4y35FiV1wh7fngteFhHQUN2Y1B5eUJBNHZUemJYQV9VWlBUb3JlX0xBdWVZTWtSbVBneU0&usp=drive_web#list
After downloading, unzip it, then make & install. Some environment variables and libraries are also required, so setting them up is described below as well.
tar zxfv CRF++-0.58.tar
cd CRF++-0.58
./configure
make
sudo make install
export LIBRARY_PATH="/usr/local/include:/usr/local/lib:"
export CPLUS_INCLUDE_PATH="/usr/local/include:/opt/local/include"
export OBJC_INCLUDE_PATH="/usr/local/include:/opt/local/lib"
brew tap homebrew/dupes
brew install libxml2 libxslt libiconv
brew link --force libxml2
brew link --force libxslt
brew link --force libiconv
tar zxf cabocha-0.69.tar.bz2
cd cabocha-0.69
./configure --with-mecab-config=`which mecab-config` --with-charset=UTF8
make
make check
sudo make install
#[output: install information]
#.././install-sh -c -d '/usr/local/share/man/man1'
#/usr/bin/install -c -m 644 cabocha.1 '/usr/local/share/man/man1'
#./install-sh -c -d '/usr/local/bin'
#/usr/bin/install -c cabocha-config '/usr/local/bin'
#./install-sh -c -d '/usr/local/etc'
#/usr/bin/install -c -m 644 cabocharc '/usr/local/etc'
cd cabocha-0.69/python
python setup.py install
cp build/lib.macosx-10.10-intel-2.7/_CaboCha.so /Library/Python/2.7/site-packages
cp build/lib.macosx-10.10-intel-2.7/CaboCha.py /Library/Python/2.7/site-packages
The installation steps above were put together with reference to the following sites.
http://qiita.com/nezuq/items/f481f07fc0576b38e81d#1-10 http://hotolab.net/blog/mac_mecab_cabocha/ http://qiita.com/t_732_twit/items/a7956a170b1694f7ffc2 http://blog.goo.ne.jp/inubuyo-tools/e/db7b43bbcfdc23a9ff2ad2f37a2c72df
Let's try dependency analysis on a test sentence.
# -*- coding: utf-8 -*-
import CaboCha
c = CaboCha.Parser()
# "Soseki handed this book to the woman who saw Ryunosuke."
sentence = "漱石はこの本を龍之介を見た女性に渡した。"
tree = c.parse(sentence)
print tree.toString(CaboCha.FORMAT_TREE)
print tree.toString(CaboCha.FORMAT_LATTICE)
The result of executing this code is as follows.
output
      漱石は-----------D
          この-D       |
            本---D     |
        龍之介-D |     |
            見た-D     |
              女性に-D
                渡した。
EOS
* 0 6D 0/1 -2.475106
漱石	名詞,固有名詞,人名,名,*,*,漱石,ソウセキ,ソウセキ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
* 1 2D 0/0 1.488413
この	連体詞,*,*,*,*,*,この,コノ,コノ
* 2 4D 0/1 0.091699
本	名詞,一般,*,*,*,*,本,ホン,ホン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
* 3 4D 0/1 2.266675
龍之介	名詞,固有名詞,人名,名,*,*,龍之介,リュウノスケ,リュウノスケ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
* 4 5D 0/1 1.416783
見	動詞,自立,*,*,一段,連用形,見る,ミ,ミ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
* 5 6D 0/1 -2.475106
女性	名詞,一般,*,*,*,*,女性,ジョセイ,ジョセイ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
* 6 -1D 0/1 0.000000
渡し	動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS
The lines starting with "*" are the analysis results, and the word lines that follow each one make up that clause (bunsetsu).
The number right after * is the clause number. The next number is the clause number of its head (the clause it depends on), or -1 if it has none; the "D" that follows can be ignored.
The next pair of numbers gives the positions of the head word and the function word within the clause.
The last number is the dependency score; roughly speaking, the larger the value, the more plausible the dependency.
So the first clause, number 0, is "漱石は" (Soseki), and since its head is 6D, it depends on clause 6, "渡した" (handed).
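The same information can also be pulled out programmatically. Below is a minimal sketch using the chunk accessors of the CaboCha Python binding (chunk.link, chunk.token_pos, chunk.token_size, chunk.score); it walks the tree parsed above and prints each clause together with the clause it depends on.

# Minimal sketch: list each clause and its head clause
# (chunk.link is the head clause index, -1 for the root)
for i in range(tree.chunk_size()):
    chunk = tree.chunk(i)
    surface = ''.join(tree.token(j).surface
                      for j in range(chunk.token_pos, chunk.token_pos + chunk.token_size))
    print '%d -> %dD : %s (score %.3f)' % (i, chunk.link, surface, chunk.score)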
In this article we got as far as installing the dependency-analysis library CaboCha. In the next article, I'll apply it to the tweet data.
References: Japanese Evaluation Polarity Dictionary, Inui-Okazaki Laboratory, Tohoku University. Nozomi Kobayashi, Kentaro Inui, Yuji Matsumoto, Kenji Tateishi, Toshikazu Fukushima. Collecting Evaluative Expressions for Opinion Extraction. Journal of Natural Language Processing, Vol.12, No.3, pp.203-222, 2005. Masahiko Higashiyama, Kentaro Inui, Yuji Matsumoto. Acquiring Noun Polarity Based on Selectional Preferences of Predicates. Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pp.584-587, 2008.