--Introduction
I have been hooked on sentiment analysis of text data in natural language processing. While working on it, I found net slang difficult to handle. So this time I decided to check for myself whether the sentiment analysis service of the Google Cloud Natural Language API (hereinafter, the Natural Language API) copes with net slang. (Caution) A verification of this scale cannot, by itself, conclusively determine whether the Natural Language API supports net slang.
Depending on the sentence, I thought the net-slang word "grass" (kusa) could stand in for "laughing" (warau), so I use this pair for the evaluation. For example, the following two sentences have the same meaning.
・ 3rt4 likes in 3 minutes, laughing
・ 3rt4 likes in 3 minutes is grass
I prepare multiple such sentences, evaluate them with the Natural Language API, and test whether there is a difference in the mean score between the "laughing" versions and the "grass" versions.
The procedure is as follows: collect text data with the Twitter API, preprocess it, score the "laughing" and "grass" versions with the Natural Language API, and test the difference in mean scores.
Since an application is required to use the Twitter API, I applied while following [1]. The application was approved in one day.
Now that the application has been approved, we can collect the text data, based on the code in [2]. Because preprocessing comes later, I write the acquired text data to a text file. I search with the keyword "laughing".
import json
from requests_oauthlib import OAuth1Session

# OAuth authentication part
CK = ""
CS = ""
AT = ""
ATS = ""
twitter = OAuth1Session(CK, CS, AT, ATS)

url = 'https://api.twitter.com/1.1/search/tweets.json'

keyword = 'laugh'
params = {
    'count': 100,  # Number of tweets to get
    'q': keyword,  # Search keyword
}

f = open('./data/1/backup1.txt', 'w')

req = twitter.get(url, params=params)
print(req.status_code)

if req.status_code == 200:
    res = json.loads(req.text)
    for line in res['statuses']:
        print(line['text'])
        f.write(line['text'] + '\n')
        print('*******************************************')
else:
    print("Failed: %d" % req.status_code)

f.close()
The search results are as follows.
・ Sure, I'm out of the hall, but sumo laughs
・ Because it's a place to laugh! Laugh!!
・ What's that wwww laughing wwww
Next, arrange the acquired text data. There are four tasks to be done here.
1. Remove "RT"
2. Remove mentions such as "@XXXX"
3. Keep only tweets in which "laughing" can be replaced with "grass"
4. Replace "laughing" with "grass" and save the pairs to a CSV file
Tasks 1 and 2 are implemented as follows. In addition, tweets that contained line breaks made task 3 very difficult, so I removed them as well.
import re

readF = open('./data/1/backup1.txt', 'r')
writeF = open('./data/1/preprocessing1.txt', 'w')

lines = readF.readlines()
for line in lines:
    if 'laugh' in line:
        # Removal of "RT"
        line = re.sub('RT ', "", line)
        # Removal of "@XXXX " (with trailing space) or "@XXXX"
        line = re.sub(r'(@\w*\W* )|(@\w*\W*)', "", line)
        writeF.write(line)

readF.close()
writeF.close()
Task 3 was the hardest. In cases such as the following, I thought "laughing" could be replaced with "grass" with high probability:
・ "laughing" at the end of the sentence
・ a Japanese period (kuten) right after "laughing"
・ "w" right after "laughing"
However, I was concerned that relying only on such patterns would bias the data. In the end, I judged each tweet manually and removed the text data that I determined could not be replaced; a rough sketch of the pattern-based idea is shown below for reference.
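This sketch is only an illustration (the regular expression and the probably_replaceable name are my own), not the procedure actually used, since the final judgment was made manually.

import re

# Rough heuristic sketch: treat "laugh" as likely replaceable with "grass"
# when it is followed by the end of the line, a Japanese period, or a run of "w".
REPLACEABLE = re.compile(r'laugh(。|w+|$)')

def probably_replaceable(line):
    return bool(REPLACEABLE.search(line.strip()))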
The number of samples is now 200.
Task 4 was implemented as follows.
import pandas as pd

count = 6
lines = []
for i in range(count):
    print(i)
    readF = open('./data/' + str(i + 1) + '/preprocessing' + str(i + 1) + '.txt')
    lines += readF.readlines()
    readF.close()

df = pd.DataFrame([], columns=['warau', 'kusa'])

replaceLines = []
for line in lines:
    replaceLines.append(line.replace('laugh', 'grass'))

df["warau"] = lines
df["kusa"] = replaceLines

df.to_csv("./data/preprocessing.csv", index=False)
The result of the processing so far is as shown in the image below.
--Google Cloud Natural Language API
The sentiment analysis service in the Google Cloud Natural Language API returns a sentiment score for the text. The closer the score is to 1, the more positive the text; the closer it is to -1, the more negative [3]. Besides sentiment analysis, the Google Cloud Natural Language API also offers content classification.
The program was implemented based on [4]. It passes the "laughing" and "grass" sentences to the Natural Language API and stores the results in lists. It then adds them to the pandas DataFrame as the columns "warauResult" and "kusaResult", and finally outputs a CSV file.
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import os
import pandas as pd

credential_path = "/pass/xxx.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

client = language.LanguageServiceClient()

warauResultList = []
kusaResultList = []

df = pd.read_csv('./data/preprocessing.csv')

for index, text in df.iterrows():
    # Remove "\n"
    text["warau"] = text["warau"].replace('\n', '')
    text["kusa"] = text["kusa"].replace('\n', '')

    # Analysis of "warau"
    document = types.Document(
        content=text["warau"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    warauResultList.append(sentiment.score)

    # Analysis of "kusa"
    document = types.Document(
        content=text["kusa"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    kusaResultList.append(sentiment.score)

df["warauResult"] = warauResultList
df["kusaResult"] = kusaResultList

df.to_csv("./data/result.csv", index=False)
The result of the processing so far is as shown in the image below.
The histogram of warauResult is as follows.
The histogram of kusaResult is as follows.
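For reference, the histograms can be regenerated from result.csv with a short matplotlib sketch like the following (the bin count of 20 is my own choice):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('./data/result.csv')

# Draw the two score distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df['warauResult'], bins=20)
axes[0].set_title('warauResult')
axes[1].hist(df['kusaResult'], bins=20)
axes[1].set_title('kusaResult')
for ax in axes:
    ax.set_xlabel('sentiment score')
    ax.set_ylabel('frequency')
plt.tight_layout()
plt.show()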
Suppose each follows a normal distribution.
We compare the values stored in warauResult with the values stored in kusaResult. This time, we perform a test of the difference of means for paired samples (a paired t-test; the test statistic is shown below). I referred to [5] and [6].
・ Null hypothesis: the score does not change when "laughing" is replaced with "grass".
・ Alternative hypothesis: the score changes when "laughing" is replaced with "grass".
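For reference, the statistic that stats.ttest_rel computes in the next step is the paired t-statistic, where d_i is the per-sentence score difference, d-bar its mean, s_d its unbiased standard deviation, and n the number of pairs (200 here):
\begin{aligned}
d_i = \mathrm{warauResult}_i - \mathrm{kusaResult}_i,
\qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
\end{aligned}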
The program looks like this:
from scipy import stats
import pandas as pd

# Test of the difference of means for paired samples
df = pd.read_csv('./data/result.csv')
print(stats.ttest_rel(df["warauResult"], df["kusaResult"]))
The results are as follows. Ttest_relResult(statistic=3.0558408995373356, pvalue=0.0025520814940409413)
The reference for stats.ttest_rel is [7].
Quote: "If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."
In other words, the p-value here is about 0.26%, which is smaller than even a 1% threshold, so the null hypothesis is rejected. Therefore, changing "laughing" to "grass" changes the score. The sample contains only sentences in which (in my subjective judgment) "laughing" can be replaced with "grass". Nevertheless, because the score changes, I conclude that the Natural Language API does not handle this net slang.
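The decision rule in the quoted passage can also be written out explicitly. A minimal sketch, assuming the result.csv from above and taking 5% as an example threshold:

from scipy import stats
import pandas as pd

df = pd.read_csv('./data/result.csv')

# Significance level; 5% is used here as an example (1% or 10% are also common)
alpha = 0.05

t_stat, p_value = stats.ttest_rel(df["warauResult"], df["kusaResult"])
print("t = %.4f, p = %.6f" % (t_stat, p_value))
if p_value < alpha:
    print("Reject the null hypothesis: the mean scores differ.")
else:
    print("Fail to reject the null hypothesis.")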
Interval estimation of the mean is performed for each of warauResult and kusaResult. I referred to [8].
\begin{aligned}
\bar{X}-z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
< \mu <
\bar{X}+z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\end{aligned}
The program looks like this:
from scipy import stats
import math
import pandas as pd

df = pd.read_csv('./data/result.csv')

print("Sample mean of warauResult", df['warauResult'].mean())
print("Sample mean of kusaResult", df['kusaResult'].mean())

# .var() gives the unbiased variance
print("Interval estimation of warauResult",
      stats.norm.interval(alpha=0.95,
                          loc=df['warauResult'].mean(),
                          scale=math.sqrt(df['warauResult'].var() / len(df))))
print("Interval estimation of kusaResult",
      stats.norm.interval(alpha=0.95,
                          loc=df['kusaResult'].mean(),
                          scale=math.sqrt(df['kusaResult'].var() / len(df))))
The results are as follows.
・ Sample mean of warauResult: 0.0014999993890523911
・ Sample mean of kusaResult: -0.061000001728534696
・ Interval estimation of warauResult: (-0.0630797610044764, 0.06607975978258118)
・ Interval estimation of kusaResult: (-0.11646731178466276, -0.005532691672406637)
Error range:
・ warauResult: approximately ±0.06458
・ kusaResult: approximately ±0.05546
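The error range above is the half-width of each 95% confidence interval, that is, 1.96 times the standard error. For reference, it could be computed directly like this (a small sketch, reusing df from the interval-estimation code above):

import math

# Half-width of the 95% confidence interval: z * sqrt(unbiased variance / n)
z = 1.96
error_warau = z * math.sqrt(df['warauResult'].var() / len(df))
error_kusa = z * math.sqrt(df['kusaResult'].var() / len(df))
print("Error range of warauResult: +/- %.5f" % error_warau)
print("Error range of kusaResult: +/- %.5f" % error_kusa)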
The sentiment score returned by the Natural Language API ranges from -1 to 1. I considered an error of roughly ±0.06 within this range to be small.
By the way, the required number of samples can be obtained from the error range, as shown in [9].
・ Target: warauResult
・ Confidence coefficient: 95%
・ Error range: ±0.06458
Under these conditions, the required number of samples works out to 200 (see the derivation below).
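The sample-size calculation below follows from solving the error-range expression for n, where s is the square root of the unbiased variance of warauResult and E = 0.06458:
\begin{aligned}
E = z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\quad\Longrightarrow\quad
n = \left(\frac{z_{\frac{\alpha}{2}}\, s}{E}\right)^2
= \left(\frac{1.96 \times s}{0.06458}\right)^2
\end{aligned}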
import numpy as np
#Since we do not know the standard deviation of the population, we substitute the square root of the unbiased variance.
rutoN = (1.96 * np.sqrt(df['warauResult'].var()))/ 0.06458
N = rutoN * rutoN
print(N)
The results are as follows. 200.0058661538003
・ Whether "laughing" can be replaced with "grass" was judged by one person, so the judgment is not objective. → Have multiple people evaluate.
・ The current way of collecting data cannot gather a large number of samples. → If a large number of samples is needed, find a pattern and consider automating the collection.
・ How to decide the error range. → I want a principled reason for what the error range should be.
I would like to participate in the Advent Calendar next year as well.
[1] https://qiita.com/kngsym2018/items/2524d21455aac111cdee
[2] https://qiita.com/tomozo6/items/d7fac0f942f3c4c66daf
[3] https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values
[4] https://cloud.google.com/natural-language/docs/quickstart-client-libraries#client-libraries-install-python
[5] https://bellcurve.jp/statistics/course/9453.html
[6] https://ohke.hateblo.jp/entry/2018/05/19/230000
[7] https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html
[8] https://ohke.hateblo.jp/entry/2018/05/12/230000
[9] https://toukeigaku-jouhou.info/2018/01/23/how-to-calculate-samplesize/