- This is the second trial in this series on Japanese sentiment dictionaries.
- Following the previous article on the "Word Emotion Polarity Correspondence Table", this article uses the **"Japanese Evaluation Polarity Dictionary (Noun Edition)"** to check its performance and consider its practicality.
- The dictionary contains 13,314 words in total. The polarity labels form three categories: positive (+1), negative (-1), and words that cannot be judged as either (0).
(1) Acquisition of the "Japanese Evaluation Polarity Dictionary (Noun Edition)"
1. Upload dictionary data to Colab
- Upload the dictionary file, downloaded in advance from the official website to your local PC, to Colaboratory.
from google.colab import files
uploaded = files.upload()
- When the file-selection UI appears, choose the file to start the upload.
2. Read dictionary data into data frame
- Each line of the "Japanese Evaluation Polarity Dictionary (Noun Edition)" is registered in the following tab-separated format:
Word \t Emotion polarity value [p, n, e] \t Typical phrase pattern with its attribute (evaluation/emotion/state; objective or subjective)
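As a rough sketch of this format, a line can be split on tabs into its three fields. The sample entry below is illustrative only, not copied from the actual file:

```python
# Parse one dictionary line into its three tab-separated fields.
# The sample entry is a hand-written illustration of the format.
sample_line = "需要\tp\t〜がある・高まる（評価・感情）主観"

word, polarity, attribute = sample_line.split("\t")
print(word)      # the headword
print(polarity)  # one of p / n / e
print(attribute) # the phrase-pattern/attribute annotation
```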
import pandas as pd
# Prevent misalignment of column names and values caused by East Asian character widths
pd.set_option('display.unicode.east_asian_width', True)
pndic_1 = pd.read_csv('pn.csv.m3.120408.trim', names=['word_pn_oth'])
print(pndic_1)
3. Expand the data frame with a delimiter
- Split the column into multiple columns using '\t' as the delimiter, specifying expand=True as an argument so the result is returned as a data frame.
pndic_2 = pndic_1['word_pn_oth'].str.split('\t', expand=True)
print(pndic_2)
4. Remove noise from the emotion polarity values
- In principle, the emotion polarity value takes one of p (positive), n (negative), or e (neutral). However, counting the occurrences of each value shows that the data actually contains other entries as well.
senti_score = pd.Series(pndic_2[1])
senti_score.value_counts()
- Therefore, keep only the three categories p, n, and e, and delete the rest as noise.
pndic_3 = pndic_2[(pndic_2[1] == 'p') | (pndic_2[1] == 'e') | (pndic_2[1] == 'n')]
print(pndic_3)
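An equivalent noise filter can be written with Series.isin, which avoids chaining '|' conditions. A minimal sketch, using a toy frame that stands in for pndic_2:

```python
import pandas as pd

# Toy stand-in for pndic_2: column 1 holds the polarity labels,
# including one noise value ('a') to be filtered out.
toy = pd.DataFrame({0: ["demand", "virus", "volume", "junk"],
                    1: ["p", "n", "e", "a"]})

# Keep only rows whose label is one of the three valid categories
filtered = toy[toy[1].isin(["p", "n", "e"])]
print(filtered)
```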
5. Replace the emotion polarity values with numbers
- Remove the unnecessary third column, then replace the emotion polarity values p, n, e with the numbers [p: +1, n: -1, e: 0].
#Delete unnecessary columns
pndic_4 = pndic_3.drop(pndic_3.columns[2], axis=1)
pndic_4[1] = pndic_4[1].replace({'p':1, 'e':0, 'n':-1})
print(pndic_4)
6. Convert the data frame to dict type
- Use dict() to build a dictionary with column 0 as the keys and column 1 as the values, and use it as the lookup source for emotion values.
keys = pndic_4[0].tolist()
values = pndic_4[1].tolist()
dic = dict(zip(keys, values))
print(dic)
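As an aside, the same word-to-score mapping can be built directly with pandas via set_index()/to_dict(), skipping the intermediate key and value lists. A sketch with a toy frame standing in for pndic_4:

```python
import pandas as pd

# Toy stand-in for pndic_4: column 0 = word, column 1 = numeric score
toy = pd.DataFrame({0: ["virus", "demand", "record"],
                    1: [-1, 1, 0]})

# Index by the word column, then convert the score column to a dict
dic_alt = toy.set_index(0)[1].to_dict()
print(dic_alt)
```

Note that, like dict(zip(...)), later occurrences of a duplicated key silently overwrite earlier ones, which is relevant to the duplicate words examined in section (4).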
(2) Preprocessing of the text to be analyzed
1. Specify the text
text = 'The nationwide import volume of spaghetti reached a record high by October, and customs suspects that the background is the so-called "needing demand" that has increased due to the spread of the new coronavirus infection. According to Yokohama Customs, the amount of spaghetti imported from ports and airports nationwide was approximately 142,000 tons as of the end of October. This was a record high, exceeding the import volume of one year three years ago by about 4000 tons. In addition, macaroni also had an import volume of more than 11,000 tons by October, which is almost the same as the import volume of one year four years ago, which was the highest ever.'
lines = text.split("。")
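One caveat: splitting on "。" leaves a trailing empty string when the text ends with the delimiter, so it can be safer to drop empty elements right away, as this small sketch shows:

```python
# Splitting on the Japanese full stop leaves a trailing empty string,
# so filter out empty elements in the same step.
text_demo = "First sentence。Second sentence。"
lines_demo = [s for s in text_demo.split("。") if s]
print(lines_demo)  # two sentences, no empty trailing element
```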
2. Create an instance of morphological analysis
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
- Import MeCab and create a tagger instance with the output mode "-Ochasen".
- Although not part of the main processing, the morphological analysis result for the first sentence is shown below as an example.
import MeCab
mecab = MeCab.Tagger("-Ochasen")
#Illustrate the results of morphological analysis on the first line
print(mecab.parse(lines[0]))
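MeCab's -Ochasen mode emits one tab-separated line per token: surface form, reading, base form, part of speech, then conjugation information. The extraction loop in the next step relies on fields [2] (base form) and [3] (POS). Since MeCab is not assumed available here, the sketch uses a hand-written line illustrating that layout, not real MeCab output:

```python
# Hand-written illustration of one -Ochasen token line
# (fields: surface, reading, base form, POS)
chasen_line = "スパゲッティ\tスパゲッティ\tスパゲッティ\t名詞-一般"

fields = chasen_line.split()
print(fields[2])                 # base form
print(fields[3].split('-')[0])   # top-level POS category, e.g. 名詞 (noun)
```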
3. List words per sentence based on morphological analysis
- Since this dictionary is a "Noun Edition", no emotion values are attached to other parts of speech. Still, to observe that behavior, the same four categories as in the "Word Emotion Polarity Correspondence Table" are extracted: nouns, adjectives, verbs, and adverbs.
word_list = []
for l in lines:
    temp = []
    for v in mecab.parse(l).splitlines():
        fields = v.split()
        # The POS field (index 3) only exists on regular token lines
        if len(fields) >= 4:
            # MeCab outputs POS tags in Japanese; compare the top-level category
            if fields[3].split('-')[0] in ['名詞', '形容詞', '動詞', '副詞']:
                temp.append(fields[2])  # base form
    word_list.append(temp)
# Remove empty elements
word_list = [x for x in word_list if x != []]
(3) Positive/negative judgment of sentences based on emotion polarity values
1. Acquisition of emotional polarity value
- Get the words belonging to each sentence and their emotion polarity values, and output them as a data frame.
result = []
# Sentence-level processing
for sentence in word_list:
    temp = []
    # Word-level processing
    for word in sentence:
        score = dic.get(word)
        temp.append((word, score))
    result.append(temp)
# Display a data frame for each sentence
for i in range(len(result)):
    print(lines[i], '\n', pd.DataFrame(result[i], columns=["word", "score"]), '\n')
- The results for all four sentences are shown in the table below, in order from the left.
- NaN indicates an unregistered word; as mentioned above, this is a natural result for a dictionary of nouns. Such unregistered words occur frequently even in dictionaries with a large number of entries.
- Looking at the words that do carry emotion polarity values, it is no surprise that "virus" and "infection" in the first sentence are negative. On the other hand, "quantity" and "demand" could be negative depending on the context, but they are positive here, matching this news article about the special demand for spaghetti. Perhaps "past" is negative because it carries a backward-looking image.
- Now, with this small number of elements effectively limited to nouns, let us see what results we get for judging each sentence as positive or negative.
2. Mean value of emotional polarity value for each sentence
# Calculate the average value for each sentence
mean_list = []
for i in result:
    temp = []
    for j in i:
        if j[1] is not None:
            temp.append(float(j[1]))
    mean = sum(temp) / len(temp)
    mean_list.append(mean)
# Display as a data frame
print(pd.DataFrame(mean_list, columns=["mean"], index=lines[0:4]))
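One caveat about the loop above: it divides by len(temp), so a sentence containing no scored words would raise ZeroDivisionError. A guarded helper, defaulting to 0.0 (the function name sentence_mean is just for illustration):

```python
# Average the non-None scores of one sentence, treating a sentence
# with no scored words as neutral (0.0) instead of raising an error.
def sentence_mean(scores):
    vals = [float(s) for s in scores if s is not None]
    return sum(vals) / len(vals) if vals else 0.0

print(sentence_mean([1, None, -1, 0]))  # 0.0
print(sentence_mean([None, None]))      # 0.0 (no scored words)
```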
- First, the **1st sentence (emotion polarity value: -0.142857)** talks about the positive phenomenon of "increasing demand" in its first half, set against the negative background of the "spread of the virus infection". In other words, the sentiment reverses within the sentence, and the judgment comes out slightly negative.
- Next, the **2nd sentence (1.000000)** gives a concrete account of the increased demand, with an "import volume of 142,000 tons". Considering spaghetti alone from an economic point of view, this is certainly good news, and the judgment is completely positive.
- Furthermore, the **3rd sentence (0.000000)** states the objective fact that "the import volume is the highest ever", which is neither positive nor negative, so the judgment is completely neutral.
- Finally, in the **4th sentence (0.250000)**, the topic turns to macaroni, whose "import volume is second only to spaghetti", so the judgment is also positive, though not as strongly as for spaghetti.
- The impression is that the results match each context surprisingly well. However, since only nouns carry scores, the number of elements is small, and the polarity takes only the two values +1 and -1, the judgment may diverge from the context in some cases.
(4) Verification of the "Japanese Evaluation Polarity Dictionary (Noun Edition)"
1. Check the composition ratio of positive and negative
- Check the positive, negative, and neutral ratios of the dict-type data created from the "Japanese Evaluation Polarity Dictionary (Noun Edition)".
#Number of positive words
keys_pos = [k for k, v in dic.items() if v == 1]
cnt_pos = len(keys_pos)
#Number of negative words
keys_neg = [k for k, v in dic.items() if v == -1]
cnt_neg = len(keys_neg)
#Neutral word count
keys_neu = [k for k, v in dic.items() if v == 0]
cnt_neu = len(keys_neu)
print("Percentage of positives:", '{:.3f}'.format(cnt_pos / len(dic)), "(", cnt_pos, "words)")
print("Percentage of negatives:", '{:.3f}'.format(cnt_neg / len(dic)), "(", cnt_neg, "words)")
print("Percentage of neutrals:", '{:.3f}'.format(cnt_neu / len(dic)), "(", cnt_neu, "words)")
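The three counts can also be obtained in a single pass with collections.Counter over the dict's values. A sketch, with a toy dict standing in for the real dictionary:

```python
from collections import Counter

# Toy stand-in for the word -> score dict
toy_dic = {"a": 1, "b": -1, "c": -1, "d": 0}

# One pass tallies all three polarity categories at once
counts = Counter(toy_dic.values())
print(counts[1], counts[-1], counts[0])  # positive, negative, neutral counts
```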
- Words with a positive emotion polarity value account for 25.3% of the total, while the negative ratio is higher at 37.4%; still, the imbalance is not as extreme as in the previous "Word Emotion Polarity Correspondence Table".
- In addition, neutral entries, which were almost absent from the "Word Emotion Polarity Correspondence Table" (13 out of 52,671 words), account for 37.4% of the "Japanese Evaluation Polarity Dictionary (Noun Edition)", the same rate as negative, i.e., more than one third of the whole. This is presumably because the "Word Emotion Polarity Correspondence Table" assigns real-valued scores from -1 to +1, so even near-neutral words end up classified as slightly positive or negative.
2. Check for duplicate words
- Compare the number of elements before and after converting dictionary data from data frame to dict type.
print("Number of elements before conversion to dict type:", len(pndic_4))
print("Number of elements after conversion to dict type:", len(dic), "\n")
- After conversion to dict type, there are 3 fewer elements. As shown below, this matches the number of unique elements before the conversion; in other words, 3 words are duplicated.
pndic_list = pndic_4[0].tolist()
print("Unique number of elements before conversion to dict type:", len(set(pndic_list)))
- Let's look at the contents of the three duplicated words. Use the Counter() class from the Python standard library's collections module to get the elements in descending order of occurrence.
import collections
print(collections.Counter(pndic_list))
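Counter prints every word in the list; to show only the duplicated entries, keep the items whose count exceeds 1. A sketch, with a toy list standing in for pndic_list:

```python
from collections import Counter

# Toy stand-in for pndic_list, with two duplicated entries
toy_list = ["a", "b", "b", "c", "c", "d"]

# Keep only the words that appear more than once
dupes = [w for w, c in Counter(toy_list).items() if c > 1]
print(dupes)  # ['b', 'c']
```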
- The three duplicated words, "discipline", "credit", and "cancellation", are registered in the original "Japanese Evaluation Polarity Dictionary (Noun Edition)" as follows.