I'm Someya (@kumi-someya), an engineer at iXIT Corporation, and I'm in charge of day 23 of the XTech Group 2 Advent Calendar 2020. I'll analyze inquiry data using a little natural language processing, which was my major at university!
I usually spend my days writing SQL whenever the sales side requests a data extraction, but this all started when I suddenly thought, "Come to think of it, I've never properly looked at the contents of our inquiries..."
So this time, to identify the parts of the service that users find inconvenient, I'd like to quantify free-form data such as inquiries!
[Purpose] To clarify where users are dissatisfied by expressing free-form data as numbers.
[Hypothesis] If we can extract frequently used words, we may learn where users are likely to stumble.
[Summary] Perform a simple natural language analysis in Python using the open source morphological analysis engine MeCab.
Python: A programming language that needs no introduction. It is widely used because it has many libraries that are useful for natural language processing and analysis. I'll be programming in Python this time as well.
Morphological analysis: A technique that divides sentences into words (morphemes). (Strictly speaking, morpheme ≠ word, but for simplicity I'll treat them as roughly the same here.)
MeCab: An open source morphological analysis engine. Importing it lets you perform morphological analysis.
① Environment (Jupyter Notebook + MeCab)
② csv of inquiry contents
** 1. Install Anaconda to use Jupyter Notebook **
If you don't use Jupyter Notebook, please skip this step.
Installation of Anaconda: https://www.python.jp/install/anaconda/index.html
If you are not using Anaconda: "Memo to prepare Python environment on Mac and set Jupyter Notebook"
** 2. Install MeCab **
Run the command for MeCab installation in your terminal.
$ brew install mecab mecab-ipadic
It would not run... It seems that Homebrew, which provides the brew command, was not installed, so I installed Homebrew with the following command. (It takes about 20 minutes.)
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then install MeCab with the earlier command (brew install ...). After the installation is complete, check that it runs as follows.
$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
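In MeCab's default output format with the IPA dictionary, each line holds the surface form, a tab, and then comma-separated feature fields, with the part of speech in field 0 and the base form in field 6. As a rough sketch of how that layout can be read, the following pure-Python snippet parses such text without calling MeCab at all. The names `Morpheme` and `parse_mecab_output` are hypothetical helpers made up for this illustration, and the field positions assume the IPA dictionary.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    surface: str    # the text as it appeared in the input
    pos: str        # part of speech (feature field 0)
    base_form: str  # dictionary form (feature field 6, IPA dictionary assumed)


def parse_mecab_output(output: str) -> List[Morpheme]:
    """Parse text in MeCab's default output format into Morpheme tuples."""
    morphemes = []
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, _, feature = line.partition("\t")
        fields = feature.split(",")
        # Fall back to the surface form when the base form is missing or "*".
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        morphemes.append(Morpheme(surface, fields[0], base))
    return morphemes


# Two sample lines in the default format, followed by the EOS marker.
sample = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)
parsed = parse_mecab_output(sample)
print(parsed[0].pos)  # 名詞
```

The same two fields (part of speech and base form) are the ones the analysis code later in this article reads from `node.feature`.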
Morphological analysis seems to be working fine! MeCab itself works, but it can't be used from Python yet, so use the pip command to make MeCab available in Python.
$ pip install mecab-python3
Now that MeCab can be used from a notebook, let's start using it! After launching Anaconda Navigator, you should be able to start Jupyter Notebook from its "Launch" button.
Once Jupyter Notebook is running, let's try morphological analysis with MeCab right away! If it executes like this, you're all set. (I'll include the code I ran, just in case.)
import MeCab  # load MeCab
mecab_chasen = MeCab.Tagger("-Ochasen")  # load the dictionary
m_data = mecab_chasen.parse("This year is almost over.")  # the text you want to analyze
print(m_data)
This time I will use inquiry data (2,831 rows). Please obtain yours from the database of the service you work on.
contacts.csv
"content","created"
"Exit test","2019-05-30 17:17:31"
"I feel that the FAQ such as how to use it in the store and the number of times it is used is a strange link.","2019-06-01 06:51:40"
"I want stores in front of Seijo Gakuen in Setagaya, Soshigaya Okura, Chitose Funabashi, and Kyodo.","2019-06-03 11:42:57"
"I cannot go to the main registration screen from the URL of the temporary registration completion email and cannot do the main registration.","2019-06-03 17:31:14"
"I haven't received the temporary registration completion email. What should I do?","2019-06-03 20:11:04"
"I get a valid email address, but I don't understand what it means.","2019-06-03 20:34:03"
(The following is omitted)
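Because the inquiry text itself can contain commas, the file needs a real CSV parser rather than a plain `split(",")`. As a minimal sketch, assuming the two-column layout shown above, the standard library's `csv.DictReader` reads the header row as keys so each inquiry can be accessed by column name. The in-memory `sample_csv` below is only a stand-in for contacts.csv, built from two of the rows above.

```python
import csv
import io

# In-memory stand-in for contacts.csv (two rows copied from the sample above).
sample_csv = io.StringIO('''"content","created"
"Exit test","2019-05-30 17:17:31"
"I get a valid email address, but I don't understand what it means.","2019-06-03 20:34:03"
''')

# DictReader consumes the header row and uses it as the keys of each data row.
rows = list(csv.DictReader(sample_csv))
contents = [row["content"] for row in rows]

print(len(contents))  # 2
print(contents[0])    # Exit test
```

Note that the comma inside the second inquiry stays part of the `content` value because the field is quoted.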
Now that everything is ready, let's extract the frequently used words from the inquiries!
Since the goal this time is to examine the inquiry contents as a whole, first read contacts.csv and combine everything into a single string.
The csv is placed in the same directory as the notebook file. I ran morphological analysis with MeCab over all the inquiry contents in the csv file above and output the 50 most frequent words. (The execution code is shown further down.)
An entry such as ('no', 2158) means that "no" appears 2,158 times. The frequent words that seem to carry meaning are "purchase", "use", "passport" (which refers to content within the service), "update", and so on...? But words on their own give no context, so it is hard to tell from this where users are dissatisfied. I'll change the method a little.
↓ Execution code ↓
import collections
import csv

import MeCab

# Load the csv
contents = []
with open('contacts.csv', encoding='utf-8') as f:
    reader = csv.reader(f)  # each row is a list of the form [content, created]
    next(reader)  # skip the header row ("content","created")
    for row in reader:
        contents.append(row[0])
# Join all inquiries into one string; the newline keeps the last word of one
# inquiry from fusing with the first word of the next
all_contact_text = '\n'.join(contents)

# Morphological analysis of all_contact_text with MeCab
mecab = MeCab.Tagger("-Ochasen")  # load the dictionary
node = mecab.parseToNode(all_contact_text)  # perform morphological analysis
words = []
while node:
    features = node.feature.split(",")
    hinshi = features[0]  # part of speech
    # Extract nouns (名詞) only this time; adjectives (形容詞), verbs (動詞),
    # or several parts of speech at once can also be specified
    if hinshi in ["名詞"]:
        origin = features[6]  # base form
        words.append(origin)
    node = node.next

# Count the occurrences of each word
c = collections.Counter(words)
print(c.most_common(50))  # show the 50 most frequent words
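To make the shape of that printed result concrete, here is a tiny sketch of `collections.Counter.most_common` on a made-up word list. The list `toy_words` is invented for illustration and is not taken from the real inquiry data.

```python
from collections import Counter

# Toy word list standing in for the extracted nouns (not real inquiry data).
toy_words = ["purchase", "use", "purchase", "update", "purchase", "use"]

counts = Counter(toy_words)
top_two = counts.most_common(2)  # list of (word, count) pairs, most frequent first
print(top_two)  # [('purchase', 3), ('use', 2)]
```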
Wouldn't the context be a little easier to understand if we measured frequency in combination with the neighboring words? For example, if the article title "Analyzing user dissatisfaction very easily from the content of the inquiry" is divided into morphemes, it becomes
O/Inquiry/Contents/From/user/of/Dissatisfaction/To/very much/Simple/Target/To/analysis/To do
This time, we pair the i-th morpheme with the (i+1)-th morpheme as shown below (the case of grouping by two morphemes). By the way, the "*" entries that appeared in the previous results just get in the way, so let's remove them.
O/Inquiry,
Inquiry/Contents,
Contents/From,
From/user
(The following is omitted)
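The pairing described above is a sliding window over the morpheme list, and the same idea extends to groups of three or four. As a minimal sketch, the hypothetical helper `ngrams` below (a name made up for this illustration, not part of MeCab) builds the groups for any size n.

```python
from typing import List


def ngrams(words: List[str], n: int) -> List[str]:
    """Join every run of n consecutive words with "/"."""
    return ["/".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# The first five morphemes of the example above.
morphemes = ["O", "Inquiry", "Contents", "From", "user"]

print(ngrams(morphemes, 2))
# ['O/Inquiry', 'Inquiry/Contents', 'Contents/From', 'From/user']
print(ngrams(morphemes, 3))
# ['O/Inquiry/Contents', 'Inquiry/Contents/From', 'Contents/From/user']
```

When the list is shorter than n, the range is empty and the helper simply returns an empty list.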
import collections
import csv

import MeCab

# Load the csv
contents = []
with open('contacts.csv', encoding='utf-8') as f:
    reader = csv.reader(f)  # each row is a list of the form [content, created]
    next(reader)  # skip the header row ("content","created")
    for row in reader:
        contents.append(row[0])
all_contact_text = '\n'.join(contents)  # join all inquiries into one string

# Morphological analysis of all_contact_text with MeCab
mecab = MeCab.Tagger("-Ochasen")  # load the dictionary
node = mecab.parseToNode(all_contact_text)
word_arr = []
while node:
    features = node.feature.split(",")
    hinshi = features[0]  # part of speech
    # Also extract adjectives (形容詞) and verbs (動詞) because we want context
    if hinshi in ["名詞", "形容詞", "動詞"]:
        origin = features[6]  # base form
        word_arr.append(origin)
    node = node.next

# Remove the "*" entries
word_arr_except_trash = [w for w in word_arr if w != "*"]

# When grouping by two morphemes
join_word_arr = []
for i in range(len(word_arr_except_trash) - 1):
    join_word = word_arr_except_trash[i] + "/" + word_arr_except_trash[i+1]
    join_word_arr.append(join_word)

# When grouping by three morphemes
# join_word_arr = []
# for i in range(len(word_arr_except_trash) - 2):
#     join_word = word_arr_except_trash[i] + "/" + word_arr_except_trash[i+1] + "/" + word_arr_except_trash[i+2]
#     join_word_arr.append(join_word)

# When grouping by four morphemes
# join_word_arr = []
# for i in range(len(word_arr_except_trash) - 3):
#     join_word = word_arr_except_trash[i] + "/" + word_arr_except_trash[i+1] + "/" + word_arr_except_trash[i+2] + "/" + word_arr_except_trash[i+3]
#     join_word_arr.append(join_word)

# Count the occurrences of each combination
c = collections.Counter(join_word_arr)
print(c.most_common(50))
I ran the code above for groups of two, three, and four morphemes and got the following results.
** When separating by two morphemes: **
('Purchase/To do', 599), ('Use/To do', 436), ('Registration/To do', 320), ('update/Day', 239), ('Automatic/update', 235), and so on.
The results are starting to make sense.
At the very least, someone who works on the service can now get a rough idea of where users are stumbling and sending inquiries.
** When separating by three morphemes: **
('update/To do/To be', 202), ('Automatic/update/To do', 128), ('next time/update/Day', 68), ('New/Registration/To do', 48), ('plan/Change/To do', 30), and so on.
The results have become quite specific.
Because every inflected form of "to do" is reduced to the base form "To do", 'update/To do/To be' appears to correspond to "is updated".
If you pair this with a question word, you can guess that many of the inquiries ask things like "When will it be updated?".
From this, I could read the dissatisfaction that it is hard to tell when the next update date is.
** When separating by four morphemes: **
('Automatic/update/To do/To be', 111), ('display/To do/To be/Is', 51), ('update/To do/To be/Is', 50), ('update/To do/To be/End up', 30), ('update/Day/Moon/Day', 17), and so on.
The meaningful combinations were similar to those extracted with three morphemes, but, interestingly, combinations that have nothing to do with the content of the inquiries, such as
('To do/To be/Is/of', 28), ('To do/Let/Have/Oru', 21), ('To do/To be/Is/As it is', 19)
became more numerous, and I get the impression that the data as a whole has become less precise.
From the above, grouping by three morphemes produced the most useful data, so from those results I inferred that users may be dissatisfied or inconvenienced by points such as the following:
・ It is hard to tell when the automatic update will happen
・ It is hard to understand how to register as a new user
・ It is hard to understand how to change the plan
I think that seriously applying techniques such as synonym estimation would yield data that is more accurate and easier to act on, but that requires full-scale natural language processing, so I need to study more... Also, this time I targeted the inquiries for the entire period, so changes caused by renovations to the service cannot be detected. If you narrow down the period or analyze by month, you may be able to see how users reacted before and after a release.
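The per-month analysis mentioned above could be sketched roughly as follows, assuming the `created` column always has the "YYYY-MM-DD hh:mm:ss" form shown in contacts.csv. The function `count_by_month`, its `tokenize` parameter, and the sample rows are all hypothetical and made up for this illustration; `str.split` is only a placeholder tokenizer so that the sketch runs without MeCab.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List


def count_by_month(
    rows: Iterable[dict], tokenize: Callable[[str], List[str]]
) -> Dict[str, Counter]:
    """Count tokens per calendar month, keyed by "YYYY-MM"."""
    monthly: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        month = row["created"][:7]  # "2019-05-30 17:17:31" -> "2019-05"
        monthly[month].update(tokenize(row["content"]))
    return monthly


# Tiny made-up sample in the same column layout as contacts.csv.
sample = io.StringIO('''"content","created"
"update date","2019-05-30 17:17:31"
"plan change","2019-06-01 06:51:40"
"update plan","2019-06-03 11:42:57"
''')

# str.split is a placeholder; swap in a MeCab-based tokenizer for Japanese text.
result = count_by_month(csv.DictReader(sample), str.split)
print(result["2019-06"].most_common(1))  # [('plan', 2)]
```

Comparing the top entries of two adjacent months would then give a rough view of what changed around a release.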
I think this approach can be useful for services with a large number of inquiries (it does not require complicated calculations), so I hope you'll give it a try!
Tomorrow's XTech Group Advent Calendar 2020 entries will be handled by @kohei_sasaki and @branch10480. Please look forward to them!