This is my first Qiita post. I'm glad I managed to publish it instead of throwing it away ...
I make LINE stamps as a hobby, and when I tried to make a niche stamp used only with one specific person, I thought, "If I could output a ranking of the words that come up most often in our talks, I could use it for stamp creation."
By the way, I chose Python for the simple reason that I wanted to try out the popular language.
You can export the contents of a talk as a text file by selecting [Others] ⇒ [Send Talk History] from the "≡" mark at the top right of a LINE talk.
If you simply drop the talk contents into a text file, it comes out in the following format.
sample.txt
[LINE] Talk history with 〇〇
Save date: 2020/10/19 22:31
2015/10/10 (Sat)
1:04 〇〇 Good night!
6:03 △△ Good morning!
6:33 〇〇 Good morning (*´-`)
・・・
Since the file contains a lot of unnecessary data such as dates, times, LINE names, emoticons, and spaces, delete them. The LINE names can be deleted as-is by specifying them with the replace function, but the dates and times come in various patterns, so delete them with a regular expression.
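As one way to do this, here is a minimal Python sketch. The input file name sample_raw.txt, the LINE names, and the exact patterns are assumptions based on the sample above, so adjust them for your own export.

clean_sample.py
import re

# Minimal cleanup sketch: the file names, LINE names, and patterns below
# are assumptions based on the sample export shown above.
with open('sample_raw.txt', encoding='utf-8') as f:
    text = f.read()

text = re.sub(r'^\[LINE\].*$', '', text, flags=re.M)            # header line
text = re.sub(r'^Save date:.*$', '', text, flags=re.M)          # save-date line
text = re.sub(r'^\d{4}/\d{2}/\d{2}.*$', '', text, flags=re.M)   # date lines
text = re.sub(r'^\d{1,2}:\d{2}\s*', '', text, flags=re.M)       # leading times
for name in ['〇〇', '△△']:                                      # LINE names: plain replace works
    text = text.replace(name, '')
text = re.sub(r'\([^)]*\)', '', text)                           # crude: drops emoticons like (*´-`)
text = ' '.join(text.split())                                   # collapse whitespace onto one line

with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write(text)

The cleaned file then looks like this.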
sample.txt
Good night! Good morning! Good morning! ...
This completes the data to be read by the program.
The whole program is as follows.
sample.py
import sys
from collections import Counter

import MeCab as mc
import matplotlib.pyplot as plt
import seaborn as sb

# Get the input file path from the command-line arguments
input_file_name = sys.argv[1]

# Split the text into phrase-like word units using MeCab
def mecab_analysis(text):
    m = mc.Tagger('')
    m_result = m.parse(text).splitlines()
    m_result = m_result[:-1]  # drop the trailing 'EOS' line
    # Parts of speech that may start a new unit (IPAdic tags: noun, verb,
    # prefix, adverb, interjection, adjective, adjectival noun, adnominal)
    break_pos = ['名詞', '動詞', '接頭詞', '副詞', '感動詞', '形容詞', '形容動詞', '連体詞']
    wakachi = ['']
    afterPrepos = False
    afterSahenNoun = False
    for v in m_result:
        if '\t' not in v:
            continue
        surface = v.split('\t')[0]
        pos = v.split('\t')[1].split(',')
        pos_detail = ','.join(pos[1:4])
        # Decide whether this token is glued onto the current unit
        noBreak = pos[0] not in break_pos
        noBreak = noBreak or '接尾' in pos_detail  # suffix
        noBreak = noBreak or (pos[0] == '動詞' and 'サ変接続' in pos_detail)  # sahen-connecting
        noBreak = noBreak or '非自立' in pos_detail  # non-independent
        noBreak = noBreak or afterPrepos
        noBreak = noBreak or (afterSahenNoun and pos[0] == '動詞' and pos[4] == 'サ変・スル')
        if not noBreak:
            wakachi.append('')
        wakachi[-1] += surface
        afterPrepos = pos[0] == '接頭詞'  # prefix
        afterSahenNoun = 'サ変接続' in pos_detail  # sahen-connecting noun
    if wakachi[0] == '':
        wakachi = wakachi[1:]
    return wakachi

# Display the extracted words in a figure
def show_data():
    sb.set(context='talk', font='Yu Gothic')  # one call so the context isn't reset
    fig, ax = plt.subplots(figsize=(8, 8))
    with open(input_file_name, 'r', encoding='utf-8') as f:
        text = f.read()
    words = mecab_analysis(text)
    counter = Counter(words)
    # Plot the top 10 for now
    sb.countplot(y=words, order=[i[0] for i in counter.most_common(10)])
    plt.show()

def main():
    show_data()

if __name__ == '__main__':
    main()
For the word-splitting process in def mecab_analysis(text), I referred to the following article:
Separate Japanese into phrase units [Python] [MeCab]
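As a supplementary note, the indexing in mecab_analysis (pos[0], pos[1:4], pos[4]) follows MeCab's default IPAdic output, where each token line is the surface form, a tab, and comma-separated features. A minimal sketch; the example sentence 勉強する is my own:

import MeCab

m = MeCab.Tagger('')
print(m.parse('勉強する'))
# Typical IPAdic output (one line per token, ending with 'EOS'):
#   勉強	名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー
#   する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
#   EOS
# The features are: POS, detail 1-3, conjugation type, conjugation form,
# base form, reading, pronunciation. So pos[0] is the part of speech,
# pos[1:4] are the detail fields, and pos[4] is the conjugation type,
# which is 'サ変・スル' for する attached to a sahen noun.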
Now, let's output the ranking. Prepare the data you want to rank as follows.
rank_sample.txt
Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. It's nice weather today, isn't it. I slept well today. The weather is bad today.
After that, run python [program path] [data file path] from the command line (here, python sample.py rank_sample.txt), and the output is a bar chart of the top 10 words, ordered from most to least frequent.
In this way, we were able to output the words in order from the top of the ranking.
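For reference, the ranking order comes from Counter.most_common in the standard library, which returns (word, count) pairs sorted by count in descending order. A minimal sketch with toy data:

from collections import Counter

# Toy data: most_common(n) returns the n most frequent (word, count) pairs
words = ['Good morning', 'Hello', 'Good morning', 'goodbye', 'Good morning']
print(Counter(words).most_common(2))
# [('Good morning', 3), ('Hello', 1)]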
This time I just wanted to see a ranking of frequently used words, so I assumed it would be quick, but since it was my first time with Python, my lack of knowledge made it harder than expected. Still, I finally got to touch Python, which I had been curious about for a long time, and since I hadn't written code at work in a while, I learned a lot. That said, I could hardly understand the natural language processing behind mecab, which did the heavy lifting in the frequent-word extraction; for now all I can say is "I used it." Since this is a good opportunity, I think I'll take it to study natural language processing properly.