This is my first Qiita post. I'm glad I managed to publish it instead of throwing it away ...
I make LINE stamps as a hobby, and when I tried to make a niche stamp used only with one specific person, I thought, "If I could output a ranking of the words that come up most often in our talks, I could use it for stamp creation."
By the way, I chose Python for the simple reason that I wanted to try out the popular language.
You can export the contents of a talk as a text file by selecting [Others] ⇒ [Send Talk History] from the "≡" mark at the top right of a LINE talk.
If you simply drop the talk contents into a text file, it comes out in the following format.
sample.txt
[LINE] Talk history with 〇〇
Save date: 2020/10/19 22:31
2015/10/10 (Sat)
1:04 〇〇 Good night!
6:03 △△ Good morning!
6:33 〇〇 Good morning (*´-`)
・・・
Since the file contains a lot of unnecessary data such as dates, times, LINE names, emoticons, and spaces, delete them. The LINE names can be deleted as-is by specifying them with the replace function, but the dates and times come in various patterns, so delete them with a regular expression.
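As one way to do this, here is a minimal Python sketch. The input file name sample_raw.txt, the LINE names, and the exact patterns are assumptions based on the sample above, so adjust them for your own export.

clean_sample.py
import re

# Minimal cleanup sketch: the file names, LINE names, and patterns below
# are assumptions based on the sample export shown above.
with open('sample_raw.txt', encoding='utf-8') as f:
    text = f.read()

text = re.sub(r'^\[LINE\].*$', '', text, flags=re.M)            # header line
text = re.sub(r'^Save date:.*$', '', text, flags=re.M)          # save-date line
text = re.sub(r'^\d{4}/\d{2}/\d{2}.*$', '', text, flags=re.M)   # date lines
text = re.sub(r'^\d{1,2}:\d{2}\s*', '', text, flags=re.M)       # leading times
for name in ['〇〇', '△△']:                                      # LINE names: plain replace works
    text = text.replace(name, '')
text = re.sub(r'\([^)]*\)', '', text)                           # crude: drops emoticons like (*´-`)
text = ' '.join(text.split())                                   # collapse whitespace onto one line

with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write(text)

The cleaned file then looks like this.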
sample.txt
Good night! Good morning! Good morning! ...
This completes the data to be read by the program.
The whole program is as follows.
sample.py
import sys
from collections import Counter

import MeCab as mc
import matplotlib.pyplot as plt
import seaborn as sb

# Get the input file path from the command-line arguments
input_file_name = sys.argv[1]

# Split the text into phrase-like word units using MeCab
def mecab_analysis(text):
    m = mc.Tagger('')
    m_result = m.parse(text).splitlines()
    m_result = m_result[:-1]  # drop the trailing 'EOS' line
    # Parts of speech that may start a new unit (IPAdic tags: noun, verb,
    # prefix, adverb, interjection, adjective, adjectival noun, adnominal)
    break_pos = ['名詞', '動詞', '接頭詞', '副詞', '感動詞', '形容詞', '形容動詞', '連体詞']
    wakachi = ['']
    afterPrepos = False
    afterSahenNoun = False
    for v in m_result:
        if '\t' not in v:
            continue
        surface = v.split('\t')[0]
        pos = v.split('\t')[1].split(',')
        pos_detail = ','.join(pos[1:4])
        # Decide whether this token is glued onto the current unit
        noBreak = pos[0] not in break_pos
        noBreak = noBreak or '接尾' in pos_detail  # suffix
        noBreak = noBreak or (pos[0] == '動詞' and 'サ変接続' in pos_detail)  # sahen-connecting
        noBreak = noBreak or '非自立' in pos_detail  # non-independent
        noBreak = noBreak or afterPrepos
        noBreak = noBreak or (afterSahenNoun and pos[0] == '動詞' and pos[4] == 'サ変・スル')
        if not noBreak:
            wakachi.append('')
        wakachi[-1] += surface
        afterPrepos = pos[0] == '接頭詞'  # prefix
        afterSahenNoun = 'サ変接続' in pos_detail  # sahen-connecting noun
    if wakachi[0] == '':
        wakachi = wakachi[1:]
    return wakachi

# Display the extracted words in a figure
def show_data():
    sb.set(context='talk', font='Yu Gothic')  # one call so the context isn't reset
    fig, ax = plt.subplots(figsize=(8, 8))
    with open(input_file_name, 'r', encoding='utf-8') as f:
        text = f.read()
    words = mecab_analysis(text)
    counter = Counter(words)
    # Plot the top 10 for now
    sb.countplot(y=words, order=[i[0] for i in counter.most_common(10)])
    plt.show()

def main():
    show_data()

if __name__ == '__main__':
    main()
For the word-splitting process in def mecab_analysis(text), I referred to the following article:
Separate Japanese into phrase units [Python] [MeCab]
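As a supplementary note, the indexing in mecab_analysis (pos[0], pos[1:4], pos[4]) follows MeCab's default IPAdic output, where each token line is the surface form, a tab, and comma-separated features. A minimal sketch; the example sentence 勉強する is my own:

import MeCab

m = MeCab.Tagger('')
print(m.parse('勉強する'))
# Typical IPAdic output (one line per token, ending with 'EOS'):
#   勉強	名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー
#   する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
#   EOS
# The features are: POS, detail 1-3, conjugation type, conjugation form,
# base form, reading, pronunciation. So pos[0] is the part of speech,
# pos[1:4] are the detail fields, and pos[4] is the conjugation type,
# which is 'サ変・スル' for する attached to a sahen noun.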
Now, let's output the ranking. Prepare the data you want to rank as follows.
rank_sample.txt
Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. Good morning. Hello. Goodbye. It's nice weather today, isn't it. I slept well today. The weather is bad today.
After that, run python [program path] [data file path] from the command line (here, python sample.py rank_sample.txt), and the output is a bar chart of the top 10 words, ordered from most to least frequent.
In this way, we were able to output the words in order from the top of the ranking.
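For reference, the ranking order comes from Counter.most_common in the standard library, which returns (word, count) pairs sorted by count in descending order. A minimal sketch with toy data:

from collections import Counter

# Toy data: most_common(n) returns the n most frequent (word, count) pairs
words = ['Good morning', 'Hello', 'Good morning', 'goodbye', 'Good morning']
print(Counter(words).most_common(2))
# [('Good morning', 3), ('Hello', 1)]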
This time I just wanted to see a ranking of frequently used words, so I assumed it would be quick, but since it was my first time with Python, my lack of knowledge made it harder than expected. Still, I finally got to touch Python, which I had been curious about for a long time, and since I hadn't written code at work in a while, I learned a lot. That said, I could hardly understand the natural language processing behind mecab, which did the heavy lifting in the frequent-word extraction; for now all I can say is "I used it." Since this is a good opportunity, I think I'll take it to study natural language processing properly.