Purpose

Each Twitter timeline is saved as a txt file, and the timelines for multiple users are stored in a single folder. The goal this time is to run morphological analysis on all of these files with MeCab.

Background / preparation

Get timeline

I obtained the timelines as described in the following article: [python] Get Twitter timeline for multiple users

Preparing for MeCab

For morphological analysis, I use the morphological analysis engine 'MeCab'. For installation, I referred to the article "How to use on mac mecab installation procedure".

Implementation

1. Get the list of file names in the folder as a Python list
2. A function that creates a list of timelines from a filename list
3. Morphological analysis function
4. Morphological analysis of all files in the folder

1. Get the list of file names in the folder

The folder 'timelines' contains all the txt files you want to work with. Store these file names (strings) in the list 'file_names'.

import glob

# glob.glob already returns a list of matching paths, so no loop is needed.
file_names = glob.glob("./timelines/*")

The obtained file_names has the following form.

['./timelines/20191210_user0_***.txt',..,'./timelines/20191210_user199_***.txt']
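As a variation on this step, the sketch below restricts the pattern to .txt files and sorts the result, since glob.glob returns paths in arbitrary order. The helper name list_timeline_files and the "*.txt" pattern are assumptions made for illustration and are not part of the original code.

```python
import glob
import os


def list_timeline_files(folder="./timelines"):
    """Return the .txt paths found in `folder`, sorted for a reproducible order."""
    # Assumption: every timeline file ends in ".txt".
    pattern = os.path.join(folder, "*.txt")
    return sorted(glob.glob(pattern))
```

Sorting is plain string sorting, so a name containing user10 comes before one containing user2. That is enough when the only requirement is that repeated runs see the files in the same order.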

2. A function that creates a list of timelines from a filename list

`timelines.py`



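def timelines(file_list):
    # Read every file and return a list of one-element lists, one per file.
    result = []
    for file in file_list:
        # The with statement closes the file handle that was actually read.
        # encoding="utf-8" is an assumption about how the timelines were saved.
        with open(file, encoding="utf-8") as f:
            text = f.read()
        result.append([text])
    return result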

3. Morphological analysis function

Define a function for morphological analysis. It takes a string as its argument and returns a list of morphological analysis results, one entry per morpheme.

`mecab_list.py`


import MeCab

def mecab_list(text):
    tagger = MeCab.Tagger("-Ochasen")
    # Parsing an empty string first is a common workaround that keeps node.surface readable.
    tagger.parse('')
    node = tagger.parseToNode(text)
    mecab_output = []
    while node:
        word = node.surface
        wclass = node.feature.split(',')
        if wclass[0] != 'BOS/EOS':
            # split() never yields None, so check the length instead:
            # the base form (index 6) may be missing for some entries.
            base = wclass[6] if len(wclass) > 6 else ""
            mecab_output.append([word, wclass[0], wclass[1], wclass[2], base])
        node = node.next
    return mecab_output

Let's check that the 'mecab_list' function works.


print(mecab_list('I often eat cats that I started keeping yesterday.'))
'''
result
[['yesterday', 'noun', 'Adverbs possible', '*', 'yesterday'], ['Domestication', 'verb', 'Independence', '*', 'keep'], ['Begin', 'verb', 'Non-independence', '*', 'begin'], ['Ta', 'auxiliary verb', '*', '*', 'Ta'], ['cat', 'noun', 'General', '*', 'cat'], ['Is', 'Particle', 'binding particle', '*', 'Is'], ['Often', 'adverb', 'General', '*', 'Often'], ['eat', 'verb', 'Independence', '*', 'eat'], ['。', 'symbol', 'Kuten', '*', '。']]
'''

There seems to be no problem.
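To make the indices used in mecab_list easier to follow, here is a minimal sketch that works on a feature string alone, without MeCab. It assumes the IPAdic layout, in which the comma-separated fields are part of speech, three subcategory fields, conjugation type, conjugation form, base form, reading, and pronunciation. The helper name feature_to_row and the sample feature string are made up for illustration.

```python
def feature_to_row(surface, feature):
    """Build [surface, pos, pos_detail1, pos_detail2, base_form] from a feature string."""
    fields = feature.split(",")
    # Assumption: IPAdic-style layout, where index 6 holds the base form.
    # Pad short feature strings so that missing fields become "".
    fields += [""] * (7 - len(fields))
    return [surface, fields[0], fields[1], fields[2], fields[6]]


row = feature_to_row("猫", "名詞,一般,*,*,*,*,猫,ネコ,ネコ")
print(row)  # ['猫', '名詞', '一般', '*', '猫']
```

Padding keeps the row at five elements even when a dictionary returns fewer fields than expected, which is the case the base-form check in mecab_list is meant to cover.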

4. Morphological analysis of all files in the folder

mecab_results_list=[]
the_timelines=timelines(file_names)

for the_timeline in the_timelines:
    mecab_result=[]
    for twt in the_timeline:
        mecab_result.append(mecab_list(twt))
    mecab_results_list.append(mecab_result)

print(mecab_results_list)
#result
[[[['ｗ', 'symbol', 'Alphabet', '*', 'ｗ'], ['yet', 'adverb', 'Particle connection', '*', 'yet'], ['Sub', 'noun', 'proper noun', 'area', 'Sub'], ['seed', 'noun', 'suffix', 'General', 'seed'], ['？', 'symbol', 'General', '*', '？'], ['But', 'Particle', 'case particle', 'General', 'But'],..]]]

I got the result I wanted.
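In mecab_results_list, a result is tied to its source file only by position. One possible alternative, sketched below, keys the results by file name. analyze_files is a hypothetical helper, the analyzer is passed in as an argument (mecab_list from step 3 would fit), and UTF-8 encoding is an assumption.

```python
def analyze_files(file_names, analyze):
    """Map each file name to the result of analyze(text) for that file."""
    results = {}
    for name in file_names:
        # Assumption: the timeline files are UTF-8 encoded.
        with open(name, encoding="utf-8") as f:
            results[name] = analyze(f.read())
    return results
```

For example, analyze_files(file_names, mecab_list) would return a dict whose keys are the paths collected in step 1, so the result for a given user can be looked up directly.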

Environment

macOS Catalina, Jupyter Notebook

[python] Decompose the acquired Twitter timeline into morphemes with MeCab