Each Twitter timeline is saved as a txt file, and the timelines of multiple users are stored in one folder. The goal this time is to run morphological analysis on all of these files with MeCab.
I collected the timelines as described in this article: [python] Get Twitter timeline for multiple users
For morphological analysis, I use the morphological analysis engine 'MeCab'. For how to install and use it on a Mac, I referred to: How to use on mac mecab installation procedure
Get the file names in the folder as a Python list
A function that creates a list of timelines from a filename list
Morphological analysis function
Morphological analysis of all files in the folder
The folder 'timelines' contains all the txt files you want to work with. Store their file names (strings) in the list 'file_names'.
import glob
file_names = []
files = glob.glob("./timelines/*")
for file in files:
    file_names.append(file)
The obtained file_names has the following form.
['./timelines/20191210_user0_***.txt',..,'./timelines/20191210_user199_***.txt']
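As a possible variant, the same list can be built with pathlib, restricted to txt files and sorted so that the order is reproducible. This is only a sketch: the helper name `list_timeline_files` is made up for illustration and is not part of the original code.

```python
from pathlib import Path


def list_timeline_files(folder="./timelines"):
    """Return the paths of all .txt files in `folder` as sorted strings.

    `list_timeline_files` is a hypothetical helper, not part of the original post.
    """
    # glob("*.txt") skips anything that is not a txt file (e.g. a stray csv or image)
    return sorted(str(path) for path in Path(folder).glob("*.txt"))


file_names = list_timeline_files()
print(file_names)
```

Note that pathlib normalizes the leading `./`, so the strings start with `timelines/` rather than `./timelines/`. They can still be passed to `open()` in the same way.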
timelines.py
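def timelines(file_list):
    timelines = []
    for file in file_list:
        # Open each file once; the with-statement closes it automatically
        with open(file) as f:
            text = f.read()
        # Each timeline is stored as a one-element list holding the whole text
        timelines.append([text])
    return timelines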
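The function above stores the whole text of each file as a single-element list. If each line of a file holds one tweet (an assumption; the file layout is not described here), a variant could split the text into lines so that each tweet is analyzed separately. The name `timelines_by_line` and the UTF-8 default are also assumptions made for this sketch.

```python
def timelines_by_line(file_list, encoding="utf-8"):
    """Read each file and return one list of non-empty lines per file.

    Hypothetical variant of `timelines`; assumes one tweet per line and
    UTF-8 encoded files, neither of which is confirmed by the original post.
    """
    result = []
    for file in file_list:
        # The with-statement closes the file even if reading fails
        with open(file, encoding=encoding) as f:
            lines = [line.strip() for line in f if line.strip()]
        result.append(lines)
    return result
```

With this variant, the inner loop of the final step would run once per tweet instead of once per file.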
Define a function for morphological analysis. It takes a character string as its argument and returns a list of morphological analysis results.
mecab_list.py
import MeCab

def mecab_list(text):
    tagger = MeCab.Tagger("-Ochasen")
    tagger.parse('')  # parse once first; a common workaround so node.surface is read correctly
    node = tagger.parseToNode(text)
    mecab_output = []
    while node:
        word = node.surface
        wclass = node.feature.split(',')
        if wclass[0] != 'BOS/EOS':
            # split() never yields None, so check the length instead;
            # fall back to "" when the field at index 6 is missing
            base = wclass[6] if len(wclass) > 6 else ""
            mecab_output.append([word, wclass[0], wclass[1], wclass[2], base])
        node = node.next
    return mecab_output
Let's check that the 'mecab_list' function works.
print(mecab_list('The cat I started keeping yesterday eats well.'))  # the original input was a Japanese sentence, shown here in translation
#result
[['yesterday', 'noun', 'Adverbs possible', '*', 'yesterday'], ['Domestication', 'verb', 'Independence', '*', 'keep'], ['Begin', 'verb', 'Non-independence', '*', 'Begin'], ['Ta', 'Auxiliary verb', '*', '*', 'Ta'], ['cat', 'noun', 'General', '*', 'cat'], ['Is', 'Particle', 'Binding particle', '*', 'Is'], ['Often', 'adverb', 'General', '*', 'Often'], ['eat', 'verb', 'Independence', '*', 'eat'], ['。', 'symbol', 'Kuten', '*', '。']]
There seems to be no problem.
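Because each entry has the form [surface, part of speech, subcategory 1, subcategory 2, base form], the result is easy to filter. The sketch below picks out the base forms of all entries with a given part of speech. The helper name `base_forms` is made up for illustration, and the sample data is hand-written in the shape shown above rather than real MeCab output. With a Japanese dictionary such as IPADIC, the tags are Japanese strings (for example '名詞' for noun), so the `pos` argument would need to match those.

```python
def base_forms(mecab_output, pos):
    """Return the base forms of all entries whose part of speech equals `pos`.

    Each entry is expected to look like
    [surface, part_of_speech, subcategory1, subcategory2, base_form].
    """
    return [entry[4] for entry in mecab_output if entry[1] == pos]


# Hand-written sample in the same shape as the output above (not real MeCab output)
sample = [
    ['yesterday', 'noun', 'Adverbs possible', '*', 'yesterday'],
    ['cat', 'noun', 'General', '*', 'cat'],
    ['eat', 'verb', 'Independence', '*', 'eat'],
]
print(base_forms(sample, 'noun'))  # ['yesterday', 'cat']
```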
mecab_results_list = []
the_timelines = timelines(file_names)
for the_timeline in the_timelines:
    mecab_result = []
    for twt in the_timeline:
        mecab_result.append(mecab_list(twt))
    mecab_results_list.append(mecab_result)
print(mecab_results_list)
#result
[[[['w', 'symbol', 'Alphabet', '*', 'w'], ['yet', 'adverb', 'Particle connection', '*', 'yet'], ['Sub', 'noun', 'Proper noun', 'area', 'Sub'], ['seed', 'noun', 'suffix', 'General', 'seed'], ['?', 'symbol', 'General', '*', '?'], ['But', 'Particle', 'Case particle', 'General', 'But'],..,]]]]
I got the result I wanted.
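mecab_results_list keeps the results in the same order as file_names, but it does not record which file each result came from. One possible follow-up, sketched here, pairs the two in a dictionary. `results_by_file` is a hypothetical helper and the sample values are placeholders, not real analysis results.

```python
def results_by_file(file_names, mecab_results_list):
    """Map each file name to its analysis result.

    Relies on both lists having the same order and length, which holds when
    the results were produced by looping over `file_names` in order.
    """
    if len(file_names) != len(mecab_results_list):
        raise ValueError("file_names and mecab_results_list differ in length")
    return dict(zip(file_names, mecab_results_list))


# Placeholder data for illustration only
names = ['a.txt', 'b.txt']
results = [[[['cat', 'noun', 'General', '*', 'cat']]], [[]]]
paired = results_by_file(names, results)
print(paired['a.txt'])  # [[['cat', 'noun', 'General', '*', 'cat']]]
```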
Environment: macOS Catalina, Jupyter notebook