I came across this while playing around with MeCab and thought it was nice, so here is a note.
Whether the input is plain text or CSV, I think you fairly often want to write code that counts how many times each element occurs in a list containing duplicates. If you implement it straightforwardly using a dictionary:
data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
word_and_counts = {}
for word in data:
    if word_and_counts.has_key(word):
        word_and_counts[word] += 1
    else:
        word_and_counts[word] = 1
for w, c in sorted(word_and_counts.iteritems(), key=lambda x: x[1], reverse=True):
    print w, c  # =>
# aaa 2
# bbb 1
# ccc 1
# ddd 1
I think it ends up looking something like the above.
This is where the collections module comes in handy, so let's reimplement it using collections.Counter.
from collections import Counter
data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
counter = Counter(data)
for word, cnt in counter.most_common():
    print word, cnt  # =>
# aaa 2
# bbb 1
# ccc 1
# ddd 1
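For reference, here is a minimal sketch of the same two approaches written for Python 3 (an assumption; the listings in this note are Python 2). `has_key` and `iteritems` no longer exist there, so the dictionary version uses `dict.get` and `items()` instead. The variable names `by_hand` and `by_counter` are only for illustration.

```python
from collections import Counter

data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']

# Plain-dictionary version: get() returns 0 for a word not seen yet.
word_and_counts = {}
for word in data:
    word_and_counts[word] = word_and_counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, highest first.
by_hand = sorted(word_and_counts.items(), key=lambda x: x[1], reverse=True)

# Counter version: counting and sorting in two calls.
by_counter = Counter(data).most_common()

for word, cnt in by_counter:
    print(word, cnt)
```

Both lists hold the same (word, count) pairs, with 'aaa' first because it is the only word that occurs twice.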
That made the implementation concise. It is probably fast as well, since Counter ships with the standard library. Counter also supports various operators and has other convenient methods.
from collections import Counter
dataA = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
dataB = ['aaa', 'bbb', 'bbb', 'bbb', 'abc']
counterA = Counter(dataA)
counterB = Counter(dataB)
counter = counterA + counterB  # Frequencies can be added together
counterA.subtract(counterB)  # Subtract counts element by element (destructive: modifies counterA in place)
counter.most_common(3)  # Get the top 3 elements (if the argument n is omitted, as in the example above, all elements are returned in descending order of count)
# ...and several others
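To make the difference between these operations concrete, here is a small sketch, assuming Python 3, that reuses the same two lists. The names `total`, `diff`, `common`, `either`, and `signed` are made up for this example. Note that the `-` operator drops zero and negative results, while the `subtract()` method keeps them.

```python
from collections import Counter

counter_a = Counter(['aaa', 'bbb', 'ccc', 'aaa', 'ddd'])
counter_b = Counter(['aaa', 'bbb', 'bbb', 'bbb', 'abc'])

total = counter_a + counter_b   # add counts per element
diff = counter_a - counter_b    # subtract, keeping only positive counts
common = counter_a & counter_b  # intersection: minimum of each count
either = counter_a | counter_b  # union: maximum of each count

# subtract() works in place and keeps zero and negative counts,
# so apply it to a copy to leave counter_a untouched.
signed = Counter(counter_a)
signed.subtract(counter_b)

print(total.most_common(2))  # the two most frequent elements
print(dict(diff))
print(dict(signed))
```

Here 'bbb' disappears from `diff` because its count would be negative, whereas `signed` records it as -2.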
Any hashable object can be counted, so there may well be other good uses.
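As one possible use, tuples are hashable, so pairs of adjacent words can be counted directly. This is only a sketch assuming Python 3; the word list and the names `bigrams` and `unhashable_rejected` are invented for illustration.

```python
from collections import Counter

words = ['a', 'b', 'a', 'b', 'c']

# zip(words, words[1:]) yields each pair of neighbouring words as a tuple.
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(1))

# Lists are mutable and therefore not hashable, so they cannot be counted.
try:
    Counter([['a', 'b']])
    unhashable_rejected = False
except TypeError:
    unhashable_rejected = True
print(unhashable_rejected)
```

If the items you want to count are lists, converting each one to a tuple first is usually enough.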
The collections module also has several other classes that look useful, so I think it is worth reading through its documentation once.
Finally, here is the code in which I used Counter to try MeCab on my downloaded Twitter tweet history.
# -*- coding: utf-8 -*-
from collections import Counter
import codecs
import json
import MeCab
# This feels like a bad practice, but I want to be able to redirect the output
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
# codecs returns unicode
# The first line of the file has an extra declaration; this is throwaway test code and handling it is a hassle, so delete that line by hand beforehand
_tweetfile = codecs.open('./data/js/tweets/2013_09.js', 'r', 'sjis')
tweets = json.load(_tweetfile)
# Encode because MeCab only accepts the str type
texts = (tw['text'].encode('utf-8') for tw in tweets)
tagger = MeCab.Tagger('-Ochasen')
counter = Counter()
for text in texts:
    nodes = tagger.parseToNode(text)
    while nodes:
        # The first field of the feature string is the part of speech;
        # with a Japanese dictionary the tag for a noun is '名詞'
        if nodes.feature.split(',')[0] == '名詞':
            word = nodes.surface.decode('utf-8')
            counter[word] += 1
        nodes = nodes.next
for word, cnt in counter.most_common():
    print word, cnt
The part that decides whether a token is a noun is crude, and symbols get mixed into the result, but it works well enough for now. I'm happy.
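One possible way to keep symbols out of the counts, sketched here in Python 3 and not part of the script above, is to skip any surface string that contains no letter or digit at all. The helper `looks_like_word` and the sample `surfaces` list are hypothetical; the list is hand-written and is not real MeCab output. This only filters on the characters of the token and does nothing to improve the part-of-speech check itself.

```python
import unicodedata
from collections import Counter


def looks_like_word(surface):
    """Return True if at least one character is a letter or a digit.

    Unicode general categories starting with 'L' are letters (kana and
    kanji included) and those starting with 'N' are numbers.
    """
    return any(unicodedata.category(ch)[0] in ('L', 'N') for ch in surface)


# Hand-written sample tokens (an assumption, for illustration only).
surfaces = ['Python', '!!', 'コード', '(', '2013', '…', 'Python']

counter = Counter(s for s in surfaces if looks_like_word(s))
for word, cnt in counter.most_common():
    print(word, cnt)
```

In the Python 2 script the same check would have to run on the decoded unicode value of `nodes.surface`, not on the raw byte string.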
I have collected tricks like this in another post, so take a look if you like: "Frequent idioms that make Python code a little cleaner just by remembering them".