If you want to count words in Python, it's convenient to use Counter.

I came across this while playing with MeCab and thought it was handy, so here is a note.

Whether the source is text or CSV, every so often you want to write code that counts how often each element occurs in a list that has duplicates. A straightforward implementation with a dictionary looks like this:


data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']

word_and_counts = {}
for word in data:
    if word in word_and_counts:
        word_and_counts[word] += 1
    else:
        word_and_counts[word] = 1
for w, c in sorted(word_and_counts.iteritems(), key=lambda x: x[1], reverse=True):
    print w, c  # =>
                #   aaa 2
                #   bbb 1
                #   ccc 1
                #   ddd 1

That is roughly what you end up with.

This is where the collections module comes in handy. Let's reimplement it using collections.Counter.

from collections import Counter

data = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
counter = Counter(data)
for word, cnt in counter.most_common():
    print word, cnt # =>
                    #   aaa 2
                    #   bbb 1
                    #   ccc 1
                    #   ddd 1

That is much more concise. It is probably fast as well, since Counter ships with the standard library. Counter also supports several operators and other convenient methods.

from collections import Counter

dataA = ['aaa', 'bbb', 'ccc', 'aaa', 'ddd']
dataB = ['aaa', 'bbb', 'bbb', 'bbb', 'abc']

counterA = Counter(dataA)
counterB = Counter(dataB)

counter = counterA + counterB  # Counts can be added together
counterA.subtract(counterB)  # Subtract counts element by element (destructive: modifies counterA in place)
counter.most_common(3)  # Get the top 3 elements (if the argument n is omitted, as in the example above, all elements are returned in descending order of count)
# ...and several others
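
To make the effect of these operations concrete, here is a small sketch with the resulting counts written out. It is written for Python 3, so `print` is a function, unlike the Python 2 listings in this post. It also shows the `-` operator for comparison, which the listing above does not use.

```python
from collections import Counter

counter_a = Counter(['aaa', 'bbb', 'ccc', 'aaa', 'ddd'])
counter_b = Counter(['aaa', 'bbb', 'bbb', 'bbb', 'abc'])

# Addition sums the counts key by key.
total = counter_a + counter_b
print(total['aaa'], total['bbb'], total['abc'])  # 3 4 1

# most_common(n) returns (element, count) pairs, highest count first.
print(total.most_common(2))  # [('bbb', 4), ('aaa', 3)]

# The - operator returns a new Counter and drops counts that are zero or negative.
diff = counter_a - counter_b
print(sorted(diff.items()))  # [('aaa', 1), ('ccc', 1), ('ddd', 1)]

# subtract() works in place and keeps zero and negative counts.
counter_a.subtract(counter_b)
print(counter_a['bbb'], counter_a['abc'])  # -2 -1
```

The difference matters when negative counts are meaningful: `subtract()` keeps them, while `-` silently discards them.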

Any hashable object can be counted, so there are probably other good uses as well.
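
As one illustration of counting things other than plain words, the sketch below counts tuples (adjacent word pairs) and single characters. The sample data is made up for this example, and the code is written for Python 3.

```python
from collections import Counter

words = ['the', 'cat', 'saw', 'the', 'cat', 'run']

# Pair each word with its successor; tuples are hashable, so they can be counted.
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]

# A string is an iterable of characters, so Counter counts characters directly.
letters = Counter('banana')
print(letters['a'], letters['n'], letters['b'])  # 3 2 1
```

Unhashable objects such as lists cannot be counted this way; converting them to tuples first is one way around that.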

The collections module has several other classes that look useful too, so it is worth reading through its documentation once.

Finally, here is the code I used to try MeCab on my downloaded Twitter tweet history, using Counter.

# -*- coding: utf-8 -*-

from collections import Counter
import codecs
import json

import MeCab


# This feels like bad practice, but I want to be able to redirect the output
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# codecs.open returns unicode
# The file has an extra line of non-JSON text at the top; this is throwaway test code, so delete that line by hand beforehand
_tweetfile = codecs.open('./data/js/tweets/2013_09.js', 'r', 'sjis')
tweets = json.load(_tweetfile)
# Encode because MeCab only accepts the str type
texts = (tw['text'].encode('utf-8') for tw in tweets)

tagger = MeCab.Tagger('-Ochasen')
counter = Counter()
for text in texts:
    nodes = tagger.parseToNode(text)
    while nodes:
        if nodes.feature.split(',')[0] == '名詞':  # '名詞' (noun) is the tag emitted by the IPA dictionary
            word = nodes.surface.decode('utf-8')
            counter[word] += 1
        nodes = nodes.next
for word, cnt in counter.most_common():
    print word, cnt

The noun check is crude, and symbols slip into the results, but it works well enough for now. I'm happy with it.
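
One possible way to keep symbols out of the counts is to drop tokens that consist only of punctuation or symbol characters before counting. The sketch below is not what the script above does; it is a minimal illustration, assuming Python 3, with a hypothetical helper named `is_symbol_only` and a made-up list standing in for the surface forms MeCab would return.

```python
import unicodedata
from collections import Counter


def is_symbol_only(word):
    """Return True if every character is Unicode punctuation (P*) or a symbol (S*)."""
    return bool(word) and all(
        unicodedata.category(ch)[0] in ('P', 'S') for ch in word
    )


# Stand-in for the surface forms that the MeCab loop would produce.
surfaces = ['Python', '。', 'Counter', '!!', 'Python', '#', 'C++']

# Count only the tokens that are not made up entirely of symbols.
counter = Counter(w for w in surfaces if not is_symbol_only(w))
print(counter.most_common(1))  # [('Python', 2)]
```

A token such as 'C++' survives because it contains a letter; only tokens made up entirely of punctuation or symbols are discarded.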

I've collected tricks like this in another post, so take a look if you like: (Common idioms that make Python code a little cleaner just by knowing them)
