Output result

Overview

This article shows how to make Japanese text display correctly in the plot functions (graph output) of NLTK (Natural Language Toolkit). The O'Reilly book "Introduction to Natural Language Processing" ([-> English version [free]](http://www.nltk.org/book/)) says in its chapter "Japanese Natural Language Processing with Python": "However, note that Japanese characters are garbled by default in matplotlib." I couldn't find a solution, so I worked one out myself.

Prerequisite knowledge

-> Japanese natural language processing with Python

Environment

LinuxMint13(Ubuntu12.04)

Code

`NLTK Japanese plot.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding('UTF-8')

import MeCab
import nltk
from numpy import *
from nltk.corpus.reader import *
from nltk.corpus.reader.util import *
from nltk.text import Text
import jptokenizer

### Set matplotlib's default font ### ← Point 1: Explicitly specify a Japanese font
import matplotlib
import matplotlib.font_manager as font_manager
# Path to the TTF (font) file
font_path = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf'
# Load the font's properties from that file
font_prop = font_manager.FontProperties(fname = font_path)
# Look up the font name and set it as matplotlib's default font
matplotlib.rcParams['font.family'] = font_prop.get_name()

### Create the Japanese corpus (unicode) ### ← Point 2: Words are handled as unicode
# Load the corpus
jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^　「」！？。]*[！？。]')
reader = PlaintextCorpusReader("/home/User/desktop", r'NKMK.txt',
                                encoding='utf-8',
                                para_block_reader=read_line_block,
                                sent_tokenizer=jp_sent_tokenizer,
                                word_tokenizer=jptokenizer.JPMeCabTokenizer())
# Get the words from the corpus as unicode strings
nkmk = Text(unicode(w) for w in reader.words())

### Draw the plot ### ← Point 3: Arguments are also given as unicode
nkmk.dispersion_plot([u'Nico',u'Maki',u'Here',u'Heart'])
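The sentence pattern passed to nltk.RegexpTokenizer in the script above can be tried on its own with the standard re module. This is only a sketch: as far as I know, RegexpTokenizer in its default (non-gap) mode returns the matches of the pattern, which is what findall does here. The sample text is made up for illustration and is not from the NKMK.txt corpus.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as jp_sent_tokenizer in the script; \u3000 is the full-width space.
SENTENCE_PATTERN = re.compile(u'[^\u3000「」！？。]*[！？。]')

# Hypothetical sample text (not taken from the corpus used in this article).
sample = u'今日は晴れ。明日は雨？'

# Each match is a run of non-delimiter characters ending in 。, ！ or ？.
sentences = SENTENCE_PATTERN.findall(sample)

print(len(sentences))  # 2
```

Because the opening and closing quotes 「」 are excluded from the character class, a quoted sentence is returned without its brackets.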

Commentary

(See comments in the source)
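Point 1 depends on the font file actually existing at the given path, which varies by distribution. The following is a minimal sketch of a more defensive version, not a definitive implementation. The helper names pick_font_path and use_font_as_default and the second candidate path are hypothetical; only the first path comes from the script above. The addfont call is guarded because, as far as I know, it only exists in newer matplotlib versions.

```python
# -*- coding: utf-8 -*-
import os


def pick_font_path(candidates):
    """Return the first path in `candidates` that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def use_font_as_default(font_path):
    """Make the font at `font_path` matplotlib's default family; return its name."""
    # Imported here so the path check above works without matplotlib installed.
    import matplotlib
    import matplotlib.font_manager as font_manager

    # Register the font file where the installed matplotlib supports it.
    if hasattr(font_manager.fontManager, 'addfont'):
        font_manager.fontManager.addfont(font_path)
    name = font_manager.FontProperties(fname=font_path).get_name()
    matplotlib.rcParams['font.family'] = name
    return name


# The first entry is the path used in this article; the second is only
# an illustrative placeholder (assumption).
CANDIDATES = [
    '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',
    '/path/to/another/japanese-font.ttf',
]

found = pick_font_path(CANDIDATES)
print(found if found else 'no candidate font file found')
```

If found is not None, calling use_font_as_default(found) before plotting would have the same effect as the font lines in the script.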

Open issue

The labels drawn by ConditionalFreqDist.plot() cannot be displayed in Japanese. If you read /usr/local/lib/python2.7/dist-packages/nltk/probability.py, you will find "kwargs['label'] = str(condition)" (line 1790). In other words, the label string is passed through str(), so Japanese always comes out garbled. The fix is to change that line to "kwargs['label'] = unicode(condition)". Similar cases elsewhere will probably require the same kind of modification to the library.
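To see why str() is the problem, here is a small sketch that runs without NLTK. With the default encoding forced to UTF-8, as the script in this article does, Python 2's str() on a unicode string amounts to encoding it to UTF-8 bytes. The latin-1 decoding step is my own assumption, used only to imitate a consumer that treats every byte as one character; it is not taken from matplotlib's source.

```python
# -*- coding: utf-8 -*-
# The three characters of "nihongo" (Japanese), written as escapes.
condition = u'\u65e5\u672c\u8a9e'

# What str(condition) effectively produces under a UTF-8 default encoding.
as_bytes = condition.encode('utf-8')

# Assumption for illustration: each byte is interpreted as one character.
garbled = as_bytes.decode('latin-1')

print(len(condition))        # 3
print(len(as_bytes))         # 9
print(garbled == condition)  # False
```

The three characters turn into nine unrelated ones, while keeping the value as unicode leaves the label intact.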

[Before correction]

[Revised]

Reference site

-> About Japanese in Matplotlib
-> How to output Japanese with plot() of nltk.FreqDist and nltk.ConditionalFreqDist - (Mainly) Programming memo

How to use Japanese with NLTK plot