100 Language Processing Knock-38 (using pandas): Histogram

This is my record of problem 38, "Histogram", from "Chapter 4: Morphological analysis" of the 100 Language Processing Knocks 2015 (.ac.jp/nlp100/#ch4). It is easy once you have gotten past the "tofu" (the empty boxes drawn in place of Japanese characters when the plot font lacks them) that came up in the previous knock. If you do not display labels, you do not have to deal with "tofu" at all.

Reference link

Link	Remarks
038.histogram.ipynb	Answer program GitHub link
100 amateur language processing knocks:38	Source from which many parts of the code were copied
MeCab Official	The first MeCab page to look at

Environment

Type	Version	Contents
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	I use Python 3.8.1 on pyenv. Packages are managed with venv
MeCab	0.996-5	Installed with apt-get

In addition to the environment above, I use the following Python packages. They can be installed with plain pip.

Type	Version
matplotlib	3.1.3
pandas	1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

38. Histogram

Draw a histogram of word occurrence frequencies (the horizontal axis is the frequency of occurrence, and the vertical axis is the number of distinct words that occur with that frequency, drawn as a bar graph).
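
To make the two axes concrete, here is a minimal sketch that uses only the standard library and a made-up token list (not data from the novel). It first counts how often each word occurs, then counts how many distinct words share each occurrence count. That second tally is what the bars of the histogram represent.

```python
from collections import Counter

# Hypothetical toy tokens, standing in for the words read from neko.txt.mecab.
tokens = ["cat", "is", "cat", "name", "is", "cat", "yet", "no"]

# Step 1: frequency of occurrence of each word.
word_freq = Counter(tokens)

# Step 2: for each frequency, the number of distinct words that have it.
types_per_freq = Counter(word_freq.values())

for freq in sorted(types_per_freq):
    print(f"frequency {freq}: {types_per_freq[freq]} word type(s)")
```

Here three words occur once, one word occurs twice, and one word occurs three times, so the bar at frequency 1 would be the tallest.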

Answer

Answer program: [038. Histogram.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/038.%E3%83%92%E3%82%B9%E3%83%88%E3%82%B0%E3%83%A9%E3%83%A0.ipynb)

import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['font.family'] = 'IPAexGothic'

def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes its tags in Japanese: '空白' = blank, '記号' = symbol.
    # Drop those rows and the EOS markers so that only words remain.
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()

hist = df['surface'].value_counts().plot.hist(bins=20, range=(1, 20))
hist.set_xlabel('Frequency of appearance')
hist.set_ylabel('Number of word types')
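
The `sep='\t|,'` pattern in `read_text` works because each MeCab output line consists of the surface form, a tab, and then a comma-separated feature list. The sketch below applies the same arguments to a small in-memory sample instead of the real file. The sample lines are hand-written in the IPA dictionary layout, which is an assumption about how `neko.txt.mecab` was generated, and `skiprows`/`skipfooter` are left out because the sample has no leading or trailing lines to discard.

```python
import io

import pandas as pd

# Hypothetical sample in MeCab's default (IPA dictionary) output layout:
# surface<TAB>pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
SAMPLE = "\n".join([
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ",
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ",
    "で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ",
    "ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル",
    "。\t記号,句点,*,*,*,*,。,。,。",
    "EOS",
])

# A regex separator (tab OR comma) requires the Python parsing engine.
df = pd.read_table(io.StringIO(SAMPLE), sep=r'\t|,', header=None,
                   usecols=[0, 1, 2, 7],
                   names=['surface', 'pos', 'pos1', 'base'],
                   engine='python')

# The EOS line has a single field, so its other columns are NaN.
# Filter it out together with symbols ('記号'), as read_text does.
words = df[(df['surface'] != 'EOS') & (df['pos'] != '記号')]
print(words)
```

One limitation of splitting on every comma is that a token whose surface form is itself an ASCII comma would be split in the wrong place, so this approach suits text like this sample, where the punctuation is Japanese.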

Answer commentary

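This just uses the plotting feature built into pandas: `value_counts()` gives the frequency of each word, and `plot.hist` draws the histogram of those frequencies. I also added axis labels.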

hist = df['surface'].value_counts().plot.hist(bins=20, range=(1, 20))
hist.set_xlabel('Frequency of appearance')
hist.set_ylabel('Number of word types')
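
As a rough check of what this plot counts, the sketch below rebuilds the same bins with `numpy.histogram` on a hypothetical token list. With `bins=20` and `range=(1, 20)` the bin edges are 0.95 apart, so each integer frequency from 1 to 20 falls into its own bin, and words that occur more than 20 times lie outside the range and are left out. Using `numpy.histogram` here is an assumption made for illustration: it mirrors the binning rather than reproducing the plotting call itself.

```python
import numpy as np
import pandas as pd

# Hypothetical tokens: one very frequent word plus a few rare ones.
tokens = ['の'] * 25 + ['猫'] * 3 + ['は'] * 3 + ['吾輩', '名前', '無い']

# Frequency of each distinct word (the same first step as in the answer).
freq = pd.Series(tokens).value_counts()

# Same bin settings as the plot: 20 equal-width bins between 1 and 20.
counts, edges = np.histogram(freq.to_numpy(), bins=20, range=(1, 20))

# Print only the non-empty bins.
for left, n in zip(edges[:-1], counts):
    if n:
        print(f"bin starting at {left:.2f}: {n} word type(s)")
```

In this toy data the word that occurs 25 times is not counted in any bin, which is why the bars add up to 5 word types instead of 6.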

Output result (execution result)

When the program is executed, the following results will be output. Well, it looks like this.
