This article is a sequel to my Introduction to Python with 100 Knocks. This time I will explain Chapter 4 of the 100 knocks.
First, install the morphological analyzer MeCab, download neko.txt, run morphological analysis on it, and check the contents.
$ mecab < neko.txt > neko.txt.mecab
"I am a cat" from Aozora Bunko.
MeCab's default part-of-speech system is close to school grammar, except that adjectival verbs are treated as noun + auxiliary verb and sa-hen verbs as noun + verb. The output format is roughly as follows:
Surface form\tPart of speech,POS subcategory 1,POS subcategory 2,POS subcategory 3,Conjugation type,Conjugated form,Base form,Reading,Pronunciation
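For example, with the default IPA dictionary, the line for 吾輩 should look something like this (shown here for illustration):
吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ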
Also, sentences are delimited by `EOS`. By the way, many morphological analyzers assume full-width characters, so when analyzing web text it is better to convert half-width characters to full-width first.
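As an aside, here is a minimal sketch of one way to do that conversion, assuming plain ASCII input (half-width ASCII U+0021–U+007E maps to its full-width form at a fixed offset of 0xFEE0; dedicated libraries such as mojimoji also exist):

# Translation table: half-width ASCII -> full-width forms
HALF2FULL = {i: i + 0xFEE0 for i in range(0x21, 0x7F)}
HALF2FULL[0x20] = 0x3000  # half-width space -> ideographic space

print('abc 123'.translate(HALF2FULL))  # -> 'ａｂｃ　１２３'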
itertools
This is a good opportunity, so let's get acquainted with the `itertools` module. It provides functions that create convenient iterators.
islice()
It already appeared in Chapter 2. `islice(iterable, start, stop, step)` slices an iterator. If `step` is omitted it defaults to 1, and if `start` is omitted it defaults to 0. `stop=None` means "until the end". It's very convenient, so let's use it.
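A minimal illustration (note that slicing consumes the iterator):

from itertools import islice

it = iter(range(10))
print(list(islice(it, 3)))        # first three elements -> [0, 1, 2]
print(list(islice(it, 2, None)))  # skip two more, take the rest -> [5, 6, 7, 8, 9]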
groupby()
It lets you do something like the Unix command `uniq`.
from itertools import groupby
a = [1,1,1,0,0,1]
for k, g in groupby(a):
    print(k, list(g))
1 [1, 1, 1]
0 [0, 0]
1 [1]
When you pass an iterable as the first argument like this, it yields pairs of the key value and an iterator over that group. As with `sort`, you can also pass `key` as the second argument. I often use `operator.itemgetter` (which you probably know if you read the Sort HOW TO mentioned in Chapter 2). Lambda expressions that return a Boolean are also common.
from operator import itemgetter
a = [(3, 0), (4, 0), (2, 1)]
for k, g in groupby(a, key=itemgetter(1)):
    print(k, list(g))
0 [(3, 0), (4, 0)]
1 [(2, 1)]
chain.from_iterable()
It flattens a two-dimensional array into one dimension.
from itertools import chain
a = [[1, 2], [3, 4], [5, 6]]
print(list(chain.from_iterable(a)))
[1, 2, 3, 4, 5, 6]
zip_longest()
The built-in `zip()` stops at the shortest iterable; use `zip_longest()` when you want to continue to the longest one. By default the missing values are filled with `None`, but you can specify the fill value with the keyword argument `fillvalue`.
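A quick sketch of the difference:

from itertools import zip_longest

print(list(zip(['a', 'b', 'c'], [1, 2])))                       # [('a', 1), ('b', 2)]
print(list(zip_longest(['a', 'b', 'c'], [1, 2], fillvalue=0)))  # [('a', 1), ('b', 2), ('c', 0)]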
product()
It computes the Cartesian product. `permutations()` and `combinations()` are also a pain to implement yourself, so I think they are worth knowing.
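A toy example of all three:

from itertools import product, permutations, combinations

print(list(product('AB', [0, 1])))   # [('A', 0), ('A', 1), ('B', 0), ('B', 1)]
print(list(permutations('AB')))      # [('A', 'B'), ('B', 'A')]
print(list(combinations('ABC', 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]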
These are just a few of `itertools`; if you're interested, read the documentation (https://docs.python.org/ja/3/library/itertools.html).
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and POS subcategory 1 (pos1), and represent one sentence as a list of such mappings. Use this program for the remaining problems in Chapter 4.
If you're an object-oriented enthusiast you'll want to use classes here, but that's saved for the next chapter. You could also use pandas as in Chapter 2, but I'll pass on that since it seems to deviate from the intent of the problem.
Below is an example of the answer.
q30.py
import argparse
from itertools import groupby, islice
from pprint import pprint
import sys
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('num', type=int)
    args = parser.parse_args()
    for sent_lis in islice(read_mecab(sys.stdin), args.num - 1, args.num):
        pprint(sent_lis)


def read_mecab(fi):
    for is_eos, sentence in groupby(fi, lambda line: line.startswith('EOS')):
        if not is_eos:
            yield list(map(line2dic, sentence))


def line2dic(line):
    surface, info = line.rstrip().split('\t')
    col = info.split(',')
    dic = {'surface': surface,
           'pos': col[0],
           'pos1': col[1],
           'base': col[6]}
    return dic


if __name__ == '__main__':
    main()
$ python q30.py 2 < neko.txt.mecab
[{'base': '\u3000', 'pos': 'symbol', 'pos1': 'blank', 'surface': '\u3000'},
 {'base': 'I', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'I'},
 {'base': 'is', 'pos': 'particle', 'pos1': 'particle', 'surface': 'is'},
 {'base': 'cat', 'pos': 'noun', 'pos1': 'general', 'surface': 'cat'},
 {'base': 'da', 'pos': 'auxiliary verb', 'pos1': '', 'surface': 'in'},
 {'base': 'are', 'pos': 'auxiliary verb', 'pos1': '', 'surface': 'are'},
 {'base': '.', 'pos': 'symbol', 'pos1': 'punctuation', 'surface': '.'}]
`main()` just narrows down which sentence to output. Using `pprint.pprint()` instead of `print()` pretty-prints the result with line breaks adjusted.
This kind of format can be handled elegantly by passing a function that returns whether a line is `EOS` as the `key` of `groupby()`. `yield` already appeared in Chapter 2.
The tricky part is `list(map())`. `map(func, iterable)` applies the function `func` to each element of `iterable` and returns an iterator. The result is the same as `[line2dic(x) for x in sentence]`, but calling your own function inside a `for` statement is said to be slow in Python, so I adopted this notation (reference: https://qiita.com/hi-asano/items/aa2976466739f280b887).
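As a quick sanity check of that equivalence (a toy example, not part of the knocks):

def double(x):
    return x * 2

assert list(map(double, [1, 2, 3])) == [double(x) for x in [1, 2, 3]]  # both give [2, 4, 6]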
Extract all surface forms of verbs.
Below is an example of the answer.
q31.py
from itertools import islice
import sys
from q30 import read_mecab
def main():
    for sent_lis in islice(read_mecab(sys.stdin), 5):
        for word in filter(lambda x: x['pos'] == 'verb', sent_lis):
            print(word['surface'])


if __name__ == '__main__':
    main()
$ python q31.py < neko.txt.mecab
Born Tsuka Shi Crying Shi Is
To keep things short, I only fetch the first few sentences. `argparse` is also omitted because it felt like more trouble than it's worth here.
I didn't have much else to say, so I deliberately used `filter()`. It is equivalent to `(x for x in iterable if condition(x))` and returns only the elements that satisfy the condition. Honestly, an `if` clause is usually enough, so it doesn't see much use (and in this case I suspect `filter()` is actually slower).
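The equivalence, as a toy example:

nums = [3, 1, 4, 1, 5]
assert list(filter(lambda x: x > 2, nums)) == [x for x in nums if x > 2]  # both give [3, 4, 5]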
Extract all base forms of verbs.
Omitted, since it is almost identical to problem 31.
Extract noun phrases in which two nouns are connected by "no" (の).
Nothing special to say here; brute force works. Below is an example of the answer.
q33.py
from itertools import islice
import sys
from q30 import read_mecab
def main():
    for sent_lis in islice(read_mecab(sys.stdin), 20):
        for i in range(len(sent_lis) - 2):
            if (sent_lis[i+1]['base'] == 'of' and sent_lis[i]['pos'] == 'noun'
                    and sent_lis[i+2]['pos'] == 'noun'):
                print(''.join(x['surface'] for x in sent_lis[i: i+3]))


if __name__ == '__main__':
    main()
$ python q33.py < neko.txt.mecab
His palm On the palm Student's face Should face In the middle of the face In the hole Calligraphy palm The back of the palm
I mentioned earlier that Python's line-continuation character is `\`, but inside parentheses you can break lines freely. The condition of the `if` statement above is wrapped in otherwise unnecessary parentheses just so I could make that point.
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
Just apply `groupby()` with the part of speech as the key. Below is an example of the answer.
q34.py
import sys
from itertools import groupby, islice
from q30 import read_mecab
def main():
    for sent_lis in islice(read_mecab(sys.stdin), 20):
        for key, group in groupby(sent_lis, lambda word: word['pos']):
            if key == 'noun':
                words = [word['surface'] for word in group]
                if len(words) > 1:
                    print(''.join(words))


if __name__ == '__main__':
    main()
$ python q34.py < neko.txt.mecab
In humans The worst Timely One hair Then the cat one time Puupuu and smoke
Find all words that appear in the text and their frequencies, and arrange them in descending order of frequency.
Just use `collections.Counter`. Below is an example of the answer.
q35.py
import sys
from collections import Counter
from pprint import pprint
from q30 import read_mecab
def get_freq():
    word_freq = Counter(word['surface'] for sent_lis in read_mecab(sys.stdin)
                        for word in sent_lis)
    return word_freq.most_common(10)


if __name__ == '__main__':
    pprint(get_freq())
$ python q35.py < neko.txt.mecab
[('No', 9194), ('。', 7486), ('Te', 6868), ('、', 6772), ('Ha', 6420), ('To', 6243), ('To', 6071), ('And', 5508), ('Ga', 5337), ('Ta', 3988)]
matplotlib
The upcoming problems require drawing graphs, so it's finally time for this one to take the stage. Let's `pip install` it. I don't really want to explain external modules in an "Introduction to Python" series, and a serious explanation of matplotlib could fill a whole book. First, read this Qiita article to understand matplotlib's layered structure. Pretty messy, isn't it? `pyplot` is the part that sets all of that up with sensible defaults. This time I won't fine-tune the appearance, so that's the route I'll take. Here's a simple example.
import matplotlib.pyplot as plt

# Specify the graph type and the data
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
# Configure the appearance
plt.title('example')
plt.ylabel('some numbers')
# Draw
plt.show()
The import statement means "import the submodule `pyplot` of the `matplotlib` module under the name `plt`".
First, decide on the graph type: `plt.plot()` for line graphs, `plt.barh()` for horizontal bar graphs, `plt.bar()` for vertical bar graphs, and so on, then pass the data. (The data is converted to an array internally anyway, so it might be better to pass a numpy array directly.)
Next, configure the appearance. `plt.yticks()` lets you set the y-axis tick positions and the labels attached to them. `plt.xlim()` sets the minimum and maximum of the x-axis. `plt.yscale("log")` makes the y-axis logarithmic.
Finally, draw. I'm working in Jupyter, so I use `plt.show()`. If you're running a script, use `plt.savefig(filename)` to write the figure to a file. A combined sketch follows below.
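Here is a minimal sketch combining the calls above on toy data (use `plt.show()` instead of `plt.savefig()` in Jupyter):

import matplotlib.pyplot as plt

plt.plot(range(1, 11), [2 ** i for i in range(10)])
plt.yscale('log')                                  # logarithmic y-axis
plt.xlim(1, 10)                                    # x-axis range
plt.yticks([1, 10, 100], ['one', 'ten', 'hundred'])  # tick positions and labels
plt.savefig('example.png')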
To be honest, this style is MATLAB-like rather than Pythonic, but it is easy.
The default font does not support Japanese, so Japanese text renders as tofu (empty boxes). To display Japanese in a graph you can set a font that supports it, but depending on your environment the Japanese font may not be found, or may not be installed at all. It's a pain. japanize-matplotlib will make your life easier.
Display the 10 words that appear frequently and their frequency of appearance in a graph (for example, a bar graph).
from collections import Counter
from q30 import read_mecab
import matplotlib.pyplot as plt
import japanize_matplotlib
word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab'))
                    for word in sent_lis)
word, count = zip(*word_freq.most_common(10))
len_word = range(len(word))
plt.barh(len_word, count, align='center')
plt.yticks(len_word, word)
plt.xlabel('frequency')
plt.ylabel('word')
plt.title('36.Top 10 most frequent words')
plt.show()
What is the `*` in that `zip()` call? What I want to do here is transpose: turning data like `[[a, b], [c, d]]` into `[[a, c], [b, d]]`. The easiest way to transpose is to write `zip(*seq)`, which is equivalent to `zip(seq[0], seq[1], ...)` (see Unpacking Argument Lists, https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists). `zip([a, b], [c, d])` gives `[(a, c), (b, d)]`, right? Combined with unpacking assignment, you can assign the results to separate variables at once.
(This also completes the explanation of the alternative solution to the ngram function in Chapter 1.)
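A minimal illustration of the transpose-and-unpack pattern:

pairs = [('cat', 9), ('dog', 5), ('bird', 2)]
words, counts = zip(*pairs)  # same as zip(('cat', 9), ('dog', 5), ('bird', 2))
print(words)   # ('cat', 'dog', 'bird')
print(counts)  # (9, 5, 2)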
Keep in mind that the top words in word frequency are function words (particles, punctuation).
Display 10 words that often co-occur with "cat" (high frequency of co-occurrence) and their frequency of appearance in a graph (for example, a bar graph).
word_freq = Counter()
for sent_lis in read_mecab(open('neko.txt.mecab')):
    for word in sent_lis:
        if word['surface'] == 'Cat':
            word_freq.update(x['base'] for x in sent_lis if x['surface'] != 'Cat')
            break
words, count = zip(*word_freq.most_common(10))
len_word = range(len(words))
plt.barh(len_word, count, align='center')
plt.yticks(len_word, words)
plt.xlabel('frequency')
plt.ylabel('word')
plt.title('37.Top 10 words that frequently co-occur with "cat"')
plt.show()
A `Counter` object can be updated with `Counter.update(iterable)`. That said, unless you restrict the count to content words, the result is completely uninteresting.
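A toy example of `Counter.update()`:

from collections import Counter

c = Counter(['a', 'b'])
c.update(['b', 'c'])  # add counts from any iterable
print(c)              # Counter({'b': 2, 'a': 1, 'c': 1})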
Draw a histogram of the frequency of occurrence of words (the horizontal axis represents the frequency of occurrence and the vertical axis represents the number of types of words that take the frequency of occurrence as a bar graph).
word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab'))
                    for word in sent_lis)
data = Counter(count for count in word_freq.values())
x, y = data.keys(), data.values()
plt.bar(x, y)
plt.title("38.histogram")
plt.xlabel("frequency")
plt.ylabel("number of the words")
plt.xlim(1, 30)
plt.show()
You can get all keys with `dict.keys()` and all values with `dict.values()`.
Looking at the number of word types, we can see that most words are infrequent; the relationship looks roughly inversely proportional. This is one reason why handling low-frequency words matters in deep learning.
Plot a log-log graph with the frequency of occurrence of words on the horizontal axis and the frequency of occurrence on the vertical axis.
word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab'))
                    for word in sent_lis)
_, count = zip(*word_freq.most_common())
plt.plot(range(1, len(count)+1), count)
plt.yscale("log")
plt.xscale("log")
plt.title("39.Zipf's law")
plt.xlabel("log(rank)")
plt.ylabel("log(frequency)")
plt.show()
The slope of the log-log plot is about -1, which means freq ∝ rank^(-1). This seems related to the result of problem 38. Google "Zipf's law" for the details.
In this chapter we covered:
- itertools
- map(), filter()
- matplotlib
- transposing with zip(*seq)
- dict.keys(), dict.values()
Next time we'll finally use classes. Will that be the last installment of this Introduction to Python?
(5/16) Update: here it is → https://qiita.com/hi-asano/items/5e18e3a5a711a752ad99