These are my notes on where I stumbled in Chapter 2 of "Deep Learning from Scratch ❷ --- Natural Language Processing".
The execution environment is macOS Catalina + Anaconda 2019.10, with Python 3.7.4. For details, see Chapter 1 of these notes.
This chapter begins the story of natural language processing.
Natural language processing is often described as "technology (a field) for making computers understand our language", but the phrase "making computers understand" invites an inflated image of something like Doraemon. I think a better way to put it is "making language processable by computers".
Numerical data is easy to process: you can sum it, average it, compare it, visualize it with graphs, and forecast the future from time series; you can also apply the deep learning covered in the previous volume. Natural language data does not lend itself to that as it is, and natural language processing is the technology that makes it possible.
Note also that the abbreviation NLP is shared with Neuro-Linguistic Programming, and if you google "NLP", neuro-linguistic programming comes up first. The two fields are so different that confusion seems unlikely, but when the term pops up while you are studying deep learning, you may briefly wonder whether it is related. Keep that in mind.
The book deals only with English, so here is a note about Japanese thesauri.
- WordNet has a Japanese version, Japanese WordNet. However, I have not confirmed whether it can be used from NLTK as in "Appendix B: Running WordNet" in the book (a minimal lookup sketch follows this list).
- Its data is not published for programmatic use, but the thesaurus built by the Japan Science and Technology Agency (JST) seems to be well known. There is a term search site called JST Thesaurus Map, which displays graphs based not only on the thesaurus but also on co-occurrence frequencies in the literature. For example, if you search for "car", a graph appears that is so large you cannot see it all without scrolling, and you can follow terms by double-clicking.
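For reference, here is a minimal sketch of what such a lookup might look like through NLTK's Open Multilingual WordNet, which includes Japanese WordNet. I have not verified this in my environment, and the corpus names (`wordnet`, `omw-1.4`) and the `lang='jpn'` argument are assumptions based on NLTK's documentation, not something covered in the book.

```python
# Unverified sketch: querying Japanese WordNet via NLTK's Open Multilingual WordNet.
import nltk
nltk.download('wordnet')   # English WordNet (used as the pivot)
nltk.download('omw-1.4')   # Open Multilingual WordNet, which includes Japanese

from nltk.corpus import wordnet as wn

# Synsets containing the Japanese word "車" (car), with their Japanese lemmas
for synset in wn.synsets('車', lang='jpn'):
    print(synset.name(), synset.lemma_names('jpn'))
```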
I had already studied the count-based method about three years ago with [Language Processing 100 Knock 2015](http://www.cl.ecei.tohoku.ac.jp/nlp100/), so this part was mostly a review for me. "Chapter 9: Vector Space Methods (I)" of the 100 knocks corresponds to "2.3 Count-based Methods" and "2.4 Improving Count-based Methods" in this book, so apart from the singular value decomposition that comes later, there was no particular stumbling block.
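As a quick refresher, here is a minimal sketch of the count-based method from section 2.3, assuming the book's `common/util.py` (with `preprocess`, `create_co_matrix`, and `cos_similarity`) is importable from the parent directory, as in the book's own examples.

```python
# Refresher sketch of the count-based method (section 2.3)
# using the helper functions that come with the book.
import sys
sys.path.append('..')
from common.util import preprocess, create_co_matrix, cos_similarity

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size, window_size=1)  # co-occurrence matrix

# Similarity of "you" and "i" based on their co-occurrence vectors
print(cos_similarity(C[word_to_id['you']], C[word_to_id['i']]))
```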
Figure 2-8 in "2.4.2 Dimensionality Reduction" may be a little hard to grasp. If the figure does not give you the picture, I think the height-and-weight example at the beginning of @aya_taka's [Machine learning term "Dimensionality Reduction" in 30 minutes](https://qiita.com/aya_taka/items/4d3996b3f15aa712a54f) is easy to understand.
I stumbled on singular value decomposition (SVD). This is actually my third time studying it (in Coursera's online Machine Learning course about four years ago, and in the aforementioned Language Processing 100 Knock 2015), so I can grasp the idea, but I still do not understand what the calculation itself is doing. Googling a little did not make the explanations click, and it seems I would have to relearn matrix algebra properly. NumPy (and, next, scikit-learn) does the math for me, so I gratefully accepted that and decided to move on :sweat:
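Even so, it is easy to confirm numerically what SVD gives you. Here is a tiny sketch using only NumPy: a matrix is decomposed into U, S and V, multiplying them back reproduces the original, and keeping only the largest singular values gives the low-rank approximation used for dimensionality reduction.

```python
# Tiny numerical check of SVD: X ≈ U @ diag(S) @ V, and truncating S
# gives a low-rank approximation (the idea behind dimensionality reduction).
import numpy as np

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 1.0]])

U, S, V = np.linalg.svd(X)
print(np.allclose(X, U @ np.diag(S) @ V))  # True: the decomposition reproduces X

# Keep only the two largest singular values -> best rank-2 approximation of X
X2 = U[:, :2] @ np.diag(S[:2]) @ V[:2, :]
print(X2)
```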
"2.4.4 PTB Dataset" uses the English PTB corpus as a large corpus, but I love Japanese, so I decided to try it in Japanese. Unlike English, Japanese does not have spaces at the boundaries of words, so it is necessary to process the word division with spaces, but this time it has been done Aozora Bunko's divided text / segavvy / wakatigaki-aozorabunko) is used.
The book uses the PTB corpus through `dataset/ptb.py`, so I modified it into `dataset/aozorabunko.py`. The source code is below, but first a few notes.
- The target data is just 13 works by 3 authors concatenated together, so it is quite biased. Please treat this as nothing more than "I tried it", not as something that can serve as a benchmark for the method.
- `ptb.load_data()` accepts `'train'`, `'test'`, or `'valid'` as its argument, but so far only `'train'`, which is all I use this time, is supported. I am thinking of adding works by the same authors that I have not used yet.
- `len(corpus)` was 929,589 for the PTB corpus, but for this Aozora Bunko data it is 873,028, which is a little smaller.
- In my notes on the first volume I wrote the source code almost from scratch, but that made it hard to refer back to for anyone working from the book's code, so this time I reused the book's code as much as possible. I marked the main modifications with ★. (It is a little painful that flake8, the linter in Visual Studio Code, covers it with red squiggles :disappointed:)
dataset/aozorabunko.py
```python
# coding: utf-8
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle
import numpy as np


# ★ This URL is the download URL of the merged Aozora Bunko works I uploaded to GitHub.
#    For details, see https://github.com/segavvy/wakatigaki-aozorabunko.
url_base = 'https://github.com/segavvy/wakatigaki-aozorabunko/raw/master/'
key_file = {
    'train': '20200516merge.txt',
    'test': '',   # ★ Not prepared yet because it is not used yet
    'valid': ''   # ★ Not prepared yet because it is not used yet
}
save_file = {
    'train': 'aozorabunko.train.npy',
    'test': 'aozorabunko.test.npy',
    'valid': 'aozorabunko.valid.npy'
}
vocab_file = 'aozorabunko.vocab.pkl'

dataset_dir = os.path.dirname(os.path.abspath(__file__))


def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')

    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)

    print('Done')


# ★ Text splitting is used in two places, so it is factored out into a function.
#    The implementation is very ad hoc...
def _split_data(text):
    return text.replace('\n', '<eos> ').replace('。', '<eos> ').strip().split()


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = _split_data(open(file_path).read())

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_data(data_type='train'):
    '''
        :param data_type: data type: 'train' or 'test' or 'valid (val)'
        :return:
    '''
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = _split_data(open(file_path).read())
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word


if __name__ == '__main__':
    for data_type in ('train', 'val', 'test'):
        load_data(data_type)
```
Put this file in the `dataset` directory, import `aozorabunko.py` instead of `ptb.py`, and call `aozorabunko.load_data()` instead of `ptb.load_data()`; then you can use the Aozora Bunko data instead of the PTB corpus.
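In other words, the minimal usage looks like this (the call pattern is the same as `ptb.load_data()`):

```python
# Minimal usage sketch: load the Aozora Bunko corpus the same way as the PTB corpus
import sys
sys.path.append('..')
from dataset import aozorabunko

corpus, word_to_id, id_to_word = aozorabunko.load_data('train')
print('corpus size:', len(corpus))     # number of tokens
print('vocab size:', len(word_to_id))  # number of distinct words
```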
Also, although "2.4.5 Evaluation with the PTB Dataset" says that "the sklearn module must be installed", this `sklearn` is [scikit-learn](https://scikit-learn.org/), a Python machine learning library that comes bundled with Anaconda. So if you installed Anaconda following the steps in Chapter 1 of the previous volume, you can use it without doing anything.
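If you want to be sure, a quick check like the following should succeed in an Anaconda environment (just a sketch that confirms `randomized_svd`, which `count_method_big.py` uses, can be imported):

```python
# Quick sanity check that scikit-learn (sklearn) is available
import sklearn
from sklearn.utils.extmath import randomized_svd  # used by count_method_big.py

print(sklearn.__version__)
```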
Also, the PPMI calculation takes a very long time (several hours in my environment), so I changed the code to cache the result in a file once it has been calculated. I also wanted to try various queries, so I changed it to read queries from standard input.
Below is the modified `ch02/count_method_big.py`. I marked the main modifications with ★.
ch02/count_method_big.py
```python
# coding: utf-8
import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import aozorabunko  # ★ Changed to use the Aozora Bunko corpus
import os                        # ★ Added to cache the PPMI calculation result
import pickle                    # ★ Added to cache the PPMI calculation result


window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = aozorabunko.load_data('train')  # ★ Changed the corpus
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)

# ★ The PPMI calculation takes time, so the previous result is cached
#    and reused if C is unchanged.
print('calculating PPMI ...')
W = None
ppmi_path = os.path.dirname(os.path.abspath(__file__)) + '/' + 'ppmi.pkl'
if os.path.exists(ppmi_path):
    # ★ Read the cache
    with open(ppmi_path, 'rb') as f:
        cache_C, cache_W = pickle.load(f)
    if np.array_equal(cache_C, C):
        W = cache_W  # Reuse because the contents of C are the same
if W is None:
    W = ppmi(C, verbose=True)
    with open(ppmi_path, 'wb') as f:
        pickle.dump((C, W), f)  # Save as cache

print('calculating SVD ...')
try:
    # truncated SVD (fast!)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    # SVD (slow)
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

# ★ Changed to read queries from standard input
while True:
    query = input('\nquery? ')
    if not query:
        break
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
```
Below are the results of trying some queries. First, the queries used in the book (in their Japanese translation).
```
[query] you
 Wife: 0.6728986501693726
 wife: 0.6299399137496948
 K: 0.6205178499221802
 father: 0.5986840128898621
 I: 0.5941839814186096

[query] Year
 Anti: 0.8162745237350464
 hundred: 0.8051895499229431
 Minutes: 0.7906433939933777
 Eight: 0.7857747077941895
 Circle: 0.7682645320892334

[query] car
 door: 0.6294019222259521
 Door: 0.6016885638237
 Automobile: 0.5859153270721436
 gate: 0.5726617574691772
 curtain: 0.5608214139938354

Toyota is not found
```
"You" feels good. "Year" seems to have a synonym as a unit. "Car" is not good because it rarely appears in the works used for the corpus. "Toyota" doesn't exist in the first place, so it can't be helped.
Here are some other things I have tried. The first half is relatively good and the second half is not good.
```
[query] Morning
 night: 0.7267987132072449
 Around: 0.660172164440155
 Noon: 0.6085118055343628
 evening: 0.6021789908409119
 Next time: 0.6002975106239319

[query] school
 Tokyo: 0.6504884958267212
 Higher: 0.6290650367736816
 Junior high school: 0.5801640748977661
 University: 0.5742003917694092
 Boarding house: 0.5358142852783203

[query] Zashiki
 Study: 0.6603355407714844
 Sou side: 0.6362787485122681
 Room: 0.6142982244491577
 room: 0.6024710536003113
 kitchen: 0.6014574766159058

[query] kimono
 Beard: 0.5216895937919617
 black: 0.5200990438461304
 clothes: 0.5096032619476318
 Clothes: 0.48781922459602356
 hat: 0.4869200587272644

[query] I
 master: 0.6372452974319458
 Extra: 0.5826579332351685
 Kaneda: 0.4684762954711914
 they: 0.4676626920700073
 Labyrinth: 0.4615904688835144

[query] Criminal
 Phantom: 0.6609077453613281
 Thieves: 0.6374931931495667
 Member: 0.6308270692825317
 that person: 0.6046633720397949
 Dive: 0.5931873917579651

[query] order
 Talk: 0.6200630068778992
 Consultation: 0.5290789604187012
 Busy: 0.5178924202919006
 Kindness: 0.5033778548240662
 Lecture: 0.4894390106201172

[query] Gunless gun
 Obsolete: 0.7266454696655273
 Old-fashioned: 0.6771457195281982
 saw: 0.6735808849334717
 Nose breath: 0.6516652703285217
 ignorance: 0.650424063205719

[query] Cat
 amen: 0.6659030318260193
 Nobume: 0.5759447813034058
 Ink: 0.5374482870101929
 Status: 0.5352671146392822
 usually: 0.5205280780792236

[query] Liquor
 book: 0.5834404230117798
 tea: 0.469807893037796
 Rest: 0.4605821967124939
 Eat: 0.44864168763160706
 rod: 0.4349029064178467

[query] cuisine
 Skein: 0.5380040407180786
 Sign: 0.5214874744415283
 original: 0.5175281763076782
 Law: 0.5082278847694397
 Shop: 0.5001937747001648
```
By the way, the authors of the target data are Soseki Natsume, Kenji Miyazawa, and Ranpo Edogawa. The corpus is rather biased, but the results are interesting, so give it a try if you like.
Much of this chapter was review for me, so I could read it relatively smoothly. The next chapter looks like it will be the real challenge.
That's all for this chapter. If you notice any mistakes, I would appreciate it if you could point them out.