For beginners in natural language processing with machine learning (especially the self-taught), isn't gathering a Japanese corpus (a large collection of sentences) the hardest part?
"Deep Learning from scratch ❷ ~ Natural language processing ~", the book covered in this article, and most other books as well basically work with English corpora, so it is currently difficult to get hands-on experience processing a Japanese corpus, which behaves quite differently from English. (At least, I had a lot of trouble because I could not gather any Japanese corpus.)
So this time, using articles from "livedoor news" (the Dokujo Tsushin column), which anyone working on machine learning has probably downloaded at least once, I would like to work through the excellent book "Deep Learning from scratch ❷ ~ Natural language processing ~" in Japanese.
In this article I swap the corpus for a Japanese one and implement the following range of "Deep Learning from scratch ❷". Unlike English, Japanese requires troublesome preprocessing, so please pay particular attention to that part.
Target
Subject: "Deep Learning from scratch ❷"
Scope of this article: Chapter 2 "Natural Language and Distributed Representations of Words", 2.3 Count-based Methods through 2.4.5 Evaluation with the PTB Dataset
Environment: macOS (Mojave), Python 3 (3.7.4), Jupyter Notebook
The original data comes as a separate text file for each article delivery date, which is awkward to handle as-is (well over 100 files), so first combine all the text files into a single new text file. On a Mac you can concatenate multiple text files with the following command (a Python alternative is sketched after the reference link below).
Terminal
$ cat ~/directory_name/*.txt > new_text_file_name.txt
Reference
https://ultrabem-branch3.com/informatics/commands_mac/cat_mac
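[Supplement] If the shell command feels awkward, the same concatenation can also be done in Python. This is only a minimal sketch: replace directory_name and new_text_file_name.txt with your own paths, and make sure the output file is not inside the directory being globbed, or it will be swallowed on a re-run.
python
import glob

# concatenate every .txt file in the directory into one new file
with open("new_text_file_name.txt", mode="w", encoding="utf-8") as out:
    for path in sorted(glob.glob("directory_name/*.txt")):
        with open(path, mode="r", encoding="utf-8") as f:
            out.write(f.read())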
[Digression]
It really is just a digression, but I personally had a hard time with the step above.
At first I moved to the folder and ran "
cat *.txt > new_text_file_name.txt
", but the process would not finish, probably because the directory name was not specified (maybe the wildcard was trying to read every text file on my PC?), and in the end I got warning after warning that there was not enough disk space. Please be careful.
** ⑴ Splitting the text into words **
python
import sys
sys.path.append('..')
import re
import pickle
from janome.tokenizer import Tokenizer
import numpy as np
import collections

with open("corpus/dokujo-tsushin/dokujo-tsushin-half.txt", mode="r", encoding="utf-8") as f:  #Note 1)
    original_corpus = f.read()

text = re.sub("http://news.livedoor.com/article/detail/[0-9]{7}/", "", original_corpus)  #Note 2) remove the article URLs
text = re.sub("[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+[0-9]{4}", "", text)  #Note 3) remove the delivery timestamps
text = re.sub("[\f\n\r\t\v]", "", text)  #remove line breaks, tabs and other control characters
text = re.sub(" ", "", text)  #remove spaces
text = re.sub("[「」]", "", text)  #remove Japanese quotation brackets
text = [re.sub("[()]", "", text)]  #remove parentheses and wrap the whole corpus in a list

#<Point>
t = Tokenizer()
words_list = []
for word in text:
    #wakati=True returns only the surface forms; list() keeps the result picklable
    #even with janome versions that return a generator
    words_list.append(list(t.tokenize(word, wakati=True)))

with open("words_list.pickle", mode='wb') as f:
    pickle.dump(words_list, f)
Note 1) Unlike "Deep Learning from scratch ❷", this time we prepared the corpus ourselves, so we load it here. Also, the code reads "dokujo-tsushin-half.txt": I originally intended to read "dokujo-tsushin-all.txt", but a warning was issued that it could not be loaded because it was too large. I therefore gave up on "all" and used "half" (that is, in step ⓪ I concatenated only half of all the pre-prepared text files).
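If you follow the Python route from the supplement above, step ⓪ (building the "half" file) is just the same loop restricted to the first half of the file list. A rough sketch, assuming the article files sit under corpus/dokujo-tsushin/ and that the "first half" split is acceptable:
python
import glob

# gather the original article files (excluding the combined files made earlier)
paths = sorted(p for p in glob.glob("corpus/dokujo-tsushin/*.txt")
               if "dokujo-tsushin-half" not in p and "dokujo-tsushin-all" not in p)
half_paths = paths[:len(paths) // 2]  # keep only the first half

with open("corpus/dokujo-tsushin/dokujo-tsushin-half.txt", mode="w", encoding="utf-8") as out:
    for path in half_paths:
        with open(path, mode="r", encoding="utf-8") as f:
            out.write(f.read())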
python
with open('words_list.pickle', mode='rb') as f:
    words_list = pickle.load(f)

print(words_list)  #this print is only needed if you want to check the loaded result
# =>output
#[['friend', 'representative', 'of', 'speech', '、', 'Germany', 'woman', 'Is', 'How', 'Doing', 'hand', 'Is', '?', 'soon', 'June', '・', 'Bride', 'When', 'Call', 'To be', 'June', '。', ...]] ・ ・ ・ Omitted below
** ⑵ Assigning IDs to the words **
python
def preprocess(words_list):
    word_to_id = {}
    id_to_word = {}
    #<Point>
    for words in words_list:
        for word in words:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word

    corpus = [word_to_id[w] for words in words_list for w in words]

    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess(words_list)
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['woman']:", word_to_id['woman'])
print("word_to_id['marriage']:", word_to_id['marriage'])
print("word_to_id['husband']:", word_to_id['husband'])
# =>output
# corpus size: 328831
# corpus[:30]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 5, 6, 2, 22, 23, 7, 24, 2]
# id_to_word[0]:friend
# id_to_word[1]:representative
# id_to_word[2]:of
# word_to_id['woman']: 6
# word_to_id['marriage']: 456
# word_to_id['husband']: 1453
Point
- The preprocess function is basically the same as in the book, but since the text has already been split into words, that part is removed, and unlike the book the IDs are assigned inside a doubly nested for loop.
- The reason the for loop is nested twice is that, because of the word-segmentation step, the words are held in a nested list (a list of token lists), unlike in the book (see the toy example below).
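To make that concrete, here is a toy example (the sentences are my own, not taken from the corpus) showing why both the nested loop and the nested comprehension are needed: words_list is a list of token lists, not a flat list of words.
python
words_list = [['私', 'は', '犬', 'が', '好き'], ['犬', 'は', '可愛い']]  #a list of token lists

word_to_id, id_to_word = {}, {}
for words in words_list:      #outer loop: one token list at a time
    for word in words:        #inner loop: the tokens themselves
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

corpus = [word_to_id[w] for words in words_list for w in words]  #flatten into one ID sequence
print(corpus)  # => [0, 1, 2, 3, 4, 2, 1, 5]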
I will omit detailed explanations of the code below because it largely overlaps with the book, but some comments are included, so I hope you will find them helpful.
python
#Creating a co-occurrence matrix
def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix
#Judging similarity between vectors (cosine similarity)
def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)
#Ranking the similarity between vectors
def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
#Improving the word relevance measure with positive pointwise mutual information (PPMI)
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total // 100) == 0:
                    print('%.1f%% done' % (100 * cnt / total))

    return M
window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = preprocess(words_list)
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    #Dimensionality reduction with SVD using sklearn (fast randomized SVD)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    #Fall back to NumPy's exact (and much slower) SVD if sklearn is unavailable
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['Female', 'marriage', 'he', 'Mote']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
# =>Below, the output result
"""
[query]Female
male: 0.6902421712875366
Etc.: 0.6339510679244995
model: 0.5287646055221558
generation: 0.5057054758071899
layer: 0.47833186388015747
[query]marriage
love: 0.5706729888916016
Dating: 0.5485040545463562
Opponent: 0.5481910705566406
?。: 0.5300850868225098
Ten: 0.4711574614048004
[query]he
Girlfriend: 0.7679144740104675
boyfriend: 0.67448890209198
husband: 0.6713247895240784
parent: 0.6373711824417114
Former: 0.6159241199493408
[query]Mote
Ru: 0.6267833709716797
Consideration: 0.5327887535095215
Twink: 0.5280393362045288
Girls: 0.5190156698226929
bicycle: 0.5139431953430176
"""