For beginners in natural language processing with machine learning (especially the self-taught), isn't gathering a Japanese corpus (a large collection of sentences) the hardest part?
"Deep Learning from scratch ❷ ~ Natural language processing ~", the book covered in this article, and most other books as well basically work with English corpora, so it is currently difficult to get hands-on experience processing a Japanese corpus, which behaves quite differently from English. (At least, I had a lot of trouble because I could not gather any Japanese corpus.)
So this time, using articles from "livedoor news" (the Dokujo Tsushin column), which anyone working on machine learning has probably downloaded at least once, I would like to work through the excellent book "Deep Learning from scratch ❷ ~ Natural language processing ~" in Japanese.
In this article I swap the corpus for a Japanese one and implement the following range of "Deep Learning from scratch ❷". Unlike English, Japanese requires troublesome preprocessing, so please pay particular attention to that part.
Target
Subject: "Deep Learning from scratch ❷"
Scope of this article: Chapter 2 "Natural Language and Distributed Representations of Words", 2.3 Count-based Methods through 2.4.5 Evaluation with the PTB Dataset
Environment: macOS (Mojave), Python 3 (3.7.4), Jupyter Notebook
The original data comes as a separate text file for each article delivery date, which is awkward to handle as-is (well over 100 files), so first combine all the text files into a single new text file. On a Mac you can concatenate multiple text files with the following command (a Python alternative is sketched after the reference link below).
Terminal
$ cat ~/directory_name/*.txt > new_text_file_name.txt
Reference
https://ultrabem-branch3.com/informatics/commands_mac/cat_mac
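[Supplement] If the shell command feels awkward, the same concatenation can also be done in Python. This is only a minimal sketch: replace directory_name and new_text_file_name.txt with your own paths, and make sure the output file is not inside the directory being globbed, or it will be swallowed on a re-run.
python
import glob

# concatenate every .txt file in the directory into one new file
with open("new_text_file_name.txt", mode="w", encoding="utf-8") as out:
    for path in sorted(glob.glob("directory_name/*.txt")):
        with open(path, mode="r", encoding="utf-8") as f:
            out.write(f.read())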
[Digression]
It really is just a digression, but I personally had a hard time with the step above.
At first I moved to the folder and ran "
cat *.txt > new_text_file_name.txt
", but the process would not finish, probably because the directory name was not specified (maybe the wildcard was trying to read every text file on my PC?), and in the end I got warning after warning that there was not enough disk space. Please be careful.
** ⑴ Splitting the text into words **
python
import sys
sys.path.append('..')
import re
import pickle
from janome.tokenizer import Tokenizer
import numpy as np
import collections

with open("corpus/dokujo-tsushin/dokujo-tsushin-half.txt", mode="r", encoding="utf-8") as f:  #Note 1)
    original_corpus = f.read()

text = re.sub("http://news.livedoor.com/article/detail/[0-9]{7}/", "", original_corpus)  #Note 2) remove the article URLs
text = re.sub("[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+[0-9]{4}", "", text)  #Note 3) remove the delivery timestamps
text = re.sub("[\f\n\r\t\v]", "", text)  #remove line breaks, tabs and other control characters
text = re.sub(" ", "", text)  #remove spaces
text = re.sub("[「」]", "", text)  #remove Japanese quotation brackets
text = [re.sub("[()]", "", text)]  #remove parentheses and wrap the whole corpus in a list

#<Point>
t = Tokenizer()
words_list = []
for word in text:
    #wakati=True returns only the surface forms; list() keeps the result picklable
    #even with janome versions that return a generator
    words_list.append(list(t.tokenize(word, wakati=True)))

with open("words_list.pickle", mode='wb') as f:
    pickle.dump(words_list, f)
Note 1) Unlike "Deep Learning from scratch ❷", this time we prepared the corpus ourselves, so we load it here. Also, the code reads "dokujo-tsushin-half.txt": I originally intended to read "dokujo-tsushin-all.txt", but a warning was issued that it could not be loaded because it was too large. I therefore gave up on "all" and used "half" (that is, in step ⓪ I concatenated only half of all the pre-prepared text files).
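If you follow the Python route from the supplement above, step ⓪ (building the "half" file) is just the same loop restricted to the first half of the file list. A rough sketch, assuming the article files sit under corpus/dokujo-tsushin/ and that the "first half" split is acceptable:
python
import glob

# gather the original article files (excluding the combined files made earlier)
paths = sorted(p for p in glob.glob("corpus/dokujo-tsushin/*.txt")
               if "dokujo-tsushin-half" not in p and "dokujo-tsushin-all" not in p)
half_paths = paths[:len(paths) // 2]  # keep only the first half

with open("corpus/dokujo-tsushin/dokujo-tsushin-half.txt", mode="w", encoding="utf-8") as out:
    for path in half_paths:
        with open(path, mode="r", encoding="utf-8") as f:
            out.write(f.read())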
python
with open('words_list.pickle', mode='rb') as f:
    words_list = pickle.load(f)

print(words_list)  #this print is only needed if you want to check the loaded result
# =>output
#[['friend', 'representative', 'of', 'speech', '、', 'Germany', 'woman', 'Is', 'How', 'Doing', 'hand', 'Is', '?', 'soon', 'June', '・', 'Bride', 'When', 'Call', 'To be', 'June', '。', ...]] ・ ・ ・ Omitted below
** ⑵ Assigning IDs to the words **
python
def preprocess(words_list):
    word_to_id = {}
    id_to_word = {}
    #<Point>
    for words in words_list:
        for word in words:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word

    corpus = [word_to_id[w] for words in words_list for w in words]

    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess(words_list)
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['woman']:", word_to_id['woman'])
print("word_to_id['marriage']:", word_to_id['marriage'])
print("word_to_id['husband']:", word_to_id['husband'])
# =>output
# corpus size: 328831
# corpus[:30]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 5, 6, 2, 22, 23, 7, 24, 2]
# id_to_word[0]:friend
# id_to_word[1]:representative
# id_to_word[2]:of
# word_to_id['woman']: 6
# word_to_id['marriage']: 456
# word_to_id['husband']: 1453
Point
- The preprocess function is basically the same as in the book, but since the text has already been split into words, that part is removed, and unlike the book the IDs are assigned inside a doubly nested for loop.
- The reason the for loop is nested twice is that, because of the word-segmentation step, the words are held in a nested list (a list of token lists), unlike in the book (see the toy example below).
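To make that concrete, here is a toy example (the sentences are my own, not taken from the corpus) showing why both the nested loop and the nested comprehension are needed: words_list is a list of token lists, not a flat list of words.
python
words_list = [['私', 'は', '犬', 'が', '好き'], ['犬', 'は', '可愛い']]  #a list of token lists

word_to_id, id_to_word = {}, {}
for words in words_list:      #outer loop: one token list at a time
    for word in words:        #inner loop: the tokens themselves
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

corpus = [word_to_id[w] for words in words_list for w in words]  #flatten into one ID sequence
print(corpus)  # => [0, 1, 2, 3, 4, 2, 1, 5]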
I will omit detailed explanations of the code below because it largely overlaps with the book, but some comments are included, so I hope you will find them helpful.
python
#Creating a co-occurrence matrix
def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1

            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix
#Judging similarity between vectors (cosine similarity)
def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)
#Ranking the similarity between vectors
def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
#Improving the word relevance measure with positive pointwise mutual information (PPMI)
def ppmi(C, verbose=False, eps=1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j] * S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total // 100) == 0:
                    print('%.1f%% done' % (100 * cnt / total))

    return M
window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = preprocess(words_list)
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    #Dimensionality reduction with SVD using sklearn (fast randomized SVD)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    #Fall back to NumPy's exact (and much slower) SVD if sklearn is unavailable
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['Female', 'marriage', 'he', 'Mote']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
# =>Below, the output result
"""
[query]Female
male: 0.6902421712875366
Etc.: 0.6339510679244995
model: 0.5287646055221558
generation: 0.5057054758071899
layer: 0.47833186388015747
[query]marriage
love: 0.5706729888916016
Dating: 0.5485040545463562
Opponent: 0.5481910705566406
?。: 0.5300850868225098
Ten: 0.4711574614048004
[query]he
Girlfriend: 0.7679144740104675
boyfriend: 0.67448890209198
husband: 0.6713247895240784
parent: 0.6373711824417114
Former: 0.6159241199493408
[query]Mote
Ru: 0.6267833709716797
Consideration: 0.5327887535095215
Twink: 0.5280393362045288
Girls: 0.5190156698226929
bicycle: 0.5139431953430176
"""