When you want to use a pretrained word2vec model for Japanese text processing, the article "List of ready-to-use word embedding vectors" shows that it is easy to download trained data and load it with gensim, roughly as sketched below. However, most of those models are trained on Wikipedia and similar sources and have vocabularies of only around 50,000 words, so when you tokenize an arbitrary sentence and try to vectorize it, the result is nearly useless because most tokens end up as unknown words.
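For reference, the usual gensim pattern looks roughly like this. This is only a sketch: the file name is a placeholder for whichever pretrained model you downloaded, and the words used are just examples.

python
from gensim.models import KeyedVectors

# Load a downloaded pretrained model (file name and binary/text format depend on the distribution)
model = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

print(model["日本"])  # works if the word is in the vocabulary; an out-of-vocabulary word raises KeyError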
For Japanese morphological analysis I had always reached for MeCab, but Works Applications now also provides Sudachi. On top of that, a Word2Vec model trained on text tokenized with Sudachi is available (see "Use a trained Word2Vec model based on Sudachi"): it was trained on the Kokugoken Japanese Web Corpus, which is about 10 billion words in scale, and its vocabulary contains roughly 3.6 million words. With that many words it looks very practical.
The model itself is just a text file listing each word followed by its vector, but it is huge: about 5 GB compressed and roughly 12 GB after decompression. That is no problem on a training machine with plenty of RAM, but loading 12 GB into memory can be difficult in a low-memory environment used only for inference.
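As a rough sanity check, using the 3.6 million words mentioned above and the 300-dimensional vectors this model provides, even holding just the raw vectors as float64 NumPy data already needs on the order of 8 GB:

python
words = 3_600_000      # vocabulary size of the Sudachi-trained model
dims = 300             # dimensionality of each vector
bytes_per_value = 8    # float64
print(words * dims * bytes_per_value / 1024**3)  # ≈ 8.0 GiB for the vectors alone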
However, if all you need is to vectorize individual words, you can read the required line from SSD (or HDD) each time instead of expanding the whole file into memory.
SudachiPy can be installed as described on GitHub: https://github.com/WorksApplications/SudachiPy. Depending on the environment, the sudachipy command may not be recognized in the terminal. To make sudachidict_full the default dictionary, run the set_default_dict_package function defined in config.py.
Go to the sudachipy directory where the package is installed, add the following to the end of config.py, and run it from the terminal.
config.py
# ~~~ existing contents of config.py, left unchanged ~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Added at the end: register sudachidict_full as the default dictionary
import sys

output = sys.stdout
set_default_dict_package('sudachidict_full', output)
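After that, you can check that SudachiPy and the full dictionary are picked up with a quick tokenization test. The sample sentence is arbitrary; the calls are the same ones used by the class shown later.

python
from sudachipy import dictionary, tokenizer

# With sudachidict_full registered as the default, Dictionary() needs no arguments
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
print([m.surface() for m in tokenizer_obj.tokenize("外国人参政権", mode)])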
By first recording the file offset of every line, as described in "[Python] Examination of how to efficiently read a specific line by specifying a line number from a CSV file with a large number of rows and columns", you can then look up the vector of any word quickly. A minimal sketch of the idea follows.
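The idea itself is small: while scanning the file once, record f.tell() before reading each line; later, seek() back to that offset and read only the one line you need. In this sketch, vectors.txt stands in for any file in the word2vec text format.

python
import csv

offsets = []
with open("vectors.txt") as f:
    f.readline()                 # skip the "<vocab size> <dimensions>" header line
    while True:
        pos = f.tell()
        if f.readline() == '':
            break
        offsets.append(pos)      # byte offset of each data line

# Jump straight to the i-th line without re-reading the whole file
with open("vectors.txt") as f:
    f.seek(offsets[2])
    row = next(csv.reader(f, delimiter=' '))
    print(row[0], row[1:4])      # the word and the first few vector components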
Download the trained data from "Japanese word distributed expression by large-scale corpus and multiple-granularity segmentation" and put it wherever you like.
The following class wraps everything up: it can turn a single word into its vector, or tokenize a sentence and return the vector of each token.
sudachi_w2v.py
import numpy as np
import pickle
import re
import csv
import os
from sudachipy import tokenizer
from sudachipy import dictionary


class Sudachi_w2v:
    def __init__(self, path, sudachiDataPath="sudachiData.pickle"):
        f = open(path, 'r')
        self.file = f
        self.reader = csv.reader(f, delimiter=' ')
        # First create the list of contained words and the list of file offsets (this takes a long time)
        # From the second time onward, the pickled result is loaded instead
        if os.path.exists(sudachiDataPath):
            with open(sudachiDataPath, 'rb') as pf:
                dataset = pickle.load(pf)
            self.offset_list = dataset["offset_list"]
            self.emb_size = dataset["emb_size"]
            self.word2index = dataset["word2index"]
            self.ave_vec = dataset["ave_vec"]
        else:
            txt = f.readline()
            # Number of dimensions of the distributed representation
            self.emb_size = int(txt.split()[1])
            # Average vector, returned when an unknown word appears
            self.ave_vec = np.zeros(self.emb_size, float)
            # List of file offsets, one per line
            self.offset_list = []
            word_list = []
            count = 0
            maxCount = int(txt.split()[0])
            while True:
                count += 1
                self.offset_list.append(f.tell())
                if count % 100000 == 0:
                    print(count, "/", maxCount)
                line = f.readline()
                if line == '':
                    break
                line_list = line.split()
                word_list.append(line_list[0])
                self.ave_vec += np.array(line_list[-self.emb_size:]).astype(float)
            # The last recorded offset points at EOF, so drop it
            self.offset_list.pop()
            # Average over the number of words actually read
            self.ave_vec = self.ave_vec / len(word_list)
            self.word2index = {v: k for k, v in enumerate(word_list)}
            dataset = {}
            dataset["offset_list"] = self.offset_list
            dataset["emb_size"] = self.emb_size
            dataset["word2index"] = self.word2index
            dataset["ave_vec"] = self.ave_vec
            with open(sudachiDataPath, 'wb') as pf:
                pickle.dump(dataset, pf)

        self.num_rows = len(self.offset_list)
        # Preparation of Sudachi
        self.tokenizer_obj = dictionary.Dictionary().create()
        self.mode = tokenizer.Tokenizer.SplitMode.C

    # Vectorize a single word
    def word2vec(self, word):
        try:
            idx = self.word2index[word]
            result = self.read_row(idx)
            vec = np.array(result[-self.emb_size:])
            return vec
        except KeyError:  # If the word is not in the word list
            print(word, ": out of wordlist")

    # Tokenize a sentence, then return the tokens and a matrix of their vectors
    def sentence2mat(self, sentence):
        words = sentence.replace("　", " ").replace("\n", " ")  # normalize full-width spaces and newlines
        words = re.sub(r"\s+", " ", words)
        input_seq = [m.surface().lower() for m in self.tokenizer_obj.tokenize(words, self.mode)]
        input_seq = [s for s in input_seq if s != ' ']

        mat = np.zeros((len(input_seq), self.emb_size))
        input_sentence = []
        for i, word in enumerate(input_seq):
            try:
                idx = self.word2index[word]
                result = self.read_row(idx)
                input_sentence.append(result[0])
                mat[i] = np.array(result[-self.emb_size:])
            except KeyError:  # Use the average vector if the word is not in the word list
                input_sentence.append("<UNK>")
                mat[i] = self.ave_vec
        return input_sentence, mat

    def __del__(self):
        self.file.close()

    def read_row(self, idx):
        # Seek straight to the stored offset and read only that line
        self.file.seek(self.offset_list[idx])
        return next(self.reader)
Usage is as follows. On the first run the class builds the vocabulary list and the list of file offsets, which takes quite a while (several tens of minutes). The result is saved as a pickle, so from the second run onward an instance can be created in a few seconds by loading that pickle.
python
path = '~Storage location of training data~/nwjc_sudachi_full_abc_w2v/nwjc.sudachi_full_abc_w2v.txt'
sudachi = Sudachi_w2v(path)
vec = sudachi.word2vec("Sudachi")
print(vec)
#['0.07975651' '0.08931299' '-0.06070593' '0.46959993' '0.19651023' ~
input_sentence, mat = sudachi.sentence2mat("あきらめたらそこで試合終了ですよ")
print(input_sentence, mat)
#(['あきらめ', 'たら', 'そこ', 'で', '試合終了', 'です', 'よ'], array([[ 1.9788130e-02, 1.1190426e-01, -1.6153505e-01, ...,
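Note that word2vec() returns the values as strings read straight from the file, so cast them to float before doing any arithmetic. For example, a cosine similarity between two words (the words here are just examples and are assumed to be in the vocabulary):

python
import numpy as np

v1 = sudachi.word2vec("東京").astype(np.float64)
v2 = sudachi.word2vec("大阪").astype(np.float64)
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity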
Because the Sudachi-trained data has such a large vocabulary, most words can be converted, which makes it very convenient to use.