When you want to use a pretrained word2vec model for Japanese text processing, the article "List of ready-to-use word embedding vectors" shows that it is easy to download trained data and load it with gensim, roughly as sketched below. However, most of those models are trained on Wikipedia and similar sources and have vocabularies of only around 50,000 words, so when you tokenize an arbitrary sentence and try to vectorize it, the result is nearly useless because most tokens end up as unknown words.
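For reference, the usual gensim pattern looks roughly like this. This is only a sketch: the file name is a placeholder for whichever pretrained model you downloaded, and the words used are just examples.

python
from gensim.models import KeyedVectors

# Load a downloaded pretrained model (file name and binary/text format depend on the distribution)
model = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

print(model["日本"])  # works if the word is in the vocabulary; an out-of-vocabulary word raises KeyError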
For Japanese morphological analysis I had always reached for MeCab, but Works Applications now also provides Sudachi. On top of that, a Word2Vec model trained on text tokenized with Sudachi is available (see "Use a trained Word2Vec model based on Sudachi"): it was trained on the Kokugoken Japanese Web Corpus, which is about 10 billion words in scale, and its vocabulary contains roughly 3.6 million words. With that many words it looks very practical.
The model itself is just a text file listing each word followed by its vector, but it is huge: about 5 GB compressed and roughly 12 GB after decompression. That is no problem on a training machine with plenty of RAM, but loading 12 GB into memory can be difficult in a low-memory environment used only for inference.
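As a rough sanity check, using the 3.6 million words mentioned above and the 300-dimensional vectors this model provides, even holding just the raw vectors as float64 NumPy data already needs on the order of 8 GB:

python
words = 3_600_000      # vocabulary size of the Sudachi-trained model
dims = 300             # dimensionality of each vector
bytes_per_value = 8    # float64
print(words * dims * bytes_per_value / 1024**3)  # ≈ 8.0 GiB for the vectors alone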
However, if all you need is to vectorize individual words, you can read the required line from SSD (or HDD) each time instead of expanding the whole file into memory.
SudachiPy can be installed as described on GitHub: https://github.com/WorksApplications/SudachiPy. Depending on the environment, the sudachipy command may not be recognized in the terminal. To make sudachidict_full the default dictionary, run the set_default_dict_package function defined in config.py.
Go to the sudachipy directory where the package is installed, add the following to the end of config.py, and run it from the terminal.
config.py
# ~~~ existing contents of config.py, left unchanged ~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Added at the end: register sudachidict_full as the default dictionary
import sys

output = sys.stdout
set_default_dict_package('sudachidict_full', output)
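After that, you can check that SudachiPy and the full dictionary are picked up with a quick tokenization test. The sample sentence is arbitrary; the calls are the same ones used by the class shown later.

python
from sudachipy import dictionary, tokenizer

# With sudachidict_full registered as the default, Dictionary() needs no arguments
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
print([m.surface() for m in tokenizer_obj.tokenize("外国人参政権", mode)])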
By first recording the file offset of every line, as described in "[Python] Examination of how to efficiently read a specific line by specifying a line number from a CSV file with a large number of rows and columns", you can then look up the vector of any word quickly. A minimal sketch of the idea follows.
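The idea itself is small: while scanning the file once, record f.tell() before reading each line; later, seek() back to that offset and read only the one line you need. In this sketch, vectors.txt stands in for any file in the word2vec text format.

python
import csv

offsets = []
with open("vectors.txt") as f:
    f.readline()                 # skip the "<vocab size> <dimensions>" header line
    while True:
        pos = f.tell()
        if f.readline() == '':
            break
        offsets.append(pos)      # byte offset of each data line

# Jump straight to the i-th line without re-reading the whole file
with open("vectors.txt") as f:
    f.seek(offsets[2])
    row = next(csv.reader(f, delimiter=' '))
    print(row[0], row[1:4])      # the word and the first few vector components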
Download the trained data from "Japanese word distributed expression by large-scale corpus and multiple-granularity segmentation" and put it wherever you like.
The following class wraps everything up: it can turn a single word into its vector, or tokenize a sentence and return the vector of each token.
sudachi_w2v.py
import numpy as np
import pickle
import re
import csv
import os
from sudachipy import tokenizer
from sudachipy import dictionary


class Sudachi_w2v:
    def __init__(self, path, sudachiDataPath="sudachiData.pickle"):
        f = open(path, 'r')
        self.file = f
        self.reader = csv.reader(f, delimiter=' ')
        # First create the list of contained words and the list of file offsets (this takes a long time)
        # From the second time onward, the pickled result is loaded instead
        if os.path.exists(sudachiDataPath):
            with open(sudachiDataPath, 'rb') as pf:
                dataset = pickle.load(pf)
            self.offset_list = dataset["offset_list"]
            self.emb_size = dataset["emb_size"]
            self.word2index = dataset["word2index"]
            self.ave_vec = dataset["ave_vec"]
        else:
            txt = f.readline()
            # Number of dimensions of the distributed representation
            self.emb_size = int(txt.split()[1])
            # Average vector, returned when an unknown word appears
            self.ave_vec = np.zeros(self.emb_size, float)
            # List of file offsets, one per line
            self.offset_list = []
            word_list = []
            count = 0
            maxCount = int(txt.split()[0])
            while True:
                count += 1
                self.offset_list.append(f.tell())
                if count % 100000 == 0:
                    print(count, "/", maxCount)
                line = f.readline()
                if line == '':
                    break
                line_list = line.split()
                word_list.append(line_list[0])
                self.ave_vec += np.array(line_list[-self.emb_size:]).astype(float)
            # The last recorded offset points at EOF, so drop it
            self.offset_list.pop()
            # Average over the number of words actually read
            self.ave_vec = self.ave_vec / len(word_list)
            self.word2index = {v: k for k, v in enumerate(word_list)}
            dataset = {}
            dataset["offset_list"] = self.offset_list
            dataset["emb_size"] = self.emb_size
            dataset["word2index"] = self.word2index
            dataset["ave_vec"] = self.ave_vec
            with open(sudachiDataPath, 'wb') as pf:
                pickle.dump(dataset, pf)

        self.num_rows = len(self.offset_list)
        # Preparation of Sudachi
        self.tokenizer_obj = dictionary.Dictionary().create()
        self.mode = tokenizer.Tokenizer.SplitMode.C

    # Vectorize a single word
    def word2vec(self, word):
        try:
            idx = self.word2index[word]
            result = self.read_row(idx)
            vec = np.array(result[-self.emb_size:])
            return vec
        except KeyError:  # If the word is not in the word list
            print(word, ": out of wordlist")

    # Tokenize a sentence, then return the tokens and a matrix of their vectors
    def sentence2mat(self, sentence):
        words = sentence.replace("　", " ").replace("\n", " ")  # normalize full-width spaces and newlines
        words = re.sub(r"\s+", " ", words)
        input_seq = [m.surface().lower() for m in self.tokenizer_obj.tokenize(words, self.mode)]
        input_seq = [s for s in input_seq if s != ' ']

        mat = np.zeros((len(input_seq), self.emb_size))
        input_sentence = []
        for i, word in enumerate(input_seq):
            try:
                idx = self.word2index[word]
                result = self.read_row(idx)
                input_sentence.append(result[0])
                mat[i] = np.array(result[-self.emb_size:])
            except KeyError:  # Use the average vector if the word is not in the word list
                input_sentence.append("<UNK>")
                mat[i] = self.ave_vec
        return input_sentence, mat

    def __del__(self):
        self.file.close()

    def read_row(self, idx):
        # Seek straight to the stored offset and read only that line
        self.file.seek(self.offset_list[idx])
        return next(self.reader)
Usage is as follows. On the first run the class builds the vocabulary list and the list of file offsets, which takes quite a while (several tens of minutes). The result is saved as a pickle, so from the second run onward an instance can be created in a few seconds by loading that pickle.
python
path = '~Storage location of training data~/nwjc_sudachi_full_abc_w2v/nwjc.sudachi_full_abc_w2v.txt'
sudachi = Sudachi_w2v(path)
vec = sudachi.word2vec("Sudachi")
print(vec)
#['0.07975651' '0.08931299' '-0.06070593' '0.46959993' '0.19651023' ~
input_sentence, mat = sudachi.sentence2mat("あきらめたらそこで試合終了ですよ")
print(input_sentence, mat)
#(['あきらめ', 'たら', 'そこ', 'で', '試合終了', 'です', 'よ'], array([[ 1.9788130e-02, 1.1190426e-01, -1.6153505e-01, ...,
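Note that word2vec() returns the values as strings read straight from the file, so cast them to float before doing any arithmetic. For example, a cosine similarity between two words (the words here are just examples and are assumed to be in the vocabulary):

python
import numpy as np

v1 = sudachi.word2vec("東京").astype(np.float64)
v2 = sudachi.word2vec("大阪").astype(np.float64)
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity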
Because the Sudachi-trained data has such a large vocabulary, most words can be converted, which makes it very convenient to use.