Natural Language: Word2Vec Part1 --Japanese Corpus

Target

This series summarizes Word2Vec using the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare for Word2Vec using CNTK.

The topics are covered in the following order.

  1. Preparation of Japanese corpus
  2. Text data preprocessing
  3. Creating a word dictionary, word distribution, corpus
  4. Creating a file to be read by the built-in reader provided by CNTK

Introduction

Preparation of Japanese corpus

In natural language processing, the sentence data set to be processed is called a corpus.

This time, I used my own light novel "Magical Children" as a Japanese corpus. This work is published on the novel posting site "Kakuyomu" operated by KADOKAWA.

The body text, excluding the title and subtitles, is combined into a single text file and saved as MagicalChildren.txt.

The directory structure this time is as follows.

Word2Vec
  |-- MagicalChildren.txt
  |-- stop_words.pkl
  |-- word2vec_corpus.py

Text data preprocessing

In natural language processing, text data preprocessing is one of the key factors for better analysis.

Text cleaning

The prepared raw text data often contains noise. Text data scraped from the web can include HTML and JavaScript code.

These can be removed using the Python standard module re and regular expressions.
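As an illustration, a cleaning function along the following lines can be written with re; the specific patterns (HTML-like tags, URLs, whitespace) are assumptions, and the actual text_cleaning in word2vec_corpus.py may differ.

import re

def text_cleaning(text):
    text = re.sub(r"<[^>]*?>", "", text)  # remove HTML-like tags (assumed pattern)
    text = re.sub(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+", "", text)  # remove URLs
    text = re.sub(r"[\r\n\u3000\t ]+", " ", text)  # collapse whitespace, including full-width spaces
    return text.strip()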

Word split

In natural language processing, the smallest meaningful unit of text is called a token. English is relatively easy to tokenize because words are separated by spaces, but Japanese has no word delimiters, so word boundaries must be identified explicitly.

The open source morphological analysis engine [MeCab](http://taku910.github.io/mecab/) is a well-known tool for splitting Japanese sentences into words. This time, I installed MeCab to perform the word segmentation.

However, recently coined words are difficult to segment correctly with MeCab's default dictionary alone. Therefore, word segmentation is performed with the NEologd dictionary, which also covers the latest coined words.

Word normalization

In natural language preprocessing, normalization means unifying character notation. The processing performed here is as follows.

・Convert half-width katakana and digits to full-width characters.
・Convert uppercase English letters to lowercase.
・Replace numbers with N.
・Replace low-frequency words with an unknown-word token.

Most of these can be solved by using the Python standard module re and regular expressions.
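A minimal sketch of such normalization with the standard modules unicodedata and re is shown below; NFKC normalization unifies half-width and full-width notation, and the exact rules used in word2vec_corpus.py may differ.

import re
import unicodedata

NUM = "N"  # token that replaces numbers (assumed to match the NUM constant used later)

def normalize(word):
    word = unicodedata.normalize("NFKC", word)  # unify half-width/full-width notation (katakana, digits, symbols)
    word = word.lower()                         # uppercase -> lowercase
    word = re.sub(r"[0-9]+", NUM, word)         # replace digit sequences with N
    return word

print(normalize("ﾃﾞｰﾀ 100 ABC"))  # -> "データ N abc"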

Stop word removal

A stop word is a word that is needed to form a sentence but carries little meaning on its own.

In Japanese, adverbs, particles, conjunctions, auxiliary verbs, interjections, etc. are applicable.

Since this Word2Vec is not intended for sentence generation, stop words are removed using a predefined stop word set.
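A minimal sketch of this step, assuming stop_words.pkl from the directory above holds a pickled list or set of stop words:

import pickle

with open("./stop_words.pkl", "rb") as f:
    stop_words = set(pickle.load(f))

word_list = ["今日", "は", "とても", "いい", "天気", "です"]  # toy tokenized sentence
word_list = [w for w in word_list if w not in stop_words]  # drop stop words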

Creating a word dictionary, word distribution, corpus

After the preprocessing is applied to the prepared text data, each word is assigned an ID so that it can be handled on a computer, and a word dictionary is created. At this time, words with a low frequency of occurrence are replaced with an unknown-word token.

Since the word distribution will be used during training, the unigram word distribution computed from the occurrence counts of the words in the created dictionary is saved as a NumPy file.
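The following sketch shows one way this step can look; the unknown-word token, the frequency threshold, and the dictionary file names are assumptions, while sampling_weights.npy matches the output shown later.

import collections
import pickle

import numpy as np

UNK = "<UNK>"  # unknown-word token (assumption)
MIN_COUNT = 2  # frequency threshold for unknown words (assumption)

tokenized_list = [["猫", "が", "好き"], ["犬", "も", "好き"]]  # toy preprocessed corpus

counter = collections.Counter(w for sentence in tokenized_list for w in sentence)

vocab = [w for w, c in counter.most_common() if c >= MIN_COUNT]  # keep frequent words
word2id = {w: i for i, w in enumerate([UNK] + vocab)}
id2word = {i: w for w, i in word2id.items()}

with open("./word2id.pkl", "wb") as f:  # file names are assumptions
    pickle.dump(word2id, f)
with open("./id2word.pkl", "wb") as f:
    pickle.dump(id2word, f)

# unigram distribution over the vocabulary, used later during training
counts = np.array([counter[w] for w in word2id], dtype=np.float32)
sampling_weights = counts / counts.sum()
np.save("./sampling_weights.npy", sampling_weights)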

Creating a file to be read by the built-in reader provided by CNTK

During this training, we will use CTFDeserializer, which is one of the built-in readers specializing in text files. For CTFDeserializer, see Computer Vision: Image Caption Part1 --STAIR Captions and Computer Vision: Image Caption Part2 --Neural Image Caption System.
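As a preview of how such a file is consumed (the actual training script appears in Part 2), a CTF file with |word and |target streams can be read with CTFDeserializer roughly as follows; the function name and the sparse one-hot stream shapes are assumptions, and the vocabulary size is taken from the result below.

from cntk.io import CTFDeserializer, INFINITELY_REPEAT, MinibatchSource, StreamDef, StreamDefs

num_word = 3369  # vocabulary size

def create_reader(path, is_train):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        words=StreamDef(field="word", shape=num_word, is_sparse=True),
        targets=StreamDef(field="target", shape=num_word, is_sparse=True))),
        randomize=is_train, max_sweeps=INFINITELY_REPEAT if is_train else 1)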

The general processing flow of the program that prepares the data for Word2Vec is as follows.

  1. Preprocessing of text data
  2. Creating a word dictionary, word distribution, corpus
  3. Writing input words and predicted words

Implementation

Execution environment

Hardware

・CPU Intel(R) Core(TM) i7-6700K 4.00GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・MeCab 0.996
・NumPy 1.17.3

Program to run

The implemented program is published on GitHub.

word2vec_corpus.py


Commentary

I will excerpt and explain some parts of the program to be executed.

Install MeCab

Here are the steps to install MeCab on Windows 10. First, use pip to install MeCab.

> pip install MeCab

However, MeCab cannot be executed by this alone, so download mecab-0.996.exe from the official website below and execute it.

MeCab: Yet Another Part-of-Speech and Morphological Analyzer

Select UTF-8 as the character code.

Install NEologd dictionary

First, install and launch your favorite Linux distribution from the Microsoft Store.

If you get the error WslRegisterDistribution failed with error 0x8007019e, the Windows Subsystem for Linux is not enabled. From the Control Panel, open Programs and Features -> Turn Windows features on or off, check Windows Subsystem for Linux, and restart.

If Linux starts without any problems, set the user name and password and type the following commands.

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install make automake autoconf autotools-dev m4 mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file
$ sudo sed -i -e 's%/lib/mecab/dic%/share/mecab/dic%' /usr/bin/mecab-config

After that, follow the installation instructions for mecab-ipadic-NEologd and type the commands below. When the confirmation prompt appears at the end and you are happy with it, type yes and press Enter.

$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n
...
yes

Now you can use a dictionary that supports the latest coined words.
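For reference, here is a minimal sketch of calling MeCab with the NEologd dictionary from Python; the dictionary path is only an example and depends on your environment (it can be located with mecab-config --dicdir), and this is not necessarily the wrapper used in word2vec_corpus.py.

import MeCab

# The dictionary path is environment-dependent; on the install above it can be found with
#   echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")

def tokenize(text):
    return tagger.parse(text).split()  # space-separated segmentation -> list of tokens

print(tokenize("メロスは激怒した。"))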

Text data preprocessing

Text cleaning, word splitting, word normalization, and stop word removal are performed on the loaded text file.

word2vec_corpus.py


tokenized_list = []
for t in text:
    t = text_cleaning(t)  # text cleaning

    word_list = mecab.tokenize(t)  # word segmentation with MeCab

    word_list = [re.sub(r"[0-9]+|[0-9].+[0-9]", NUM, word) for word in word_list]  # word normalization: replace numbers with NUM

    word_list = [w for w in word_list if w not in stop_words]  # remove stop words without mutating the list during iteration

    tokenized_list.append(word_list)

To create the word dictionary, the function from Computer Vision: Image Caption Part1 --STAIR Captions is reused.

Skip-gram input and output

The Skip-gram model trained this time uses a window of 5 words before and after the center word, so the file for CTFDeserializer looks like this:

skipgram_corpus.txt


|word 254:1	|target 982:1
|word 254:1	|target 3368:1
|word 254:1	|target 2178:1
|word 254:1	|target 3368:1
|word 254:1	|target 2179:1
|word 254:1	|target 545:1
|word 254:1	|target 2180:1
|word 254:1	|target 3368:1
|word 254:1	|target 2181:1
|word 254:1	|target 254:1
|word 545:1	|target 3368:1
|word 545:1	|target 2178:1
|word 545:1	|target 3368:1
|word 545:1	|target 2179:1
|word 545:1	|target 254:1
|word 545:1	|target 2180:1
|word 545:1	|target 3368:1
|word 545:1	|target 2181:1
|word 545:1	|target 254:1
|word 545:1	|target 169:1
...

Each line is a single input-target pair, so each input word appears with 10 target words (5 before and 5 after).
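One plausible way to generate such a file is to slide the window over each ID-encoded sentence and write one pair per line; the sketch below illustrates that idea and is not necessarily how word2vec_corpus.py implements it.

window_size = 5  # 5 words before and after the center word

corpus = [[254, 545, 982, 3368, 2178, 3368, 2179]]  # toy example: sentences as lists of word IDs

with open("./skipgram_corpus.txt", "w") as f:
    for sentence in corpus:
        for i, center in enumerate(sentence):
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if j != i:  # every word in the window except the center word becomes a target
                    f.write("|word {}:1\t|target {}:1\n".format(center, sentence[j]))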

Result

When the program is executed, the word dictionary is created and the word distribution is saved as follows.

Number of total words: 6786
Number of words: 3369

Saved word2id.
Saved id2word.

Saved unigram distribution as sampling_weights.npy

Skip-gram

Now 10000 samples...
Now 20000 samples…
...
Now 310000 samples...

Number of samples 310000

Once the word dictionary has been created, the word distribution saved, and the file for the built-in reader written, the preparation for training is complete. Part 2 will use CNTK to train the Word2Vec Skip-gram model.

Reference

MeCab: Yet Another Part-of-Speech and Morphological Analyzer
neologd/mecab-ipadic-neologd

Computer Vision: Image Caption Part1 --STAIR Captions
Computer Vision: Image Caption Part2 --Neural Image Caption System
