This post summarizes Word2Vec using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare the corpus for Word2Vec with CNTK.
I will introduce the steps in the following order.
In natural language processing, the set of sentence data to be processed is called a corpus.
This time, I used my own light novel "Magical Children" as the Japanese corpus. This work is published on the novel posting site "Kakuyomu" operated by KADOKAWA.
I combined the text, excluding the title and subtitles, into a single piece of text data and saved it as MagicalChildren.txt.
The directory structure this time is as follows.
Word2Vec
├─ MagicalChildren.txt
├─ stop_words.pkl
└─ word2vec_corpus.py
In natural language processing, text data preprocessing is one of the key factors for better analysis.
The prepared raw text data often contains noise. Text data scraped from the web can include HTML and JavaScript code.
These can be removed using the Python standard module re and regular expressions.
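As a reference, a minimal cleaning function might look like the following. This is only a sketch: the specific patterns (HTML tags, URLs, line breaks) are my own assumptions and are not necessarily identical to the cleaning done in the published program.

```python
import re

def text_cleaning(text):
    """A minimal text-cleaning sketch: strip HTML tags, URLs, and extra whitespace."""
    text = re.sub(r"<[^>]*?>", "", text)                              # remove HTML tags
    text = re.sub(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+", "", text)   # remove URLs
    text = re.sub(r"[\r\n\u3000]+", " ", text)                        # collapse newlines and full-width spaces
    return text.strip()
```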
In natural language processing, the smallest unit that has meaning as a word is called a token. In English, tokenization is relatively easy because words are separated by half-width spaces, but in Japanese, word boundaries must be identified explicitly because sentences contain no separators.
The open-source morphological analysis engine [MeCab](http://taku910.github.io/mecab/) is a well-known way to split Japanese sentences into words. This time, I installed MeCab and used it for word splitting.
However, it is difficult to split recently coined words correctly with MeCab alone. Therefore, word splitting is performed with the NEologd dictionary, which also covers the latest coined words, as sketched below.
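For reference, word splitting with MeCab and the NEologd dictionary can be sketched as follows. The dictionary path is environment-dependent, and the small wrapper function here is an assumption; the actual program uses its own mecab.tokenize helper.

```python
import MeCab

# Path to the NEologd dictionary; this location is an assumption and varies by environment.
NEOLOGD_DIR = "/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"

tagger = MeCab.Tagger("-Owakati -d " + NEOLOGD_DIR)

def tokenize(sentence):
    """Split a Japanese sentence into a list of surface-form tokens."""
    return tagger.parse(sentence).strip().split(" ")

print(tokenize("魔法の子供たちは空を飛んだ。"))
```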
Normalization in natural language preprocessing means unifying the character notation. The processing here is as follows.

- Replace half-width katakana and numbers with full-width characters.
- Convert uppercase English letters to lowercase letters.
- Replace numbers with N.
- Replace low-frequency words with an unknown-word token.

Most of these can be handled with the Python standard module re and regular expressions.
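A minimal sketch of these normalization rules is shown below. Using unicodedata.normalize("NFKC", ...) for the half-width/full-width conversion and the NUM token name are my own assumptions; the digit pattern is the same one used later in word2vec_corpus.py.

```python
import re
import unicodedata

NUM = "N"  # token that replaces every number

def normalize(word):
    """A minimal normalization sketch based on the rules listed above."""
    word = unicodedata.normalize("NFKC", word)          # unify half-width/full-width characters
    word = word.lower()                                 # uppercase English letters -> lowercase
    word = re.sub(r"[0-9]+|[0-9].+[0-9]", NUM, word)    # numbers -> N
    return word
```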
A stop word is a word that is needed to form a sentence but has little meaning on its own.
In Japanese, these include adverbs, particles, conjunctions, auxiliary verbs, interjections, and so on.
Since this Word2Vec is not intended for sentence generation, such words are removed using a stop word set.
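The stop word set lives in stop_words.pkl in the directory above. A minimal sketch of loading it and filtering a token list could look like this; the exact contents of the pickle file are an assumption.

```python
import pickle

# Load the pre-built stop word set (particles, auxiliary verbs, interjections, etc.).
with open("./stop_words.pkl", "rb") as f:
    stop_words = set(pickle.load(f))

words = ["今日", "は", "いい", "天気", "です", "ね"]
words = [w for w in words if w not in stop_words]  # keep only non-stop words
print(words)
```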
After applying the above preprocessing to the prepared text data, each word is assigned an ID so that it can be handled on a computer, and a word dictionary is created. At this time, words with a low frequency of occurrence are replaced with an unknown-word token.
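As a rough sketch, building the word dictionary and mapping rare words to an unknown-word token might look like the following; the threshold, the token name, and the variable names are assumptions made for illustration.

```python
from collections import Counter

UNK = "<UNK>"   # assumed name of the unknown-word token
MIN_COUNT = 2   # assumed threshold; rarer words are replaced with the unknown-word token

# tokenized_list is the list of tokenized sentences produced by the preprocessing above
counter = Counter(w for sentence in tokenized_list for w in sentence)
vocab = [w for w, c in counter.most_common() if c >= MIN_COUNT]

word2id = {w: i for i, w in enumerate([UNK] + vocab)}
id2word = {i: w for w, i in word2id.items()}

# Convert every sentence into word IDs, mapping low-frequency words to the unknown token.
corpus_ids = [[word2id.get(w, word2id[UNK]) for w in sentence] for sentence in tokenized_list]
```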
Since the word distribution will be used during training, the unigram word distribution is saved as a NumPy file based on the number of occurrences of the words included in the created word dictionary.
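A sketch of saving the unigram word distribution from the occurrence counts is shown below. The file name follows the sampling_weights.npy that appears in the program output later; whether the counts are further reweighted (for example raised to the 3/4 power, as in the original Word2Vec negative sampling) is not covered here.

```python
import numpy as np

# Occurrence count for every word ID in the dictionary (words not counted fall back to 0).
counts = np.array([counter.get(id2word[i], 0) for i in range(len(word2id))], dtype=np.float64)

sampling_weights = counts / counts.sum()   # unigram word distribution
np.save("./sampling_weights.npy", sampling_weights)
```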
During training, we will use CTFDeserializer, one of the built-in readers specialized for text files. For details on CTFDeserializer, see Computer Vision : Image Caption Part1 - STAIR Captions and Computer Vision : Image Caption Part2 - Neural Image Caption System.
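For reference, a minibatch source that reads the skip-gram text file created below could be set up roughly as follows; the stream names match the |word and |target labels, and num_word is the vocabulary size. This is a sketch of the usual CNTK pattern, not necessarily the exact reader used in Part 2.

```python
from cntk.io import CTFDeserializer, MinibatchSource, StreamDef, StreamDefs, INFINITELY_REPEAT

num_word = 3369  # vocabulary size of the word dictionary

def create_reader(path, is_train):
    """Create a minibatch source for the sparse |word and |target streams."""
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        words=StreamDef(field="word", shape=num_word, is_sparse=True),
        targets=StreamDef(field="target", shape=num_word, is_sparse=True))),
        randomize=is_train, max_sweeps=INFINITELY_REPEAT if is_train else 1)
```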
The general process flow of the program preparing for Word2Vec is as follows.
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- OS: Windows 10 Pro 1909
- Python 3.6.6
- MeCab 0.996
- NumPy 1.17.3
The implemented program is published on GitHub.
word2vec_corpus.py
I will pick out some parts of the program to be executed and explain them.
Here are the steps to install MeCab on Windows 10. First, use pip to install MeCab.
> pip install MeCab
However, MeCab cannot be executed by this alone, so download mecab-0.996.exe from the official website below and execute it.
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
Select UTF-8 as the character code.
First, install and launch your favorite Linux distribution from the Microsoft Store.
If you get the error WslRegisterDistribution failed with error 0x8007019e, the Windows Subsystem for Linux is not enabled. From the Control Panel, open Programs and Features -> Turn Windows features on or off, check Windows Subsystem for Linux, and then restart.
If Linux starts without any problems, set a user name and password and type the following commands.
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install make automake autoconf autotools-dev m4 mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file
$ sudo sed -i -e 's%/lib/mecab/dic%/share/mecab/dic%' /usr/bin/mecab-config
After that, type the commands below to prepare and install mecab-ipadic-NEologd. If everything looks fine, type yes at the final confirmation and press Enter.
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n
...
yes
Now you can use a dictionary that supports the latest coined words.
Text cleaning, word splitting, word normalization, and stop word removal are performed on the loaded text file.
word2vec_corpus.py
tokenized_list = []
for t in text:
    t = text_cleaning(t)  # text cleaning
    word_list = mecab.tokenize(t)  # word tokenization
    word_list = [re.sub(r"[0-9]+|[0-9].+[0-9]", NUM, word) for word in word_list]  # word normalization
    word_list = [w for w in word_list if w not in stop_words]  # remove stop words
    tokenized_list.append(word_list)
To create the word dictionary, I reuse the function from Computer Vision : Image Caption Part1 - STAIR Captions.
The Skip-gram trained this time considers the 5 words before and after the center word, so the text file read by CTFDeserializer looks like this:
skipgram_corpus.txt
|word 254:1 |target 982:1
|word 254:1 |target 3368:1
|word 254:1 |target 2178:1
|word 254:1 |target 3368:1
|word 254:1 |target 2179:1
|word 254:1 |target 545:1
|word 254:1 |target 2180:1
|word 254:1 |target 3368:1
|word 254:1 |target 2181:1
|word 254:1 |target 254:1
|word 545:1 |target 3368:1
|word 545:1 |target 2178:1
|word 545:1 |target 3368:1
|word 545:1 |target 2179:1
|word 545:1 |target 254:1
|word 545:1 |target 2180:1
|word 545:1 |target 3368:1
|word 545:1 |target 2181:1
|word 545:1 |target 254:1
|word 545:1 |target 169:1
...
There is a one-to-one correspondence between input and output, and there are 10 target words for each input word.
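As a rough sketch under the same assumptions as above (an ID-converted corpus corpus_ids), a file in this format could be generated like this. The actual code in word2vec_corpus.py may differ in details; in particular, words near sentence boundaries get fewer than 10 targets in this sketch.

```python
WINDOW = 5  # 5 words before and after the center word

num_samples = 0
with open("./skipgram_corpus.txt", "w") as ctf:
    for sentence in corpus_ids:
        for i, center in enumerate(sentence):
            # every word inside the window becomes one target line for the center word
            for j in range(max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)):
                if i == j:
                    continue
                ctf.write("|word {}:1\t|target {}:1\n".format(center, sentence[j]))
                num_samples += 1

print("Number of samples", num_samples)
```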
When the program is executed, the word dictionary is created and the word distribution is saved as follows.
Number of total words: 6786
Number of words: 3369
Saved word2id.
Saved id2word.
Saved unigram distribution as sampling_weights.npy
Skip-gram
Now 10000 samples...
Now 20000 samples…
...
Now 310000 samples...
Number of samples 310000
Once the word dictionary has been created, the word distribution saved, and the file to be loaded by the built-in reader generated, you are ready to train. Part 2 will use CNTK to train the Word2Vec Skip-gram model.
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
neologd/mecab-ipadic-neologd
Computer Vision : Image Caption Part1 - STAIR Captions
Computer Vision : Image Caption Part2 - Neural Image Caption System