This series summarizes document classification using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we will prepare for document classification using CNTK.
I will introduce the steps in the following order.
livedoor news corpus
・ Dokujo Tsushin
・ IT Life Hack
・ Kaden Channel
・ livedoor HOMME
・ MOVIE ENTER
・ Peachy
・ S-MAX
・ Sports Watch
・ Topic News
This is a corpus consisting of articles from these 9 categories. Each article file is covered by a Creative Commons license that requires attribution and prohibits modification (Attribution-NoDerivatives).
Access the page above, then download and extract ldcc-20140209.tar.gz.
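If you prefer to extract the archive from Python instead of by hand, a minimal sketch using the standard tarfile module could look like the following; the file locations are assumptions, so adjust them to your own layout.

import tarfile

# Minimal sketch: extract the downloaded archive so that the text/
# directory, with one folder per category, appears in the working directory.
# Both paths are assumptions; adjust them to your environment.
with tarfile.open("./ldcc-20140209.tar.gz", "r:gz") as tar:
    tar.extractall("./")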
The directory structure this time is as follows.
Doc2Vec
 |―text
     |―...
 doc2vec_corpus.py
Word2Vec
The text data preprocessing reuses the functions implemented in Natural Language : Word2Vec Part1 - Japanese Corpus.
For word splitting, we use MeCab with the NEologd dictionary, and we also remove stop words.
For evaluating model performance, 10 documents from each category are set aside as validation data.
This time, following Computer Vision : Image Caption Part1 - STAIR Captions, words that appeared only once were replaced with UNK.
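As a rough sketch of this preprocessing (not the actual functions reused from the Word2Vec article), the tokenization and rare-word replacement might look like the code below. The NEologd dictionary path, the stop word list, and the function names are assumptions.

import collections
import MeCab

# Sketch only: tokenize with MeCab + NEologd, drop stop words, then
# replace words that appear only once in the whole corpus with UNK.
# The dictionary path and the stop word list are assumptions.
tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")
stop_words = {"の", "は", "が", "を", "に", "と", "で", "た", "し", "て"}

def tokenize(text):
    # Split a sentence into surface forms and remove stop words.
    return [w for w in tagger.parse(text).strip().split() if w not in stop_words]

def replace_rare_words(documents):
    # documents: list of word lists. Words seen only once become UNK.
    counts = collections.Counter(w for doc in documents for w in doc)
    return [[w if counts[w] > 1 else "UNK" for w in doc] for doc in documents]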
During training, we will use CTFDeserializer, one of CNTK's built-in readers. This time, one category label is assigned to each document consisting of many words.
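As a rough idea of how such a reader is defined, a sketch might look like the one below. The function name and the num_word / num_label arguments are assumptions (the vocabulary size and the 9 categories); the word and label stream names match the fields of the text file shown later.

from cntk.io import CTFDeserializer, MinibatchSource, StreamDef, StreamDefs, INFINITELY_REPEAT

# Sketch of a minibatch source built on CTFDeserializer.
# num_word: vocabulary size (including UNK), num_label: number of categories.
def create_reader(path, is_train, num_word, num_label):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        words=StreamDef(field="word", shape=num_word, is_sparse=True),
        labels=StreamDef(field="label", shape=num_label, is_sparse=True))),
        randomize=is_train,
        max_sweeps=INFINITELY_REPEAT if is_train else 1)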
The general processing flow of the program that prepares for Doc2Vec is as follows.

1. Preparation of the livedoor news corpus
2. Text data preprocessing and creation of the word dictionaries
3. Creation of the text file to be read by CNTK's built-in reader
The execution environment is as follows.

・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz
・ Windows 10 Pro 1909
・ Python 3.6.6
・ MeCab 0.996
The implemented program is published on GitHub.
doc2vec_corpus.py
I will extract some parts of the program to be executed and explain them.
The contents of the text file read by CTFDeserializer for this training are as follows.
0 |word 346:1 |label 0:1
0 |word 535:1
0 |word 6880:1
...
1 |word 209:1 |label 0:1
1 |word 21218:1
1 |word 6301:1
...
The number on the far left is the sequence ID representing one document; a single category label (|label) is assigned to each document consisting of many words (|word).
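For illustration, one document could be written in this format with a helper like the one below. The function name is hypothetical; the word and label IDs come from the dictionaries created by the program.

# Hypothetical helper: write one document as CTF lines.
# The sequence ID is repeated on every line, and the |label field
# appears only on the first line of the document.
def write_document(ctf_file, seq_id, word_ids, label_id):
    for i, word_id in enumerate(word_ids):
        if i == 0:
            ctf_file.write("%d |word %d:1 |label %d:1\n" % (seq_id, word_id, label_id))
        else:
            ctf_file.write("%d |word %d:1\n" % (seq_id, word_id))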
When you run the program, the word dictionary is created and saved as follows.
Number of total words: 73794
Number of words: 45044
Saved word2id.
Saved id2word.
Now 1000 samples...
Now 2000 samples...
...
Now 7000 samples...
Number of training samples 7277
Number of validation samples 90
Now that the preparation for training is complete, Part 2 will use CNTK to train Doc2Vec.
Computer Vision : Image Caption Part1 - STAIR Captions
Natural Language : Word2Vec Part1 - Japanese Corpus