This series covers BERT using the Microsoft Cognitive Toolkit (CNTK).
Part 1 prepares the data for BERT pre-training.
I will introduce the steps in the following order.
This time, we use Japanese Wikipedia as the Japanese corpus.
Download jawiki-latest-pages-articles-multistream.xml.bz2 from the link above, then use wikiextractor to strip the wiki markup.
$ python ./wikiextractor-master/WikiExtractor.py ./jawiki/jawiki-latest-pages-articles-multistream.xml.bz2 -o ./jawiki -b 500M
The directory structure for this part is as follows.
BERT
 |―jawiki
 |   jawiki-latest-pages-articles-multistream.xml.bz2
 |―wikiextractor-master
 |   WikiExtractor.py
 |   ...
 bert_corpus.py
Doc2Vec
NMTT
STSA
Word2Vec
In addition to the preprocessing implemented so far in this series, we also normalize the notation of brackets and punctuation marks and delete spaces between kana and kanji.
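As a rough illustration, this extra normalization might look like the sketch below. The concrete rules and the function name are assumptions for illustration, not the published implementation.

```python
import re
import unicodedata

def normalize(text):
    # A minimal sketch of the extra preprocessing (assumed rules, not the published code).
    # Unify character width (full-width digits, Latin letters, brackets, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Normalize bracket notation: treat 『 』 the same as 「 」 (assumed rule).
    text = text.replace("『", "「").replace("』", "」")
    # Delete spaces sandwiched between kana and kanji.
    kana_kanji = "ぁ-んァ-ヶ一-龥"
    text = re.sub(rf"(?<=[{kana_kanji}])[ \u3000]+(?=[{kana_kanji}])", "", text)
    return text

print(normalize("これ は テスト です。"))  # -> これはテストです。
```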
For word segmentation, we create a subword model with SentencePiece [1]. In addition, [CLS], [SEP], and [MASK] are defined as special tokens.
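A minimal sketch of training the subword model is shown below. The input file name, vocabulary size, and character coverage are assumptions; the model prefix matches the jawiki.model / jawiki.vocab files mentioned later.

```python
import sentencepiece as spm

# Train the subword model on the preprocessed corpus (file name is hypothetical).
spm.SentencePieceTrainer.Train(
    "--input=./jawiki/jawiki.txt "
    "--model_prefix=./jawiki/jawiki "              # produces jawiki.model and jawiki.vocab
    "--vocab_size=32000 "                          # assumed vocabulary size
    "--character_coverage=0.9995 "                 # assumed coverage for Japanese
    "--user_defined_symbols=[CLS],[SEP],[MASK]"    # register the special tokens
)

sp = spm.SentencePieceProcessor()
sp.Load("./jawiki/jawiki.model")
print(sp.EncodeAsPieces("日本語の文を分割します。"))
```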
In BERT [2] pre-training, the language model is trained in an unsupervised fashion by masking tokens in the sentences of the corpus, so we create training data for that purpose.
For the Masked Language Model, 15% of the tokens in each sequence are selected; a selected token is replaced with the special token [MASK] with 80% probability, replaced with a random token with 10% probability, and left unchanged with the remaining 10% probability.
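A minimal sketch of this 80/10/10 masking rule, assuming the tokens have already been converted to vocabulary ids, might look like this:

```python
import numpy as np

def mask_tokens(token_ids, mask_id, num_vocab, rng=np.random):
    # Sketch of the Masked Language Model corruption; the id layout is an assumption.
    token_ids = list(token_ids)
    labels = [-1] * len(token_ids)                        # -1 marks positions that are not predicted
    num_mask = max(1, int(round(len(token_ids) * 0.15)))  # select 15% of the tokens
    for pos in rng.choice(len(token_ids), size=num_mask, replace=False):
        labels[pos] = token_ids[pos]                      # the model must recover the original token
        p = rng.random_sample()
        if p < 0.8:
            token_ids[pos] = mask_id                      # 80%: replace with [MASK]
        elif p < 0.9:
            token_ids[pos] = rng.randint(num_vocab)       # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return token_ids, labels
```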
Also, this time we will use Sentence-Order Prediction [3] instead of Next Sentence Prediction.
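Sentence-Order Prediction simply asks whether two consecutive sentences appear in their original order. A minimal sketch, with an assumed labeling convention, could look like this:

```python
import random

def make_sop_pair(sent_a, sent_b):
    # Sentence-Order Prediction: swap two consecutive sentences with 50% probability.
    # Label 1 = original order, 0 = swapped order (the labeling convention is an assumption).
    if random.random() < 0.5:
        return sent_a, sent_b, 1
    return sent_b, sent_a, 0
```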
・CPU Intel(R) Core(TM) i7-7700 3.60GHz
・Windows 10 Pro 1909
・Python 3.6.6
・nltk 3.4.5
・numpy 1.17.3
・sentencepiece 0.1.91
The implemented program is published on GitHub.
bert_corpus.py
When the program is executed, it creates the Japanese corpus, with one preprocessed sentence per line and a blank line separating topics.
The SentencePiece model is then trained, producing jawiki.model and jawiki.vocab.
Finally, a text file is created that CTFDeserializer reads during pre-training.
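As a rough sketch, reading such a file with CNTK might look like the following. The output file name and the stream names are assumptions here; they must match whatever bert_corpus.py actually writes.

```python
import cntk as C

num_vocab = 32000   # assumed vocabulary size, matching the SentencePiece sketch above

# Each stream in the CNTK Text Format file is read by name; the names below are hypothetical.
streams = C.io.StreamDefs(
    token=C.io.StreamDef(field="token", shape=num_vocab, is_sparse=True),
    masked_lm=C.io.StreamDef(field="masked_lm", shape=num_vocab, is_sparse=True),
    sop=C.io.StreamDef(field="sop", shape=2, is_sparse=True),
)
reader = C.io.MinibatchSource(
    C.io.CTFDeserializer("./jawiki_pretrain.ctf", streams),
    randomize=True, max_sweeps=C.io.INFINITELY_REPEAT)
```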
Now that the training data is ready, Part 2 will use CNTK for unsupervised pre-training on the Japanese corpus.
Japanese Wikipedia
wikiextractor