This article is for day 21 of the Natural Language Processing #2 Advent Calendar 2019. Incidentally, today is my birthday. Please celebrate. ~~Setting a deadline on my own birthday is pretty masochistic of me~~
In the word-embedding world, BERT has dominated the past year, and even ELMo is becoming less common. Still, you may sometimes want to use classic distributed representations such as word2vec or GloVe, and you may want to train them on your own data (at least I do). So I built a training kit for word2vec / doc2vec / GloVe / fastText for my own use, and I'm publishing it here.
word2vec / doc2vec / fastText models are trained with gensim, while GloVe models are trained with the official implementation.
How to use each package is described in its README, so here I will write about the design concept behind the training kit.
There are many libraries/packages for word embeddings, and each one expects its dataset in a different format. Every time I wrote a preprocessing script to reshape my data for a particular library, the code got messier and messier. So I share a single iterator for reading the text dataset, and each training function converts the data into the format its library expects. The common signature looks like this:
```python
def train_*****_model(
    output_model_path,
    iter_docs,
    **kwargs
)
```
For word2vec:
```python
def train_word2vec_model(
    output_model_path,
    iter_docs,
    size=300,
    window=8,
    min_count=5,
    sg=1,
    epoch=5
):
    """
    Parameters
    ----------
    output_model_path : string
        path of Word2Vec model
    iter_docs : iterator
        iterator of documents, which are raw texts
    size : int
        size of word vector
    window : int
        window size of word2vec
    min_count : int
        minimum word count
    sg : int
        word2vec training algorithm (1: skip-gram, other: CBOW)
    epoch : int
        number of epochs
    """
```
`iter_docs` is an iterator of word lists, one list per document.
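The function body is not shown above; as a reference, here is a minimal sketch of what the word2vec version might look like on top of gensim 3.x (the keyword names `size` and `iter` are the gensim 3.x ones; the actual implementation in the kit may differ):

```python
from gensim.models import Word2Vec

def train_word2vec_model(output_model_path, iter_docs,
                         size=300, window=8, min_count=5, sg=1, epoch=5):
    # gensim scans the corpus once to build the vocabulary and again for each
    # training epoch, so materialize the iterator into a list for simplicity
    docs = list(iter_docs)
    model = Word2Vec(
        docs,
        size=size,            # dimensionality of the word vectors (gensim 3.x keyword)
        window=window,        # context window size
        min_count=min_count,  # drop words rarer than this
        sg=sg,                # 1: skip-gram, otherwise CBOW
        iter=epoch,           # number of training epochs (gensim 3.x keyword)
    )
    model.save(output_model_path)
    return model
```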
An abstract class `TextDatasetBase` defines the dataset-reading API. Any dataset can be handled by writing a reader class for it that inherits from this base class:
```python
from abc import ABC, abstractmethod

class TextDatasetBase(ABC):
    """
    a base class for text dataset
    Attributes
    ----------
    """
    @abstractmethod
    def iter_docs(self):
        """
        iterator of documents
        Parameters
        ----------
        """
        yield None
```
An example dataset class for MARD:
```python
import json
from pathlib import Path

class MARDDataset(TextDatasetBase):
    def __init__(self, word_tokenizer):
        self.root_path = None
        self.word_tokenizer = word_tokenizer

    def iter_docs(self, dataset_path):
        """
        iterate over the documents (reviews) in the dataset
        Parameters
        ----------
        dataset_path: string
            path to dataset
        """
        self.root_path = Path(dataset_path)
        reviews_json_fn = self.root_path / "mard_reviews.json"
        with open(reviews_json_fn, "r") as fi:
            for line in fi:
                review_dict = json.loads(line)  # each line is one JSON review record
                title = review_dict["reviewerID"]
                text = review_dict["reviewText"]
                yield self.word_tokenizer.tokenize(text)
```
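Wiring the two together might look like this (the tokenizer and the dataset path below are placeholders of my own, not part of the kit):

```python
import re

class SimpleTokenizer:
    """A hypothetical tokenizer; anything with a tokenize(text) -> list of words works."""
    def tokenize(self, text):
        return re.findall(r"[a-z0-9']+", text.lower())

dataset = MARDDataset(SimpleTokenizer())
train_word2vec_model(
    "model/word2vec.gensim.model",
    dataset.iter_docs("/path/to/mard"),
    size=100,
    window=8,
    min_count=5,
)
```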
I feel that PyTorch's DataLoader is about 200 million times more sophisticated than this, but this is what I came up with. Please let me know if you have a better design.
Taking word2vec as an example:
```bash
git clone git@github.com:stfate/word2vec-trainer.git
cd word2vec-trainer
git submodule init
git submodule update
pip install -r requirements.txt
python train_text_dataset.py -o $OUTPUT_PATH --dictionary-path=$DIC_PATH --corpus-path=$CORPUS_PATH --size=100 --window=8 --min-count=5
```
model_path = "model/word2vec.gensim.model"
model = Word2Vec.load(model_path)
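The loaded model behaves like any gensim Word2Vec model, so you can query it right away (the query word here is just an illustration):

```python
# nearest neighbors of a word in the learned embedding space
print(model.wv.most_similar("guitar", topn=5))
```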
When training on a large dataset such as Wikipedia, the script may eat up memory and crash. I'm still investigating.
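A workaround I have not verified yet would be to pass gensim a restartable iterable that re-reads the corpus from disk on every pass, instead of materializing every document in memory; something along these lines:

```python
class RestartableCorpus:
    """Sketch: re-reads the dataset from disk on each pass instead of keeping all docs in RAM."""
    def __init__(self, dataset, dataset_path):
        self.dataset = dataset
        self.dataset_path = dataset_path

    def __iter__(self):
        # return a fresh generator for every vocabulary scan / training epoch
        return self.dataset.iter_docs(self.dataset_path)
```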
It's fun to think about a library's API design.