Distributed representations of words (word embeddings) are a staple of modern natural language processing. Many pre-trained models have been released, so it is often unnecessary to spend the time and money to train one yourself. Even so, finding and downloading a publicly available model can take a surprising amount of time.
To remove this friction, I built a downloader for pre-trained word vectors. It is called **chakin**: chakki-works/chakin (a star on GitHub keeps me motivated!)
chakin is written in Python, can be installed with pip, and covers everything from search to download in one stop. It currently supports 23 vectors (as of May 29, 2017), and we plan to support more in the future.
Let's see how to use it.
Installation is easy. Type the following command using pip:

```
$ pip install chakin
```
Once installed, downloading a dataset takes just three lines of code. As an example, let's download a Japanese word-embedding dataset. First, launch Python:

```
$ python
```
Next, import chakin. You can then search for pre-trained models by passing a language (Japanese in this case) to the search method:
```
>>> import chakin
>>> chakin.search(lang='Japanese')
                         Name  Dimension     Corpus VocabularySize              Method  Language
6                fastText(ja)        300  Wikipedia           580K            fastText  Japanese
22  word2vec.Wiki-NEologd.50d         50  Wikipedia           335K  word2vec + NEologd  Japanese
```
At the moment, you can only search by language; making search more flexible is one of the areas where we want to improve usability in the future.
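To make the search behavior concrete, here is a simplified sketch of the idea: a catalog of vectors indexed by position, filtered on the language field. The hard-coded catalog below is illustrative only and is not chakin's actual data structure or implementation.

```python
# Illustrative catalog: a few entries modeled on chakin's supported vectors.
CATALOG = [
    {"name": "fastText(en)", "dim": 300, "lang": "English"},
    {"name": "fastText(ja)", "dim": 300, "lang": "Japanese"},
    {"name": "word2vec.Wiki-NEologd.50d", "dim": 50, "lang": "Japanese"},
]

def search(lang):
    """Return (index, entry) pairs whose language matches `lang`."""
    return [(i, e) for i, e in enumerate(CATALOG) if e["lang"] == lang]

for idx, entry in search("Japanese"):
    print(idx, entry["name"], entry["dim"])
```

The index returned alongside each entry is what you would then pass to the download step.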
Once you find the dataset you want, pass its index to the download method. Here I specified **22**, the index of "word2vec.Wiki-NEologd.50d":
```
>>> chakin.download(number=22, save_dir='./')
Test: 100% |########################################| Time: 0:00:02  60.7 MiB/s
'./latest-ja-word2vec-gensim-model.zip'
```
That's all for how to use it.
The vectors currently supported are listed below. We will keep adding more, so please give it a try.
Name | Dimension | Corpus | VocabularySize | Method | Language |
---|---|---|---|---|---|
fastText(ar) | 300 | Wikipedia | 610K | fastText | Arabic |
fastText(de) | 300 | Wikipedia | 2.3M | fastText | German |
fastText(en) | 300 | Wikipedia | 2.5M | fastText | English |
fastText(es) | 300 | Wikipedia | 985K | fastText | Spanish |
fastText(fr) | 300 | Wikipedia | 1.2M | fastText | French |
fastText(it) | 300 | Wikipedia | 871K | fastText | Italian |
fastText(ja) | 300 | Wikipedia | 580K | fastText | Japanese |
fastText(ko) | 300 | Wikipedia | 880K | fastText | Korean |
fastText(pt) | 300 | Wikipedia | 592K | fastText | Portuguese |
fastText(ru) | 300 | Wikipedia | 1.9M | fastText | Russian |
fastText(zh) | 300 | Wikipedia | 330K | fastText | Chinese |
GloVe.6B.50d | 50 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
GloVe.6B.100d | 100 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
GloVe.6B.200d | 200 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
GloVe.6B.300d | 300 | Wikipedia+Gigaword 5 (6B) | 400K | GloVe | English |
GloVe.42B.300d | 300 | Common Crawl(42B) | 1.9M | GloVe | English |
GloVe.840B.300d | 300 | Common Crawl(840B) | 2.2M | GloVe | English |
GloVe.Twitter.25d | 25 | Twitter(27B) | 1.2M | GloVe | English |
GloVe.Twitter.50d | 50 | Twitter(27B) | 1.2M | GloVe | English |
GloVe.Twitter.100d | 100 | Twitter(27B) | 1.2M | GloVe | English |
GloVe.Twitter.200d | 200 | Twitter(27B) | 1.2M | GloVe | English |
word2vec.GoogleNews | 300 | Google News(100B) | 3.0M | word2vec | English |
word2vec.Wiki-NEologd.50d | 50 | Wikipedia | 335K | word2vec + NEologd | Japanese |
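The GloVe downloads in the table are plain text, with one word per line followed by its vector components separated by spaces, so they can be loaded without any special library. Here is a minimal parser sketch; the two-word, three-dimensional sample is made up for illustration, and the file name in the commented usage line is an assumption.

```python
def load_glove(lines):
    """Parse GloVe's text format (`word v1 v2 ...` per line) into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Illustrative sample, not real GloVe data:
sample = ["the 0.1 -0.2 0.3", "cat 0.5 0.0 -0.1"]
vecs = load_glove(sample)
print(vecs["cat"])  # [0.5, 0.0, -0.1]

# In practice you would pass an open file instead, e.g.:
#   with open("glove.6B.50d.txt") as f:  # file name is an assumption
#       vecs = load_glove(f)
```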
Pre-trained word embeddings are common and important in natural language processing, but finding them on your own is surprisingly tedious. In this article, I introduced a downloader I built to remove that hassle. I hope you find it useful.
I also tweet about machine learning and natural language processing at @Hironsan. If you are interested in this area, please follow me.