100 Language Processing Knock-96 (using Gensim): Extraction of vector for country name

This is a record of problem 96, "Extraction of vector related to country name", from Language processing 100 knock 2015. It extracts only the country name vectors from the Gensim word vectors saved in knock 90. The technique itself is easy, but preparing the list of country names is a bit tedious.

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no particular reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In addition to the above environment, I use the following Python packages. They can be installed with plain pip.

| Type | Version |
|---|---|
| gensim | 3.8.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |


Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

96. Extraction of vector for country name

From the word2vec training results, extract only the vectors for country names.

Problem supplement (about country name)

I considered reusing the country name file from "Language processing 100 knock-81 (collective replacement): Dealing with country names consisting of compound words". However, that file contains no single-word country names (such as "England"), because I removed them in [step "4. Delete Single Name"](https://qiita.com/FukuharaYohei/items/67be619ce9dd33392fcd#4-%E5%8D%98%E4%B8%80%E5%90%8D%E5%89%8A%E9%99%A4). For this problem, I added the country names deleted in that step back into the file and used the result.


Answer program [096. Extraction of vector for country name.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/096.%E5%9B%BD%E5%90%8D%E3%81%AB%E9%96%A2%E3%81%99%E3%82%8B%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')

index = []
vector = []

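with open('./096.countries.txt') as file_in:
    for line in file_in:
        # Spaces become underscores, matching the corpus preprocessing in knock 81
        country = line.rstrip().replace(' ', '_')
        try:
            vector.append(model.wv[country])
            index.append(country)
        except KeyError:
            # The country name is not in the model vocabulary; skip it
            pass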

pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')

Answer commentary

The program reads the file line by line, looks up the vector for each country name, and appends it to a list. Spaces are replaced with underscores because the corpus was preprocessed the same way in "Language processing 100 knock-81 (collective replacement): Dealing with country names consisting of compound words". Some country names do not appear in the corpus at all, and others were dropped from the vocabulary because they appear too rarely, so `except KeyError` catches the lookup error for those names.

for line in file_in:
    country = line.rstrip().replace(' ', '_')
    try:
        vector.append(model.wv[country])
        index.append(country)
    except KeyError:
        pass
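
The lookup-and-skip pattern can be tried without the trained model. The following is only a minimal sketch: `toy_vectors` is a hypothetical plain dict standing in for `model.wv` (both raise `KeyError` for an unknown word), and the names and values are made up for illustration.

```python
# Hypothetical stand-in for model.wv: a plain dict that, like the word
# vectors of a trained model, raises KeyError for an unknown word.
toy_vectors = {
    'Japan': [0.1, 0.2],
    'United_States': [0.3, 0.4],
}

# Country names as they might appear in the name file (with spaces).
names = ['Japan', 'United States', 'Atlantis']

index, vector, missing = [], [], []
for name in names:
    # Same normalization as the answer program: spaces become underscores.
    country = name.rstrip().replace(' ', '_')
    try:
        # The lookup comes first, so a miss leaves both lists untouched.
        vector.append(toy_vectors[country])
        index.append(country)
    except KeyError:
        # Not in the vocabulary: remember the name instead of failing.
        missing.append(country)

print(index)
print(missing)
```

Collecting the skipped names in `missing` is an addition for this sketch only; the answer program simply ignores them.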

After that, the country names are set as the index of a DataFrame, which is written to a file. Vectors for 238 countries are output. Since the original file contained 416 country names, word vectors exist for a little less than 60% of them (238 / 416 ≈ 57%).

pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')
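
As a rough check of the output step, the sketch below writes a small DataFrame to a `.zip` pickle, reads it back with `pd.read_pickle`, and recomputes the coverage figure from the counts above. The toy data, the temporary file name, and the helper `coverage_percent` are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real country vectors (values are made up).
toy = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    index=['Japan', 'United_States'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_country_vector.zip')
    # pandas infers zip compression from the '.zip' extension.
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The index (country names) and the vector values survive the round trip.
round_trip_ok = restored.equals(toy)


def coverage_percent(found, total):
    """Share of country names that have a vector, in percent."""
    return 100.0 * found / total


# Counts reported in this article: 238 vectors out of 416 country names.
article_coverage = coverage_percent(238, 416)
print(round_trip_ok, round(article_coverage, 1))
```

Reading the real file should work the same way with `pd.read_pickle('096.country_vector.zip')`, after which `len(df)` gives the number of countries that had a vector.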

