This is a record of problem 91, "Preparation of analogy data", from the 2015 edition of the Language Processing 100 Knocks. It is technically very easy this time, because it is only preprocessing for the later knocks.

Reference links

Link	Remarks
091.Preparation of analogy data.ipynb	GitHub link to the answer program
100 amateur language processing knocks:91	I always rely on this series when working through the 100 knocks

Environment

Type	Version	Notes
OS	Ubuntu18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

91. Preparation of analogy data

Download Word Analogy Evaluation Data. The line starting with ":" in this data represents the section name. For example, the line ": capital-common-countries" marks the beginning of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation cases included in the section "family" and save them in a file.

The link to the word analogy evaluation data in the original problem statement is broken, so I changed it to a working one.

Problem supplement

The "analogy data" is, as the name suggests, data for evaluating analogies. The first 10 lines are shown below. A line beginning with a colon, such as `: capital-common-countries`, marks the start of a block. It is followed by lines such as `Athens Greece Baghdad Iraq`, each of which holds two capital-country pairs. The data thus consists of block headers, each followed by dozens of lines that hold two pairs of related words. This time, we extract the contents of the `family` block from this data.

`questions-words.txt`


: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
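
The block structure described above can be checked with a short sketch that counts the evaluation lines under each section header. The function name `count_sections` and the small in-memory sample are assumptions for illustration. With the real data you would pass an open file object for `questions-words.txt` instead of the list.

```python
from collections import OrderedDict


def count_sections(lines):
    """Count evaluation lines per section; section headers start with ': '."""
    counts = OrderedDict()
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(': '):
            # A new section begins; remember its name without the ': ' prefix
            current = line[2:]
            counts[current] = 0
        elif current is not None:
            counts[current] += 1
    return counts


# Hypothetical sample in the same format as questions-words.txt
sample = [
    ': capital-common-countries\n',
    'Athens Greece Baghdad Iraq\n',
    'Athens Greece Bangkok Thailand\n',
    ': family\n',
    'boy girl brother sister\n',
]
section_counts = count_sections(sample)
for name, n in section_counts.items():
    print(name, n)
```

Running this over the whole file would show which sections exist and how large the `family` block is before extracting it.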

Answer

Answer program: [091. Preparation of Analogy Data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/091.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E6%BA%96%E5%82%99.ipynb)

with open('./questions-words.txt') as file_in, \
       open('./091.analogy_family.txt', 'w') as file_out:

    target = False      # True while reading lines inside the "family" section
    for line in file_in:

        if target:

            # Inside the target section: write lines until the next section header
            if line.startswith(': '):
                break
            print(line.strip(), file=file_out)

        elif line.startswith(': family'):

            # Found the header of the target section
            target = True
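
The same logic can also be written as a reusable generator. This is only a sketch: the name `extract_section` and the in-memory sample are assumptions, not part of the answer program. One difference is that the answer program matches the prefix `': family'`, which works as long as no other section name starts with "family", whereas this sketch compares the whole section name.

```python
import io


def extract_section(lines, section):
    """Yield the lines of one section, stripped of surrounding whitespace."""
    in_target = False
    for line in lines:
        if line.startswith(': '):
            if in_target:
                break  # the next section header ends the target section
            # Compare the full section name, not just a prefix
            in_target = (line[2:].strip() == section)
        elif in_target:
            yield line.strip()


# Hypothetical sample standing in for questions-words.txt
sample_file = io.StringIO(
    ': capital-common-countries\n'
    'Athens Greece Baghdad Iraq\n'
    ': family\n'
    'boy girl brother sister\n'
    'boy girl brothers sisters\n'
    ': gram1-adjective-to-adverb\n'
    'amazing amazingly apparent apparently\n'
)
family_lines = list(extract_section(sample_file, 'family'))
print(family_lines)
```

With the real data, the generator could be fed an open file object and its output written line by line to `091.analogy_family.txt`.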

Answer commentary

To be honest, nothing here is technically special, so there is little to explain. If anything, more than 90% of it is copied from 100 amateur language processing knocks: 91. The first 10 lines of the resulting text are:

`091.analogy_family.txt`


boy girl brother sister
boy girl brothers sisters
boy girl dad mom
boy girl father mother
boy girl grandfather grandmother
boy girl grandpa grandma
boy girl grandson granddaughter
boy girl groom bride
boy girl he she
boy girl his her
(The rest is omitted.)
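
Since every evaluation case should consist of four words (two pairs), a quick sanity check on the extracted file is to look for lines with a different word count. The following is a minimal sketch; the helper name `find_malformed` and the sample lines, including the deliberately broken one, are assumptions for illustration.

```python
def find_malformed(lines, expected_words=4):
    """Return (line_number, text) pairs whose word count is not expected_words."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if len(line.split()) != expected_words:
            bad.append((number, line.rstrip('\n')))
    return bad


# Hypothetical sample; the last line is deliberately broken for the demonstration
sample = [
    'boy girl brother sister\n',
    'boy girl dad mom\n',
    'boy girl he\n',
]
malformed = find_malformed(sample)
print(malformed)
```

To check the real output, pass an open file object for `091.analogy_family.txt` to `find_malformed`; an empty list means every line has four words.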
