This is a record of the 91st knock, "Preparation of analogy data", of Language processing 100 knock 2015. Technically it is very easy this time, because it is only preprocessing for the knocks that follow.
Link | Remarks |
---|---|
091.Preparation of analogy data.ipynb | Answer program GitHub link |
100 amateur language processing knocks:91 | I always rely on this article when working through the 100 language processing knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
In Chapter 10, we will continue to study word vectors from the previous chapter.
Download Word Analogy Evaluation Data. The line starting with ":" in this data represents the section name. For example, the line ": capital-common-countries" marks the beginning of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation cases included in the section "family" and save them in a file.
"Analogy data" seems to be data for analogy.
The first 10 lines are shown below. A line starting with a colon, such as `: capital-common-countries`, marks the start of a block. It is followed by lines such as `Athens Greece Baghdad Iraq`, each of which holds two capital-country pairs.
In this way, the data consists of blocks, each made up of a header line followed by its data lines with two pairs per line. This time, we will extract the contents of the `family` block from this data.
questions-words.txt
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
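To make the block structure concrete, here is a minimal sketch that groups the lines by section. The function name `group_by_section` and the small inline sample (a few lines from the data, condensed so that two sections fit) are assumptions for illustration only, not part of the original answer.

```python
def group_by_section(lines):
    """Group analogy lines under the section header (': name') that precedes them."""
    sections = {}
    current = None  # No section header seen yet
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith(': '):
            # Header line: everything after ': ' is the section name
            current = line[2:]
            sections.setdefault(current, [])
        elif current is not None:
            # Data line: four words, i.e. two pairs
            sections[current].append(line.split())
    return sections


# Hypothetical condensed sample; the real file has many more lines per section
sample = [
    ': capital-common-countries',
    'Athens Greece Baghdad Iraq',
    'Athens Greece Bangkok Thailand',
    ': family',
    'boy girl brother sister',
]

grouped = group_by_section(sample)
for name, rows in grouped.items():
    print(name, len(rows))
```

With the real file, passing `open('./questions-words.txt')` instead of `sample` should give one entry per section, which is a quick way to see which section names exist.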
with open('./questions-words.txt') as file_in, \
        open('./091.analogy_family.txt', 'w') as file_out:

    target = False  # True while we are inside the target section

    for line in file_in:

        if target:

            # Inside the target section: output until another section starts
            if line.startswith(': '):
                break
            print(line.strip(), file=file_out)

        elif line.startswith(': family'):

            # Found the target section
            target = True
To be honest, nothing here is technically special, so there is little to explain. If anything, more than 90% of it is copied from 100 amateur language processing knocks: 91. The first 10 lines of the resulting text are:
091.analogy_family.txt
boy girl brother sister
boy girl brothers sisters
boy girl dad mom
boy girl father mother
boy girl grandfather grandmother
boy girl grandpa grandma
boy girl grandson granddaughter
boy girl groom bride
boy girl he she
boy girl his her
(The rest is omitted)
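Since later knocks consume this file, a small sanity check may be worthwhile. The sketch below assumes every evaluation case should consist of exactly four whitespace-separated words, as in the lines shown above. The helper name `find_malformed` is hypothetical, and the check runs on a few in-memory lines rather than on the file itself.

```python
def find_malformed(lines, expected_words=4):
    """Return (line number, line) pairs that do not have the expected word count."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if len(line.split()) != expected_words
    ]


# A few lines copied from the extracted output above
extracted = [
    'boy girl brother sister',
    'boy girl brothers sisters',
    'boy girl dad mom',
]

malformed = find_malformed(extracted)
print('malformed lines:', malformed)
```

To check the real output, the same function could be called with `open('./091.analogy_family.txt')` in place of `extracted`. An empty list would suggest the extraction stopped cleanly at the next section header.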