This is the 81st entry in my record of working through the 2015 edition of the 100 Language Processing Knocks: "Dealing with country names consisting of compound words". Like the previous entry on corpus shaping, this one is preprocessing; the main processing is character replacement using regular expressions. However, I did the tedious part, building the country name list, by hand. Because of that, the programming itself is not difficult, but it took time.
Link | Remarks |
---|---|
081.Dealing with country names consisting of compound words.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 81 | A series of 100-knock solutions I am always indebted to |
100 language processing knock 2015 version (80-82) | Its Chapter 9 write-up was helpful |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented as several steps, applying principal component analysis to the word-context co-occurrence matrix created from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarities and perform analogies.
Note that implementing problem 83 naively requires a large amount of main memory (about 7 GB). If you run out of memory, devise a workaround or use the 1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).
This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.
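As a side note, a bz2 archive like this can be streamed directly with Python's standard `bz2` module instead of decompressing it first. This is a minimal sketch; the round trip through a tiny local file (`sample.txt.bz2` and its one-line content) is my own illustration, not from the article:

```python
import bz2

# Write a tiny bz2-compressed text file, then stream it back line by
# line, the same way a large corpus archive can be read in text mode
with bz2.open('sample.txt.bz2', mode='wt') as f:
    f.write('Anarchism is a political philosophy\n')

with bz2.open('sample.txt.bz2', mode='rt') as corpus:
    lines = [line.rstrip('\n') for line in corpus]
print(lines)  # → ['Anarchism is a political philosophy']
```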
In English, a concatenation of multiple words can form a meaningful unit. For example, the United States of America is expressed as "United States" and the United Kingdom as "United Kingdom", but the words "United", "States", and "Kingdom" on their own are ambiguous as to the concept or entity they point to. Therefore, we would like to estimate the meaning of a compound word by recognizing the compound words contained in the corpus and treating each one as a single word. However, accurately identifying compound words is very difficult, so here we only identify country names consisting of compound words.
Obtain your own list of country names from the Internet, and for the compound-word country names that appear in the corpus from problem 80, replace the spaces with underscores. For example, "United States" should become "United_States" and "Isle of Man" should become "Isle_of_Man".
It is troublesome to "get the country name list by yourself from the Internet" ...
I thought the page ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/) would be good, but it does not contain the "Isle of Man" from the problem statement. "Isle of Man" does appear in ISO 3166-1, so I also took a list from [Wikipedia's ISO 3166-1 article](https://en.wikipedia.org/wiki/ISO_3166-1). In other words, the country name list is created from the following three sources.

- The Short name column of ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/)
- The Official name column of ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/)
- [Wikipedia's ISO 3166-1 article](https://en.wikipedia.org/wiki/ISO_3166-1)

Some names obtained from the Official name column are prefixed with "the", which I removed later because it got in the way. Since the list comes from three sources, some country names were duplicated, so I deleted the duplicates.
The theme this time is "country names consisting of compound words", so single-word country names are not needed. I used `=COUNTIF(A1,"* *")` in Excel to judge country names containing a space as compound words, and removed the country names for which the Excel function returned 0.
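The same filter can also be expressed in a few lines of Python instead of Excel. This is just a sketch with made-up sample names (not the actual list): keeping only entries that contain a space is equivalent to `=COUNTIF(A1,"* *")` being non-zero.

```python
# Keep only names containing a space, i.e. compound-word country names
# (the sample names here are illustrative, not the full 247-entry list)
names = ['United States', 'Japan', 'Isle of Man', 'France']
compound = [name for name in names if ' ' in name]
print(compound)  # → ['United States', 'Isle of Man']
```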
Some of them cannot be used as they are, so I made fine adjustments manually. It takes time ... The following is an example.
Before | After change |
---|---|
Bolivia (Plurinational State of) | Plurinational State of Bolivia |
Cocos (Keeling) Islands | Cocos Keeling Islands, Cocos Keeling, Cocos Islands, Keeling Islands |
In the end, 247 country names were created.
Here is the program. The processing is short and trivial (though it took me a couple of hours to write due to lack of skill ...).
However, searching and replacing the full text for all 247 country names takes about 12 minutes. According to the article "100 knocks of language processing 2015 version (80-82)", it is faster to use the sed command.
```python
import re

# Strip newlines from the country list and prefix each name with its
# word count so that longer names sort first
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]
country_num.sort(reverse=True)

with open('./080.corpus.txt') as file_in:
    body = file_in.read()

for i, country in enumerate(country_num):
    print(i, country[1])
    regex = re.compile(country[1], re.IGNORECASE)
    body = regex.sub(country[1].replace(' ', '_'), body)

with open('./081.corpus.txt', mode='w') as file_out:
    file_out.write(body)
```
The country name list file is read, each name is prefixed with its word count, and the list is sorted in descending order. This is so that "United States of America" is not partially replaced by the shorter "United States" first, which would leave "United_States of America" instead of "United_States_of_America".
```python
# Strip newlines from the country list and prefix each name with its
# word count so that longer names sort first
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]
country_num.sort(reverse=True)
```
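A toy example (the sentence and the two-name list are mine) shows why the descending sort matters: with the longer name replaced first, the whole phrase is joined, and the underscores then prevent the shorter pattern from matching inside it.

```python
import re

body = 'the United States of America is large'
countries = ['United States', 'United States of America']

# Replace longer names first; once underscores are inserted, the
# shorter pattern no longer matches inside the joined phrase
for name in sorted(countries, key=lambda c: len(c.split()), reverse=True):
    body = re.sub(name, name.replace(' ', '_'), body, flags=re.IGNORECASE)
print(body)  # → 'the United_States_of_America is large'
```

With an ascending sort, "United States" would match first and the result would be the unwanted "United_States of America".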
By using re.IGNORECASE in the regular expression, the replacement is done case-insensitively (I have not confirmed how much this case variation actually helps).
```python
regex = re.compile(country[1], re.IGNORECASE)
```
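A quick check of what re.IGNORECASE does here (the example string is my own):

```python
import re

# The compiled pattern matches regardless of case, so lowercase and
# uppercase occurrences are both replaced with the underscored form
regex = re.compile('United States', re.IGNORECASE)
result = regex.sub('United_States', 'the united states and the UNITED STATES')
print(result)  # → 'the United_States and the United_States'
```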