100 Language Processing Knock-82 (Context Word): Context Extraction

This is my record of problem 82, "Context Extraction", from the 2015 edition of the 100 Language Processing Knocks. As with the previous problem, this is preprocessing for the later steps. Nothing particularly difficult is done, and there is little to explain technically. However, the problem statement was hard for an amateur like me to understand, and it took some time to work out what was being asked.

Reference link

| Link | Remarks |
|---|---|
| 082.Extraction of context.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks:82 | I always rely on this series when working through the 100 knocks |
| 100 language processing knock 2015 version(80~82) | I referred to this for Chapter 9 |

Environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, compressed in bzip2 format, that were randomly sampled at a rate of 1/10 from the English Wikipedia articles of more than 400 words as of January 12, 2015. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meaning of words. In the first half of Chapter 9, principal component analysis is applied to the word-context co-occurrence matrix created from the corpus, and the process of learning word vectors is implemented in several steps. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarity and to solve analogies.

Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory. If you run out of memory, either devise a more economical process or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2.

This time the *"1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)"* is used.

82. Extraction of context

For every word $t$ that appears in the corpus created in problem 81, write out all pairs of the word $t$ and a context word $c$ in tab-delimited format. The context words are defined as follows.

- Extract the $d$ words before and after a word $t$ as context words $c$ (the context words do not include the word $t$ itself).
- Each time a word $t$ is selected, the context width $d$ is chosen at random from {1,2,3,4,5}.

Problem supplement

What is a "context word"?

The word being focused on is called the **"target word"**, and the words surrounding it are called **"context words"**. The number of words from the target word out to the farthest context word on each side is called the **"context width"** (context window size, or simply window size).

I will explain using the following example sentence taken from the assignment's source file.

No surface details of Adrastea are known due to the low resolution of available images

For example, if *Adrastea* is the target word, the preceding "details" and "of" and the following "are" and "known" are the context words for a context width of 2. So, to carry out this task on the sentence above with a context width of 2, we would create a file that begins as follows.

| Column 1 | Column 2 |
|---|---|
| No | surface |
| No | details |
| surface | No |
| surface | details |
| surface | of |
| details | No |
| details | surface |
| details | of |
| details | Adrastea |
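
The rows above can be reproduced with a small sketch that uses a fixed context width instead of a random one. The function name `context_pairs` and the fixed `d=2` are assumptions made only for this illustration; the actual answer program draws `d` at random for each target word.

```python
def context_pairs(tokens, d):
    """Return (target, context) pairs using a fixed context width d."""
    pairs = []
    for j, target in enumerate(tokens):
        # Clamp the window so it never leaves the sentence.
        start = max(j - d, 0)
        end = min(j + d + 1, len(tokens))
        for k in range(start, end):
            if k != j:  # skip the target position itself
                pairs.append((target, tokens[k]))
    return pairs


sentence = ("No surface details of Adrastea are known "
            "due to the low resolution of available images")
pairs = context_pairs(sentence.split(' '), d=2)

# Print the first nine pairs in the tab-delimited format of the task.
for target, context in pairs[:9]:
    print(target + '\t' + context)
```

Words at the start of the sentence have fewer context words because there is nothing to their left, which is why "No" produces only two rows.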

Answer

Answer program: [082. Extraction of context.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/082.%E6%96%87%E8%84%88%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

Although it is a short program of about 20 lines, it takes about 10 minutes to run because of the large amount of data. Also note that the output file is large, about 800 MB. By the way, more than 90% of the code is copied from the article "Amateur Language Processing 100 Knock: 82".

import random

with open('./081.corpus.txt') as file_in, \
     open('./082.context.txt', mode='w') as file_out:
    for i, line in enumerate(file_in):
        tokens = line.strip().split(' ')
        for j in range(len(tokens)):
            d = random.randint(1, 5)  # Context width d (1 to 5 inclusive)

            # Enumerate the words within d positions before and after
            for k in range(max(j - d, 0), min(j + d + 1, len(tokens))):

                # Do not output the target word itself
                if j != k:
                    file_out.write(tokens[j] + '\t' + tokens[k] + '\n')
        if i < 4:
            print(len(tokens), tokens)
        else:
            print('\r Processing line: {0}'.format(i), end='')
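
To try the logic on a small input without producing the 800 MB file, the same loop can be wrapped in a function. This is only a sketch under my own assumptions: the function name `write_context_pairs`, the `rng` parameter, the seed `0` and the sample line are hypothetical and are not part of the original answer.

```python
import io
import random


def write_context_pairs(lines, file_out, rng):
    """Write target<TAB>context rows for each line, drawing d per target word."""
    for line in lines:
        tokens = line.strip().split(' ')
        for j, target in enumerate(tokens):
            d = rng.randint(1, 5)  # inclusive on both ends: 1, 2, 3, 4 or 5
            for k in range(max(j - d, 0), min(j + d + 1, len(tokens))):
                if k != j:
                    file_out.write(target + '\t' + tokens[k] + '\n')


rng = random.Random(0)   # fixed seed, assumed here only to make runs repeatable
buffer = io.StringIO()   # stands in for 082.context.txt
write_context_pairs(["a b c d e f g\n"], buffer, rng)

rows = [row.split('\t') for row in buffer.getvalue().splitlines()]
print(len(rows), rows[:3])
```

Passing a `random.Random` instance instead of calling the module-level function keeps the random widths reproducible, which makes it easier to compare two runs while debugging.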

Answer commentary

The code below is the main part. It loops over the positions from d before to d after the target word position j. Simply adding and subtracting d, however, gives a negative index near the first word and an index beyond the number of words near the last word, so the max and min functions are used to clamp the range.

# Enumerate the words within d positions before and after
for k in range(max(j - d, 0), min(j + d + 1, len(tokens))):

    # Do not output the target word itself
    if j != k:
        file_out.write(tokens[j] + '\t' + tokens[k] + '\n')
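
The effect of the max and min adjustment can be seen by listing the index ranges on their own. This is a minimal sketch; the helper name `window_range` and the 5-token sentence length are assumptions chosen for illustration.

```python
def window_range(j, d, n):
    """Return the indices within d of position j, clamped to the range [0, n)."""
    return range(max(j - d, 0), min(j + d + 1, n))


n = 5  # length of a hypothetical 5-token sentence
for j in range(n):
    print(j, list(window_range(j, d=2, n=n)))
```

Without `max`, a negative `k` would not raise an error in Python: `tokens[-1]` silently refers to the last word, so wrong pairs would be written. Without `min`, an index past the end would raise an `IndexError`.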

For the first four lines, the number of tokens and the token list of the sentence being processed are printed to the console. After that, only the number of the line being processed is printed.

if i < 4:
    print(len(tokens), tokens)
else:
    print('\r Processing line: {0}'.format(i), end='')

Side note (tokenization failure)

This is a story about a failure in tokenizing sentences. At first, I used the split function as shown below without thinking too much.

tokens = line.split()

However, some of the results came out like this, and I only noticed the error later when I used pandas.

"b")("s"	"c
−	"b")("s"
−	"c

It should have looked like the line below. At first glance the text appears to be separated by spaces, but the space-like characters are actually \xa0 (non-breaking space). I touched on \xa0 briefly in the previous article.

known	k" = √("s"("s" − "a")("s" − "b")("s" − "c

So, to get the correct result, I strip the line with the strip function and then split it on the ordinary space character only.

tokens = line.strip().split(' ')
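
The difference between the two calls can be checked on a short string. The sample line below is made up for illustration and only imitates the formula fragment above: the ordinary spaces separate tokens, and the \xa0 characters sit inside one token.

```python
# Hypothetical corpus line: three space-separated tokens, the middle one
# containing non-breaking spaces (\xa0).
line = 'known k"\xa0=\xa0√("s"\xa0−\xa0"c" images\n'

naive = line.split()             # splits on any Unicode whitespace, \xa0 included
fixed = line.strip().split(' ')  # splits only on the ordinary space U+0020

print(len(naive), naive)
print(len(fixed), fixed)
```

One trade-off to keep in mind: unlike `split()`, `split(' ')` does not merge consecutive spaces, so a line containing two spaces in a row would yield an empty-string token.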
