It's the first day of the Advent Calendar, but I'm writing something that's easy to write, without any special sense of occasion.
Since I solved the 100 Language Processing Knocks, I'll write up my answers and impressions one by one (it's currently 20:30 on 11/30, so only the first part is ready).
- Environment
    - Dockerfile link (including irrelevant items)
- Ability
    - I had touched MeCab and gensim without really knowing anything
- How to solve
    - For the time being I solved everything on my own, and googled after every 10 questions to check anything I was unsure about
    - Because of that, some problems have two answers
- Excuses
    - Fixes for the wrong answers I noticed while compiling this article didn't make it in time
- Thanks
    - I am very grateful to those who taught themselves, went through the darkness, and published teaching materials like this
I had to reread my own code to write this article, so it turned into a pseudo code review.
--Mixed "
and'
--A mixture of r in row and l in line
Many improvements were found
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).
## 00
smt = "stressed"
ans = ""
for i in range(len(smt)):
    ans += smt[-i - 1]
print(ans)
A writing style I didn't know: ↓
## 00
smt = "stressed"
smt[::-1]
In other words, it's `list[start:stop:step]`.
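To convince myself of the notation, a quick check (my own toy example):

smt = "stressed"
print(smt[::-1])   # 'desserts': step -1 walks from the end to the beginning
print(smt[2:6:2])  # 'rs': start 2, stop 6, step 2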
Take out the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.
## 1
smt = "パタトクカシーー"
''.join([smt[i] for i in range(len(smt)) if i % 2 == 0])
I didn't know about slicing at this point yet, so I'll rewrite this one too: ↓
## 1
smt = "パタトクカシーー"
smt[::2]
Obtain the string "パタトクカシーー" by alternately connecting the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.
## 2
smt1 = "パトカー"
smt2 = "タクシー"
''.join([p + t for p, t in zip(smt1, smt2)])
This one feels a bit brute-force.
Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." Into words, and create a list of the number of characters (in the alphabet) of each word in order of appearance.
## 3
smt = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
[len(w) - w.count(',') - w.count('.') for w in smt.split(' ')]
I wanted to write it nicely using `isalpha()`, but I couldn't work out the double loop inside a comprehension, so I made this my answer for the time being.
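In hindsight, `isalpha()` does fit in a comprehension if `sum` over the characters stands in for the inner loop; a sketch:

## 3 (alternative)
smt = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
[sum(c.isalpha() for c in w) for w in smt.split(' ')]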
Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." Into words 1, 5, 6, 7, 8, 9, 15, 16, The 19th word is the first character, and the other words are the first two characters. Create.
## 4
smt = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
dic = {}
target_index = [1, 5, 6, 7, 8, 9, 15, 16, 19]
for i, w in enumerate(smt.split(' ')):
    if i + 1 in target_index:
        dic[i + 1] = w[0]
    else:
        dic[i + 1] = w[:2]
dic
Is it okay to hard-code the target indices? Is it okay to branch with an if? I was getting too paranoid about it, but I proceeded as is.
05 n-gram
Create a function that builds an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams of the sentence "I am an NLPer".
## 5
def get_n_gram(n, smt):
    words = smt.split(' ')
    return [smt[i:i+n] for i in range(len(smt) - n + 1)], [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

get_n_gram(3, "I am an NLPer")
I figured I could write it neatly with slices, but it may be better to separate the character version and the word version; a sketch follows.
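A sketch of the separated version I had in mind:

## 5 (alternative): separate character n-grams and word n-grams
def char_n_gram(n, smt):
    return [smt[i:i+n] for i in range(len(smt) - n + 1)]

def word_n_gram(n, smt):
    words = smt.split(' ')
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

print(char_n_gram(2, "I am an NLPer"))
print(word_n_gram(2, "I am an NLPer"))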
Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and compute the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.
## 6
smt1 = "paraparaparadise"
smt2 = "paragraph"
X = set()
for i in range(len(smt1) - 2 + 1):
    X.add(smt1[i:i+2])
Y = set()
for i in range(len(smt2) - 2 + 1):
    Y.add(smt2[i:i+2])
print(X | Y)
print(X & Y)
print(X - Y)
print('se' in X)
print('se' in Y)
I should have written it with a comprehension... I reconfirmed that a set removes duplicates, and when I want unique items from a list, converting it to a set once may be the way to go.
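A sketch of the comprehension version, reusing smt1 and smt2 from above (a set comprehension builds each set directly):

## 6 (alternative)
X = {smt1[i:i+2] for i in range(len(smt1) - 1)}
Y = {smt2[i:i+2] for i in range(len(smt2) - 1)}
print('se' in X, 'se' in Y)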
Implement a function that takes arguments x, y, z and returns the string "x時のyはz" ("y at x o'clock is z"). Furthermore, set x = 12, y = "気温" (temperature), z = 22.4, and check the execution result.
## 7
def get_template(x, y, z):
    return "{}時の{}は{}".format(x, y, z)

get_template(12, '気温', 22.4)
I could do this one because I use format() all the time, but I often forget how to specify positions with {0} and the like.
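For my own notes, positional indices let you reorder or reuse arguments (toy examples):

print("{1} at {0} is {2}".format(12, 'temperature', 22.4))  # temperature at 12 is 22.4
print("{0}-{0}".format('ab'))                               # ab-ab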
Implement a function cipher that converts each character of a given string according to the following specification: replace each lowercase letter with the character whose code is (219 - character code); output every other character as it is. Use this function to encrypt and decrypt English messages.
## 8
class Coder:
    def encode(self, smt):
        code = ""
        for i in range(len(smt)):
            if smt[i].isalpha() and smt[i].islower():
                code += chr(219 - ord(smt[i]))
            else:
                code += smt[i]
        return code

    def decode(self, code):
        smt = ""
        for i in range(len(code)):
            if code[i].isalpha() and code[i].islower():
                smt += chr(219 - ord(code[i]))
            else:
                smt += code[i]
        return smt

coder = Coder()
smt = "I couldn't believe that"
code = coder.encode(smt)
desmt = coder.decode(code)
print(smt)
print(code)
print(desmt)
My eyes were failing me: I misread cipher as coder until this very moment, and it was supposed to be a function, not a class. Also, I forget character codes no matter how many times I look them up, so I want to summarize them sometime.
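A sketch of the function version the problem presumably intended; since the transformation is its own inverse, one function both encrypts and decrypts:

## 8 (alternative)
def cipher(smt):
    # replace each lowercase letter c with chr(219 - ord(c)); pass everything else through
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in smt)

smt = "I couldn't believe that"
print(cipher(smt))
print(cipher(cipher(smt)))  # applying it twice restores the original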
09 Typoglycemia
Create a program that, for a text of words separated by spaces, randomly rearranges the order of the letters of each word while leaving the first and last letters in place. However, words of length 4 or less are not rearranged. Give an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.
## 9
import random
def feel_typoglycemia(smt):
    typogly = []
    for w in smt.split(' '):
        if len(w) <= 4:
            typogly.append(w)
        else:
            mid = list(w)[1:-1]
            random.shuffle(mid)
            typogly.append(w[0] + ''.join(mid) + w[-1])
    return ' '.join(typogly)

smt = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
feel_typoglycemia(smt)
I sliced out everything but the first and last characters, shuffled it, and stuck it back together. I don't remember why I named the function that; I wasn't planning to publish it, so I must have named it on a whim at the time.
I thought the chapter title meant learning the commands themselves, but the UNIX commands were just for checking the program's output.
hightemp.txt is a file storing records of the highest temperatures in Japan in tab-delimited format, with columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the programs' output.
Count the number of lines. Use the wc command for confirmation.
## 10
with open('./hightemp.txt') as f:
    print(len([r for r in f.read().split('\n') if r != '']))
## 10
cat hightemp.txt | wc -l
I think `r` stands for row, but after this it gets mixed with `l` for line.
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
## 11
with open('./hightemp.txt') as f:
    print([r.replace('\t', ' ') for r in f.read().split('\n') if r != ''])
## 11
cat hightemp.txt | sed "s/\t/\ /g"
## 11
cat hightemp.txt | tr "\t" "\ "
## 11
expand -t 1 hightemp.txt
I recognized sed as the thing I often use in vim; tr and expand were new to me.
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
## 12
with open('./hightemp.txt') as f:
    table = [r for r in f.read().split('\n') if r != '']
with open('col1.txt', mode='w') as f:
    for t in table:
        f.write(t.split('\t')[0] + '\n')
with open('col2.txt', mode='w') as f:
    for t in table:
        f.write(t.split('\t')[1] + '\n')
## 12
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " > col1.txt
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 2 -d " " > col2.txt
I did it the straightforward way, without really having a mental image of operating on columns.
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
## 13
with open('cols.txt', mode='w') as c:
    with open('col1.txt') as f:
        with open('col2.txt') as ff:
            r1 = f.readline()
            r2 = ff.readline()
            while r1 and r2:
                c.write(r1.replace('\n', '') + '\t' + r2)
                r1 = f.readline()
                r2 = ff.readline()
## 13
paste col1.txt col2.txt > cols.txt
cat cols.txt
There's an air of clumsiness oozing out of f and ff; a zip-based sketch follows. paste was new to me.
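In hindsight, zipping the two file objects avoids the readline bookkeeping entirely; a sketch:

## 13 (alternative)
with open('col1.txt') as f, open('col2.txt') as ff, open('cols.txt', mode='w') as c:
    for r1, r2 in zip(f, ff):
        c.write(r1.rstrip('\n') + '\t' + r2)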
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
## 14
n = 5
with open('./hightemp.txt') as f:
    lines = f.read()
for l in lines.split('\n')[:n]:
    print(l)

## 14
head -n 5 hightemp.txt
This is a clear wrong answer because I forgot the command-line-argument part; I'll add a version using sys.argv.
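A sketch of the sys.argv version, saved as a script (the file name knock14.py is made up) and run as python knock14.py 5:

## 14 (with command line argument)
import sys

n = int(sys.argv[1])  # N comes from the command line
with open('./hightemp.txt') as f:
    for l in f.read().split('\n')[:n]:
        print(l)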
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
## 15
n = 5
with open('./hightemp.txt') as f:
    lines = f.read()
for l in lines.split('\n')[-n:]:
    print(l)
## 15
tail -n 5 hightemp.txt
Similarly, I forgot the command-line-argument part, so it's a clear wrong answer; the same sys.argv pattern as the sketch above applies.
Receive a natural number N by means such as command line arguments, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
## 16
import math
with open('./hightemp.txt') as f:
    obj = f.read()
lines = [l for l in obj.split('\n')]
n = 3
ni = math.ceil(len(lines) / n)
for i in range(0, len(lines), ni):
    j = i + ni
    print(len(lines[i:j]))
## 16
split -n 5 hightemp.txt
Similarly a clear wrong answer for forgetting the command-line-argument part; a sys.argv version is sketched below.
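As above, a sketch with sys.argv (a made-up knock16.py, run as python knock16.py 3):

## 16 (with command line argument)
import math
import sys

n = int(sys.argv[1])
with open('./hightemp.txt') as f:
    lines = f.read().split('\n')
size = math.ceil(len(lines) / n)  # lines per chunk
for i in range(0, len(lines), size):
    print(len(lines[i:i + size]))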
Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands for confirmation.
## 17
with open('./hightemp.txt') as f:
    obj = f.read()
set(row.split('\t')[0] for row in obj.split('\n') if not row == '')
## 17
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " | sort | uniq
This was the first time I chained pipes like this, so I learned the joy of one-liners.
Sort the rows in descending order of the numbers in the third column (note: rearrange the rows without changing their contents). Use the sort command for confirmation (this problem's result does not have to match the command's exactly).
## 18
with open('./hightemp.txt') as f:
    obj = f.read()
rows = [row for row in obj.split('\n') if not row == '']
sorted(rows, key=lambda x: -1 * float(x.split('\t')[2]))
## 18
cat hightemp.txt | sed "s/\t/\ /g" | sort -r -k 3 -t " "
A cast to float was needed.
Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.
## 19
with open('./hightemp.txt') as f:
    obj = f.read()
rows = [row.split('\t')[0] for row in obj.split('\n') if not row == '']
c_dic = {}
for k in set(rows):
    c_dic[k] = rows.count(k)
sorted(c_dic.items(), key=lambda x: -x[1])
## 19
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " | sort | uniq -c | sort -rn -k 3 -t " "
Whether to name it r or row is a point for reflection.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format: information for one article is stored per line in JSON format, where each line stores the article name under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON; the whole file is gzip-compressed. Create programs that perform the following processing.
I feel like I slipped through this chapter without using regular expressions much.
wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz
Since I was running in a Jupyter notebook, I ran this with a ! at the beginning.
Read the JSON file of Wikipedia articles and display the article text for "イギリス" (UK). For problems 21-29, run against the article text extracted here.
## 20
import json, gzip
with gzip.open('jawiki-country.json.gz', 'rt') as f:
    obj = json.loads(f.readline())
    while obj:
        try:
            obj = json.loads(f.readline())
            if obj['title'] == "イギリス":
                break
        except:
            obj = f.readline()
I didn't know about the gzip module at all, so that was a lesson.
Extract the line that declares the category name in the article.
## 21
for l in obj['text'].split('\n'):
    if 'Category' in l:
        print(l)
A stricter condition might be better; for example the sketch below.
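A sketch of a stricter condition that only matches lines actually starting with the category markup (reusing obj from problem 20):

## 21 (alternative)
import re
for l in obj['text'].split('\n'):
    if re.match(r'\[\[Category:', l):
        print(l)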
Extract the article category names (by name, not line by line).
## 22
import re
head_pattern = r'\[\[Category:'
tail_pattern = r'\|?\*?\]\]'
for l in obj['text'].split('\n'):
    if 'Category' in l:
        l = re.sub(head_pattern, '', l)
        print(re.sub(tail_pattern, '', l))
I wrote this one by brute force.
Display the section names and their levels contained in the article (for example, 1 if "== section name ==").
## 23
pattern = '=='
for l in obj['text'].split('\n'):
    if pattern in l:
        pat_by_sec = ''.join([r'=' for i in range(int(l.count('=') / 2))])
        sec = len(pat_by_sec) - 1
        tab = ''.join(['\t' for i in range(sec - 1)])
        print('{}{}. {}'.format(tab, sec, l.replace('=', '')))
It's a little roundabout because I wanted to indent with tabs when displaying; a regex sketch follows.
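A less roundabout sketch using a regex that captures the run of '=' and the section name together (reusing obj from problem 20):

## 23 (alternative)
import re
for l in obj['text'].split('\n'):
    m = re.match(r'(={2,})\s*(.+?)\s*\1$', l)
    if m:
        sec = len(m.group(1)) - 1  # '==' is level 1
        print('{}{}. {}'.format('\t' * (sec - 1), sec, m.group(2)))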
Extract all media files referenced from the article.
## 24
for l in obj['text'].split('\n'):
    if 'File' in l:
        print(l.split(':')[1].split('|')[0])
A stricter if statement may be better here as well; for instance the sketch below.
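A sketch of a stricter version with a regex over the whole text, assuming references look like [[File:...|...]] or [[ファイル:...|...]] (reusing obj from problem 20):

## 24 (alternative)
import re
for m in re.finditer(r'\[\[(?:File|ファイル):([^|\]]+)', obj['text']):
    print(m.group(1))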
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
## 25
import re
pattern = r' = '
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        basic_info[l.split(' = ')[0].replace('|', '')] = l.split(' = ')[1]
basic_info
Chaining methods this much when processing text doesn't seem great; a sketch of an alternative follows.
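What I mean, as a sketch: split once, name the pieces, then build the entry (reusing obj from problem 20):

## 25 (alternative)
basic_info = {}
for l in obj['text'].split('\n'):
    if ' = ' in l:
        key, val = l.split(' = ', 1)
        basic_info[key.replace('|', '')] = val
basic_info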
In addition to the processing of 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, and strong emphasis, all of them) from the template values and convert them to plain text (reference: markup quick reference table).
## 26
import re
pattern = r' = '
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        basic_info[l.split(' = ')[0].replace('|', '')] = l.split(' = ')[1].replace('\'', '')
basic_info
I started to think that in text processing it's fine to proceed with hard-coding, without demanding generality.
In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).
## 27
import re
pattern = r' = '
med_link = r'\[|\]'
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        val = l.split(' = ')[1].replace('\'', '')
        val = re.sub(med_link, '', val)
        basic_info[l.split(' = ')[0].replace('|', '')] = val
basic_info
While watching the output, I was making corrections on an ad hoc basis
In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
## 28
import re
pattern = r' = '
med_link = r'\[|\]'
strong = r'\{|\}'
tag = r'\<+.*\>'
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        val = l.split(' = ')[1].replace('\'', '')
        val = re.sub(med_link, '', val)
        val = re.sub(strong, '', val)
        val = re.sub(tag, '', val)
        basic_info[l.split(' = ')[0].replace('|', '')] = val
basic_info
I gave up early because it said "as much as possible".
Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)
## 29
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:" + basic_info['国旗画像']
}
R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA["query"]["pages"]
for k, v in PAGES.items():
    for kk, vv in v.items():
        if kk == 'imageinfo':
            print(vv[0]['url'])
I hit the API with reference to the sample code.
Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.
wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt

import MeCab
t = MeCab.Tagger()
with open('./neko.txt') as f:
    text = f.read()
with open('./neko.txt.mecab', mode='w') as f:
    f.write(t.parse(text))
Until now I always ran the analysis as one continuous process without saving the results. Saving them like this seems better, so that was a lesson.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys surface (surface form), base (uninflected form), pos (part of speech), and pos1 (part-of-speech subcategory 1), and express one sentence as a list of such morpheme mappings. Use this program for the remaining problems in Chapter 4.
## 30
doc = []
with open('./neko.txt.mecab') as f:
    token_list = []
    token = f.readline()
    while 'EOS' not in token:
        dic = {}
        dic['surface'] = token.split('\t')[0]
        dic['base'] = token.split('\t')[1].split(',')[-3]
        dic['pos'] = token.split('\t')[1].split(',')[0]
        dic['pos1'] = token.split('\t')[1].split(',')[1]
        token = f.readline()
        if dic['surface'] == '。':
            doc.append(token_list)
            token_list = []
            continue
        token_list.append(dic)
It may be better to store the return value of `token.split('\t')` once; what I mean is sketched below.
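A sketch against a sample MeCab output line (IPADIC format assumed; the sample line is made up):

# split once, reuse the pieces
token = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'
surface, feature = token.split('\t')
fields = feature.split(',')
dic = {'surface': surface, 'pos': fields[0], 'pos1': fields[1], 'base': fields[-3]}
print(dic)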
Extract all the surface forms of the verb.
## 31
for s in doc:
    for t in s:
        if t['pos'] == '動詞':
            print(t['surface'])
These days I would definitely write `[t['surface'] for t in s if t['pos'] == '動詞']`.
Extract all the original forms of the verb.
## 32
for s in doc:
    for t in s:
        if t['pos'] == '動詞':
            print(t['base'])
Similarly, `[t['base'] for t in s if t['pos'] == '動詞']`.
Extract all nouns with sa-hen connection (サ変接続).
## 33
for s in doc:
    for t in s:
        if t['pos1'] == 'サ変接続':
            print(t['base'])
Similarly, `[t['base'] for t in s if t['pos1'] == 'サ変接続']`.
Extract noun phrases in which two nouns are connected by "の".
## 34
for s in doc:
    for i, t in enumerate(s):
        if t['surface'] == 'の' and i + 1 != len(s):
            if s[i - 1]['pos'] == '名詞' and s[i + 1]['pos'] == '名詞':
                print(s[i - 1]['surface'] + t['base'] + s[i + 1]['surface'])
I guard the index against running off the end, but I also assume no sentence starts with the morpheme "の". That assumption is probably not good; a guarded sketch follows.
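A sketch that guards both ends of the index instead of relying on that assumption (reusing doc from problem 30):

## 34 (alternative)
for s in doc:
    for i in range(1, len(s) - 1):
        if s[i]['surface'] == 'の' and s[i - 1]['pos'] == '名詞' and s[i + 1]['pos'] == '名詞':
            print(s[i - 1]['surface'] + s[i]['surface'] + s[i + 1]['surface'])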
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
## 35
max_list = []
tmp = ""
max_len = len(tmp)
for s in doc:
    for i, t in enumerate(s):
        if t['pos'] == '名詞':
            tmp += t['surface']
        else:
            if len(tmp) == max_len:
                max_list.append(tmp)
            elif len(tmp) > max_len:
                max_list = []
                max_list.append(tmp)
                max_len = len(tmp)
            tmp = ''
print(len(max_list[0]))
print(max_list)
The longest was 30 characters, a run of English words.
Find the words that appear in the sentence and their frequency of appearance, and arrange them in descending order of frequency of appearance.
## 36
base_list = []
count_dic = {}
for s in doc:
    for t in s:
        base_list.append(t['base'])
for word in set(base_list):
    count_dic[word] = base_list.count(word)
sorted(count_dic.items(), key=lambda x: -x[1])
Improvement: `base_list = [t['base'] for s in doc for t in s]`.
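collections.Counter would also do the counting and the sorting in one go; a sketch (reusing doc from problem 30):

## 36 (alternative)
from collections import Counter

count_dic = Counter(t['base'] for s in doc for t in s)
count_dic.most_common()  # (word, frequency) pairs in descending order of frequency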
Display the 10 words with high frequency of appearance and their frequency of appearance in a graph (for example, a bar graph).
## 37
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline
n = 10
labels = [i[0] for i in sorted(count_dic.items(), key=lambda x: -x[1])[:n]]
score = [i[1] for i in sorted(count_dic.items(), key=lambda x: -x[1])[:n]]
plt.bar(labels, score)
plt.show()
I got stuck setting up fonts for Japanese display in matplotlib, then came across a handy package called japanize-matplotlib.
Draw a histogram of the frequency of occurrence of words (the horizontal axis represents the frequency of occurrence and the vertical axis represents the number of types of words that take the frequency of occurrence as a bar graph).
## 38
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline
all_score = [i[1] for i in sorted(count_dic.items(), key=lambda x: -x[1])]
plt.hist(all_score, range(10, 100));
Around here I got used to sorting lists of dictionary items.
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency on the vertical axis.
## 39
import math
log_idx = [math.log(i + 1) for i in range(len(count_dic.values()))]
log_all_score = [math.log(i[1]) for i in sorted(count_dic.items(), key=lambda x: -x[1])]
plt.scatter(log_idx, log_all_score);
I didn't know about this relationship, so the output was striking to see. I used math instead of numpy.
Is it okay to post the problems like this? If not, I'll take them down immediately. If you have a community such as a seminar, it would be good to assign 10 questions each week and review each other's answers as a group. I hope to write up everything through the last problem within the Advent Calendar~