It's the first day of the Advent Calendar, but I'm writing something that's easy to write, without any special sense of occasion.
Since I solved the 100 Language Processing Knocks, I'll write up my answers and impressions one by one (it's currently 20:30 on 11/30, so only the first part is ready).
- Environment
    - Dockerfile link (including irrelevant items)
- Ability
    - I had touched MeCab and gensim without really knowing anything
- How to solve
    - For the time being I solved everything on my own, and googled after every 10 questions to check anything I was unsure about
    - Because of that, some problems have two answers
- Excuses
    - Fixes for the wrong answers I noticed while compiling this article didn't make it in time
- Thanks
    - I am very grateful to those who taught themselves, went through the darkness, and published teaching materials like this
I had to reread my own code to write this article, so it turned into a pseudo code review.
--Mixed "
and'
--A mixture of r in row and l in line
Many improvements were found
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).
## 00
smt = "stressed"
ans = ""
for i in range(len(smt)):
    ans += smt[-i - 1]
print(ans)
A writing style I didn't know: ↓
## 00
smt = "stressed"
smt[::-1]
In other words, it's `list[start:stop:step]`.
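To convince myself of the notation, a quick check (my own toy example):

smt = "stressed"
print(smt[::-1])   # 'desserts': step -1 walks from the end to the beginning
print(smt[2:6:2])  # 'rs': start 2, stop 6, step 2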
Take out the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.
## 1
smt = "パタトクカシーー"
''.join([smt[i] for i in range(len(smt)) if i % 2 == 0])
I didn't know about slicing at this point yet, so I'll rewrite this one too: ↓
## 1
smt = "パタトクカシーー"
smt[::2]
Obtain the string "パタトクカシーー" by alternately connecting the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.
## 2
smt1 = "パトカー"
smt2 = "タクシー"
''.join([p + t for p, t in zip(smt1, smt2)])
This one feels a bit brute-force.
Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." Into words, and create a list of the number of characters (in the alphabet) of each word in order of appearance.
## 3
smt = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
[len(w) - w.count(',') - w.count('.') for w in smt.split(' ')]
I wanted to write it nicely using `isalpha()`, but I couldn't work out the double loop inside a comprehension, so I made this my answer for the time being.
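In hindsight, `isalpha()` does fit in a comprehension if `sum` over the characters stands in for the inner loop; a sketch:

## 3 (alternative)
smt = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
[sum(c.isalpha() for c in w) for w in smt.split(' ')]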
Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." Into words 1, 5, 6, 7, 8, 9, 15, 16, The 19th word is the first character, and the other words are the first two characters. Create.
## 4
smt = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
dic = {}
target_index = [1, 5, 6, 7, 8, 9, 15, 16, 19]
for i, w in enumerate(smt.split(' ')):
    if i + 1 in target_index:
        dic[i + 1] = w[0]
    else:
        dic[i + 1] = w[:2]
dic
Is it okay to hard-code the target indices? Is it okay to branch with an if? I was getting too paranoid about it, but I proceeded as is.
05 n-gram
Create a function that builds an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams of the sentence "I am an NLPer".
## 5
def get_n_gram(n, smt):
    words = smt.split(' ')
    return [smt[i:i+n] for i in range(len(smt) - n + 1)], [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

get_n_gram(3, "I am an NLPer")
I figured I could write it neatly with slices, but it may be better to separate the character version and the word version; a sketch follows.
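A sketch of the separated version I had in mind:

## 5 (alternative): separate character n-grams and word n-grams
def char_n_gram(n, smt):
    return [smt[i:i+n] for i in range(len(smt) - n + 1)]

def word_n_gram(n, smt):
    words = smt.split(' ')
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

print(char_n_gram(2, "I am an NLPer"))
print(word_n_gram(2, "I am an NLPer"))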
Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and compute the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.
## 6
smt1 = "paraparaparadise"
smt2 = "paragraph"
X = set()
for i in range(len(smt1) - 2 + 1):
    X.add(smt1[i:i+2])
Y = set()
for i in range(len(smt2) - 2 + 1):
    Y.add(smt2[i:i+2])
print(X | Y)
print(X & Y)
print(X - Y)
print('se' in X)
print('se' in Y)
I should have written it with a comprehension... I reconfirmed that a set removes duplicates, and when I want unique items from a list, converting it to a set once may be the way to go.
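A sketch of the comprehension version, reusing smt1 and smt2 from above (a set comprehension builds each set directly):

## 6 (alternative)
X = {smt1[i:i+2] for i in range(len(smt1) - 1)}
Y = {smt2[i:i+2] for i in range(len(smt2) - 1)}
print('se' in X, 'se' in Y)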
Implement a function that takes arguments x, y, z and returns the string "x時のyはz" ("y at x o'clock is z"). Furthermore, set x = 12, y = "気温" (temperature), z = 22.4, and check the execution result.
## 7
def get_template(x, y, z):
    return "{}時の{}は{}".format(x, y, z)

get_template(12, '気温', 22.4)
I could do this one because I use format() all the time, but I often forget how to specify positions with {0} and the like.
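For my own notes, positional indices let you reorder or reuse arguments (toy examples):

print("{1} at {0} is {2}".format(12, 'temperature', 22.4))  # temperature at 12 is 22.4
print("{0}-{0}".format('ab'))                               # ab-ab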
Implement a function cipher that converts each character of a given string according to the following specification: replace each lowercase letter with the character whose code is (219 - character code); output every other character as it is. Use this function to encrypt and decrypt English messages.
## 8
class Coder:
    def encode(self, smt):
        code = ""
        for i in range(len(smt)):
            if smt[i].isalpha() and smt[i].islower():
                code += chr(219 - ord(smt[i]))
            else:
                code += smt[i]
        return code

    def decode(self, code):
        smt = ""
        for i in range(len(code)):
            if code[i].isalpha() and code[i].islower():
                smt += chr(219 - ord(code[i]))
            else:
                smt += code[i]
        return smt

coder = Coder()
smt = "I couldn't believe that"
code = coder.encode(smt)
desmt = coder.decode(code)
print(smt)
print(code)
print(desmt)
My eyes were failing me: I misread cipher as coder until this very moment, and it was supposed to be a function, not a class. Also, I forget character codes no matter how many times I look them up, so I want to summarize them sometime.
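A sketch of the function version the problem presumably intended; since the transformation is its own inverse, one function both encrypts and decrypts:

## 8 (alternative)
def cipher(smt):
    # replace each lowercase letter c with chr(219 - ord(c)); pass everything else through
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in smt)

smt = "I couldn't believe that"
print(cipher(smt))
print(cipher(cipher(smt)))  # applying it twice restores the original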
09 Typoglycemia
Create a program that, for a text of words separated by spaces, randomly rearranges the order of the letters of each word while leaving the first and last letters in place. However, words of length 4 or less are not rearranged. Give an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.
## 9
import random
def feel_typoglycemia(smt):
    typogly = []
    for w in smt.split(' '):
        if len(w) <= 4:
            typogly.append(w)
        else:
            mid = list(w)[1:-1]
            random.shuffle(mid)
            typogly.append(w[0] + ''.join(mid) + w[-1])
    return ' '.join(typogly)

smt = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
feel_typoglycemia(smt)
I sliced out everything but the first and last characters, shuffled it, and stuck it back together. I don't remember why I named the function that; I wasn't planning to publish it, so I must have named it on a whim at the time.
I thought the chapter title meant learning the commands themselves, but the UNIX commands were just for checking the program's output.
hightemp.txt is a file storing records of the highest temperatures in Japan in tab-delimited format, with columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the programs' output.
Count the number of lines. Use the wc command for confirmation.
## 10
with open('./hightemp.txt') as f:
    print(len([r for r in f.read().split('\n') if r != '']))
## 10
cat hightemp.txt | wc -l
I think `r` stands for row, but after this it gets mixed with `l` for line.
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
## 11
with open('./hightemp.txt') as f:
    print([r.replace('\t', ' ') for r in f.read().split('\n') if r != ''])
## 11
cat hightemp.txt | sed "s/\t/\ /g"
## 11
cat hightemp.txt | tr "\t" "\ "
## 11
expand -t 1 hightemp.txt
I recognized sed as the thing I often use in vim; tr and expand were new to me.
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
## 12
with open('./hightemp.txt') as f:
    table = [r for r in f.read().split('\n') if r != '']
with open('col1.txt', mode='w') as f:
    for t in table:
        f.write(t.split('\t')[0] + '\n')
with open('col2.txt', mode='w') as f:
    for t in table:
        f.write(t.split('\t')[1] + '\n')
## 12
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " > col1.txt
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 2 -d " " > col2.txt
I did it the straightforward way, without really having a mental image of operating on columns.
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
## 13
with open('cols.txt', mode='w') as c:
    with open('col1.txt') as f:
        with open('col2.txt') as ff:
            r1 = f.readline()
            r2 = ff.readline()
            while r1 and r2:
                c.write(r1.replace('\n', '') + '\t' + r2)
                r1 = f.readline()
                r2 = ff.readline()
## 13
paste col1.txt col2.txt > cols.txt
cat cols.txt
There's an air of clumsiness oozing out of f and ff; a zip-based sketch follows. paste was new to me.
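In hindsight, zipping the two file objects avoids the readline bookkeeping entirely; a sketch:

## 13 (alternative)
with open('col1.txt') as f, open('col2.txt') as ff, open('cols.txt', mode='w') as c:
    for r1, r2 in zip(f, ff):
        c.write(r1.rstrip('\n') + '\t' + r2)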
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
## 14
n = 5
with open('./hightemp.txt') as f:
    lines = f.read()
for l in lines.split('\n')[:n]:
    print(l)

## 14
head -n 5 hightemp.txt
This is a clear wrong answer because I forgot the command-line-argument part; I'll add a version using sys.argv.
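A sketch of the sys.argv version, saved as a script (the file name knock14.py is made up) and run as python knock14.py 5:

## 14 (with command line argument)
import sys

n = int(sys.argv[1])  # N comes from the command line
with open('./hightemp.txt') as f:
    for l in f.read().split('\n')[:n]:
        print(l)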
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
## 15
n = 5
with open('./hightemp.txt') as f:
    lines = f.read()
for l in lines.split('\n')[-n:]:
    print(l)
## 15
tail -n 5 hightemp.txt
Similarly, I forgot the command-line-argument part, so it's a clear wrong answer; the same sys.argv pattern as the sketch above applies.
Receive a natural number N by means such as command line arguments, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
## 16
import math
with open('./hightemp.txt') as f:
    obj = f.read()
lines = [l for l in obj.split('\n')]
n = 3
ni = math.ceil(len(lines) / n)
for i in range(0, len(lines), ni):
    j = i + ni
    print(len(lines[i:j]))
## 16
split -n 5 hightemp.txt
Similarly a clear wrong answer for forgetting the command-line-argument part; a sys.argv version is sketched below.
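As above, a sketch with sys.argv (a made-up knock16.py, run as python knock16.py 3):

## 16 (with command line argument)
import math
import sys

n = int(sys.argv[1])
with open('./hightemp.txt') as f:
    lines = f.read().split('\n')
size = math.ceil(len(lines) / n)  # lines per chunk
for i in range(0, len(lines), size):
    print(len(lines[i:i + size]))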
Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands for confirmation.
## 17
with open('./hightemp.txt') as f:
    obj = f.read()
set(row.split('\t')[0] for row in obj.split('\n') if not row == '')
## 17
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " | sort | uniq
This was the first time I chained pipes like this, so I learned the joy of one-liners.
Sort the rows in descending order of the numbers in the third column (note: rearrange the rows without changing their contents). Use the sort command for confirmation (this problem's result does not have to match the command's exactly).
## 18
with open('./hightemp.txt') as f:
    obj = f.read()
rows = [row for row in obj.split('\n') if not row == '']
sorted(rows, key=lambda x: -1 * float(x.split('\t')[2]))
## 18
cat hightemp.txt | sed "s/\t/\ /g" | sort -r -k 3 -t " "
A cast to float was needed.
Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.
## 19
with open('./hightemp.txt') as f:
    obj = f.read()
rows = [row.split('\t')[0] for row in obj.split('\n') if not row == '']
c_dic = {}
for k in set(rows):
    c_dic[k] = rows.count(k)
sorted(c_dic.items(), key=lambda x: -x[1])
## 19
cat hightemp.txt | sed "s/\t/\ /g" | cut -f 1 -d " " | sort | uniq -c | sort -rn -k 3 -t " "
Whether to name it r or row is a point for reflection.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format: information for one article is stored per line in JSON format, where each line stores the article name under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON; the whole file is gzip-compressed. Create programs that perform the following processing.
I feel like I slipped through this chapter without using regular expressions much.
wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz
Since I was running in a Jupyter notebook, I ran this with a ! at the beginning.
Read the JSON file of Wikipedia articles and display the article text for "イギリス" (UK). For problems 21-29, run against the article text extracted here.
## 20
import json, gzip
with gzip.open('jawiki-country.json.gz', 'rt') as f:
    obj = json.loads(f.readline())
    while obj:
        try:
            obj = json.loads(f.readline())
            if obj['title'] == "イギリス":
                break
        except:
            obj = f.readline()
I didn't know about the gzip module at all, so that was a lesson.
Extract the line that declares the category name in the article.
## 21
for l in obj['text'].split('\n'):
    if 'Category' in l:
        print(l)
A stricter condition might be better; for example the sketch below.
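A sketch of a stricter condition that only matches lines actually starting with the category markup (reusing obj from problem 20):

## 21 (alternative)
import re
for l in obj['text'].split('\n'):
    if re.match(r'\[\[Category:', l):
        print(l)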
Extract the article category names (by name, not line by line).
## 22
import re
head_pattern = r'\[\[Category:'
tail_pattern = r'\|?\*?\]\]'
for l in obj['text'].split('\n'):
    if 'Category' in l:
        l = re.sub(head_pattern, '', l)
        print(re.sub(tail_pattern, '', l))
I wrote this one by brute force.
Display the section names and their levels contained in the article (for example, 1 if "== section name ==").
## 23
pattern = '=='
for l in obj['text'].split('\n'):
    if pattern in l:
        pat_by_sec = ''.join([r'=' for i in range(int(l.count('=') / 2))])
        sec = len(pat_by_sec) - 1
        tab = ''.join(['\t' for i in range(sec - 1)])
        print('{}{}. {}'.format(tab, sec, l.replace('=', '')))
It's a little roundabout because I wanted to indent with tabs when displaying; a regex sketch follows.
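A less roundabout sketch using a regex that captures the run of '=' and the section name together (reusing obj from problem 20):

## 23 (alternative)
import re
for l in obj['text'].split('\n'):
    m = re.match(r'(={2,})\s*(.+?)\s*\1$', l)
    if m:
        sec = len(m.group(1)) - 1  # '==' is level 1
        print('{}{}. {}'.format('\t' * (sec - 1), sec, m.group(2)))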
Extract all media files referenced from the article.
## 24
for l in obj['text'].split('\n'):
    if 'File' in l:
        print(l.split(':')[1].split('|')[0])
A stricter if statement may be better here as well; for instance the sketch below.
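A sketch of a stricter version with a regex over the whole text, assuming references look like [[File:...|...]] or [[ファイル:...|...]] (reusing obj from problem 20):

## 24 (alternative)
import re
for m in re.finditer(r'\[\[(?:File|ファイル):([^|\]]+)', obj['text']):
    print(m.group(1))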
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
## 25
import re
pattern = r' = '
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        basic_info[l.split(' = ')[0].replace('|', '')] = l.split(' = ')[1]
basic_info
Chaining methods this much when processing text doesn't seem great; a sketch of an alternative follows.
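What I mean, as a sketch: split once, name the pieces, then build the entry (reusing obj from problem 20):

## 25 (alternative)
basic_info = {}
for l in obj['text'].split('\n'):
    if ' = ' in l:
        key, val = l.split(' = ', 1)
        basic_info[key.replace('|', '')] = val
basic_info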
In addition to the processing of 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, and strong emphasis, all of them) from the template values and convert them to plain text (reference: markup quick reference table).
## 26
import re
pattern = r' = '
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        basic_info[l.split(' = ')[0].replace('|', '')] = l.split(' = ')[1].replace('\'', '')
basic_info
I started to think that in text processing it's fine to proceed with hard-coding, without demanding generality.
In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).
## 27
import re
pattern = r' = '
med_link = r'\[|\]'
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        val = l.split(' = ')[1].replace('\'', '')
        val = re.sub(med_link, '', val)
        basic_info[l.split(' = ')[0].replace('|', '')] = val
basic_info
While watching the output, I was making corrections on an ad hoc basis
In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
## 28
import re
pattern = r' = '
med_link = r'\[|\]'
strong = r'\{|\}'
tag = r'\<+.*\>'
basic_info = {}
for l in obj['text'].split('\n'):
    if pattern in l:
        val = l.split(' = ')[1].replace('\'', '')
        val = re.sub(med_link, '', val)
        val = re.sub(strong, '', val)
        val = re.sub(tag, '', val)
        basic_info[l.split(' = ')[0].replace('|', '')] = val
basic_info
I gave up early because it said "as much as possible".
Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)
## 29
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:" + basic_info['国旗画像']
}
R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA["query"]["pages"]
for k, v in PAGES.items():
    for kk, vv in v.items():
        if kk == 'imageinfo':
            print(vv[0]['url'])
I hit the API with reference to the sample code.
Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.
wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt

import MeCab
t = MeCab.Tagger()
with open('./neko.txt') as f:
    text = f.read()
with open('./neko.txt.mecab', mode='w') as f:
    f.write(t.parse(text))
Until now I always ran the analysis as one continuous process without saving the results. Saving them like this seems better, so that was a lesson.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys surface (surface form), base (uninflected form), pos (part of speech), and pos1 (part-of-speech subcategory 1), and express one sentence as a list of such morpheme mappings. Use this program for the remaining problems in Chapter 4.
## 30
doc = []
with open('./neko.txt.mecab') as f:
    token_list = []
    token = f.readline()
    while 'EOS' not in token:
        dic = {}
        dic['surface'] = token.split('\t')[0]
        dic['base'] = token.split('\t')[1].split(',')[-3]
        dic['pos'] = token.split('\t')[1].split(',')[0]
        dic['pos1'] = token.split('\t')[1].split(',')[1]
        token = f.readline()
        if dic['surface'] == '。':
            doc.append(token_list)
            token_list = []
            continue
        token_list.append(dic)
It may be better to store the return value of `token.split('\t')` once; what I mean is sketched below.
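A sketch against a sample MeCab output line (IPADIC format assumed; the sample line is made up):

# split once, reuse the pieces
token = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'
surface, feature = token.split('\t')
fields = feature.split(',')
dic = {'surface': surface, 'pos': fields[0], 'pos1': fields[1], 'base': fields[-3]}
print(dic)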
Extract all the surface forms of the verb.
## 31
for s in doc:
    for t in s:
        if t['pos'] == '動詞':
            print(t['surface'])
These days I would definitely write `[t['surface'] for t in s if t['pos'] == '動詞']`.
Extract all the original forms of the verb.
## 32
for s in doc:
    for t in s:
        if t['pos'] == '動詞':
            print(t['base'])
Similarly, `[t['base'] for t in s if t['pos'] == '動詞']`.
Extract all nouns with sa-hen connection (サ変接続).
## 33
for s in doc:
    for t in s:
        if t['pos1'] == 'サ変接続':
            print(t['base'])
Similarly, `[t['base'] for t in s if t['pos1'] == 'サ変接続']`.
Extract noun phrases in which two nouns are connected by "の".
## 34
for s in doc:
    for i, t in enumerate(s):
        if t['surface'] == 'の' and i + 1 != len(s):
            if s[i - 1]['pos'] == '名詞' and s[i + 1]['pos'] == '名詞':
                print(s[i - 1]['surface'] + t['base'] + s[i + 1]['surface'])
I guard the index against running off the end, but I also assume no sentence starts with the morpheme "の". That assumption is probably not good; a guarded sketch follows.
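A sketch that guards both ends of the index instead of relying on that assumption (reusing doc from problem 30):

## 34 (alternative)
for s in doc:
    for i in range(1, len(s) - 1):
        if s[i]['surface'] == 'の' and s[i - 1]['pos'] == '名詞' and s[i + 1]['pos'] == '名詞':
            print(s[i - 1]['surface'] + s[i]['surface'] + s[i + 1]['surface'])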
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
## 35
max_list = []
tmp = ""
max_len = len(tmp)
for s in doc:
    for i, t in enumerate(s):
        if t['pos'] == '名詞':
            tmp += t['surface']
        else:
            if len(tmp) == max_len:
                max_list.append(tmp)
            elif len(tmp) > max_len:
                max_list = []
                max_list.append(tmp)
                max_len = len(tmp)
            tmp = ''
print(len(max_list[0]))
print(max_list)
The longest was 30 characters, a run of English words.
Find the words that appear in the sentence and their frequency of appearance, and arrange them in descending order of frequency of appearance.
## 36
base_list = []
count_dic = {}
for s in doc:
    for t in s:
        base_list.append(t['base'])
for word in set(base_list):
    count_dic[word] = base_list.count(word)
sorted(count_dic.items(), key=lambda x: -x[1])
Improvement: `base_list = [t['base'] for s in doc for t in s]`.
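collections.Counter would also do the counting and the sorting in one go; a sketch (reusing doc from problem 30):

## 36 (alternative)
from collections import Counter

count_dic = Counter(t['base'] for s in doc for t in s)
count_dic.most_common()  # (word, frequency) pairs in descending order of frequency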
Display the 10 words with high frequency of appearance and their frequency of appearance in a graph (for example, a bar graph).
## 37
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline
n = 10
labels = [i[0] for i in sorted(count_dic.items(), key=lambda x: -x[1])[:n]]
score = [i[1] for i in sorted(count_dic.items(), key=lambda x: -x[1])[:n]]
plt.bar(labels, score)
plt.show()
I got stuck setting up fonts for Japanese display in matplotlib, then came across a handy package called japanize-matplotlib.
Draw a histogram of the frequency of occurrence of words (the horizontal axis represents the frequency of occurrence and the vertical axis represents the number of types of words that take the frequency of occurrence as a bar graph).
## 38
import matplotlib.pyplot as plt
import japanize_matplotlib
%matplotlib inline
all_score = [i[1] for i in sorted(count_dic.items(), key=lambda x: -x[1])]
plt.hist(all_score, range(10, 100));
Around here I got used to sorting lists of dictionary items.
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency on the vertical axis.
## 39
import math
log_idx = [math.log(i + 1) for i in range(len(count_dic.values()))]
log_all_score = [math.log(i[1]) for i in sorted(count_dic.items(), key=lambda x: -x[1])]
plt.scatter(log_idx, log_all_score);
I didn't know about this relationship, so the output was striking to see. I used math instead of numpy.
Is it okay to post the problems like this? If not, I'll take them down immediately. If you have a community such as a seminar, it would be good to assign 10 questions each week and review each other's answers as a group. I hope to write up everything through the last problem within the Advent Calendar~