Since I moved from a hard-core science faculty into a language processing laboratory in graduate school, I am a beginner at string processing. If you find any mistakes, please point them out. Feel free to copy and use the code as it is.
It also covers things like how to install MeCab on Google Colaboratory and how to display Japanese in Matplotlib. Updated from time to time.
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning)
qiita.rb
target = 'stressed'

def reverse(target):
    length_target = len(target)
    w = target
    rve = ''
    for i in range(length_target):
        rve += w[length_target - i - 1]
    return rve

print(reverse(target))
print(target[::-1])
I tried it in two ways. I struggled with Python at first because I had only ever touched numerical data processing. If I have free time, I'd like to review and keep pushing further!
qiita.rb
# Take out the 1st, 3rd, 5th, and 7th characters of the string "Patatokukashii" and get the concatenated string.
w = 'Patatoku Kashii'
w_even = ""
w_odd = ''
j = 0
for i in w:
    if (j % 2 == 0):
        w_even += w[j]
    if (j % 2 == 1):
        w_odd += w[j]
    j += 1
print(w_even, w_odd)
qiita.rb
# Get the string "Patatokukashii" by alternately concatenating the characters of "Police car" + "Taxi" from the beginning.
w = ''
for i in range(max(len(w_even), len(w_odd))):
    w += w_even[i]
    w += w_odd[i]
w
qiita.rb
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
# Split the sentence into words. Take the first letter of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words,
# and the first two letters of every other word, and create an associative array (dictionary or map type)
# from the extracted string to the position of the word (how many words from the beginning).
def Permalink(s):
    s_list = s.split()
    num = len(s_list)
    w = {}
    for i in range(num):
        if (i == 1-1 or i == 5-1 or i == 6-1 or i == 7-1 or i == 8-1 or i == 9-1 or i == 15-1 or i == 16-1 or i == 19-1):
            # w.append(s_list[i][0])
            w[i] = (s_list[i][0])
        else:
            w[i] = (s_list[i][0] + s_list[i][1])
    print(w)

Permalink(s)
qiita.rb
def n_gram_word(sentense, n):
    sen = sentense.split()
    w = {}
    w = set(w)
    num = len(sen)
    for i in range(num - n + 1):
        w0 = ''
        for j in range(n):
            w0 += sen[i + j]
            w0 += ' '
        w.add(w0)
    return w

def n_gram_moji(sentense, n):
    sentense = sentense.replace(' ', '')
    sen = list(sentense)
    w = {}
    w = set(w)
    num = len(sen)
    for i in range(num - n + 1):
        w0 = ''
        for j in range(n):
            w0 += sen[i + j]
        w.add(w0)
    return w

s = 'I am an NLPer'
n_gram_word(s, 2)
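The character bi-grams of the same sentence can be checked with the second function as well:
qiita.rb
# Character bi-grams of "I am an NLPer"
print(n_gram_moji(s, 2))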
Find the set of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and Y.
qiita.rb
s1 = "paraparaparadise"
s2 = "paragraph"
w1=n_gram_moji(s1,2)
w2=n_gram_moji(s2,2)
print(type(w1))
print('The union is', w1 | w2)
print('The intersection is', w1 & w2)
print('The difference set is', w1 - w2)
W={'se'}
print(W<=w1)
print(W<=w2)
Implement a function that takes arguments x, y, and z and returns the string "y at x is z". Set x = 12, y = "temperature", z = 22.4 and check the execution result.
qiita.rb
def temple(x, y, z):
    print(y, 'at', x, 'is', z)

temple(12, 'temperature', 22.4)
Implement a function cipher that converts each character of a given string according to the following specification: replace each lowercase letter c with the character whose code is (219 - character code of c), and output all other characters as they are. Use this function to encrypt and decrypt English messages.
qiita.rb
def chipher(s):
    # s = s.lower()
    # s = s.replace(' ', '')
    c = ''
    for w in s:
        # only lowercase letters are converted; everything else is passed through unchanged
        a = chr(219 - ord(w)) if w.islower() else w
        c += a
    return c

def dechipher(s):
    c = ''
    for w in s:
        a = chr(219 - ord(w)) if w.islower() else w
        c += a
    return c
s='Neuro Linguistic Programming has two main definitions: while it began as a set of techniques to understand and codify the underlying elements of genius by modeling the conscious and unconscious behaviors of brilliant communicators and therapists, over the years, it has evolved into a set of frameworks, processes and protocols (the results of modeling) that qualified NLP Practitioners currently use to help evoke effective behavioral changes in clients.'
print(s)
print(chipher(s))
print(dechipher(chipher(s)))
qiita.rb
import random

def wordc(s):
    s = s.split()
    word = ''
    for w in s:
        if len(w) > 4:
            a = list(w)
            a.pop(0)
            num = len(w)
            a.pop(num - 2)
            aa = list(w)
            random.shuffle(a)
            a.insert(0, aa[0])
            a.insert(len(w), aa[len(w) - 1])
            w = ''.join(a)
        word += w
        word += ' '
    return word
#s='volcano'
s='Neuro Linguistic Programming has two main definitions: while it began as a set of techniques to understand and codify the underlying elements of genius by modeling the conscious and unconscious behaviors of brilliant communicators and therapists, over the years, it has evolved into a set of frameworks, processes and protocols (the results of modeling) that qualified NLP Practitioners currently use to help evoke effective behavioral changes in clients.'
print(wordc(s))
From Chapter 3 onward, the only really important point is how to install MeCab, so let's get to that first.
I wrote the code myself, but with reference to other people's work:
Answer and impression of 100 language processing knocks-Part 1
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.
One article information per line is stored in JSON format
Each line stores one article as a dictionary object, with the article name under the "title" key and the article body under the "text" key, and that object is written out in JSON format.
The entire file is gzipped
Create a program that performs the following processing.
Read the JSON file of the Wikipedia article and display the article text about "UK". In problems 21-29, execute on the article text extracted here.
qiita.rb
!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz
qiita.rb
## 20
import json, gzip
with gzip.open('jawiki-country.json.gz', 'rt') as f:
    country = json.loads(f.readline())
    while country:
        country = json.loads(f.readline())
        if country['title'] == 'イギリス':  # the article title is stored in Japanese ("UK")
            break
print(country)
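The loop above skips the first article and will fail at end-of-file if the title is never matched, so a slightly more defensive variant is sketched below; it assumes the article title is stored in the original Japanese as 'イギリス' (UK).
qiita.rb
# Sketch: scan every line of the gzipped JSON-lines file and keep the UK article text.
import json, gzip

uk_text = None
with gzip.open('jawiki-country.json.gz', 'rt') as f:
    for line in f:
        article = json.loads(line)
        if article['title'] == 'イギリス':  # assumed title of the UK article in the Japanese dump
            uk_text = article['text']
            break
print(uk_text[:200] if uk_text else 'not found')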
qiita.rb
puts 'code with syntax'
The data disappeared ...
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.mecab. Use this file to implement a program that addresses the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
qiita.rb
!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt
# 'tagger' was not defined in the original snippet; a MeCab tagger is assumed here.
# ChaSen output ('-Ochasen') is assumed because the parsing code below splits each line on tabs.
import MeCab
tagger = MeCab.Tagger('-Ochasen')
with open('./neko.txt') as f:
    t = tagger.parse(f.read())
with open('./neko.txt.mecab', mode='w') as ff:
    ff.write(t)
qiita.rb
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
MeCab can now be used. Is !apt the generic UNIX-style package tool and !aptitude the Ubuntu one? I'm not sure about this part; I referred to other sites such as "Enable MeCab in Colaboratory".
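As a quick sanity check that the binding works after the install above (the sample sentence is arbitrary):
qiita.rb
# Minimal check of the MeCab Python binding installed above.
import MeCab

tagger = MeCab.Tagger()  # uses the ipadic dictionary installed above
print(tagger.parse('すもももももももものうち'))  # one morpheme per line, ending with EOS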
Implement a program that reads the morphological analysis result (neko.txt.mecab). Each morpheme should be stored as a mapping type with the surface form (surface), base form (base), part of speech (pos), and part of speech subclassification 1 (pos1) as keys, and each sentence should be expressed as a list of such morpheme mappings. For the rest of the problems in Chapter 4, use the program created here.
qiita.rb
with open('./neko.txt.mecab') as f:
    worddict = dict()
    surface = list()
    pronaunce = list()
    base = list()
    pos = list()
    pos1 = list()
    line = f.readline()
    while line:
        # skip sentence delimiters and blank lines instead of stopping at the first EOS
        if 'EOS' in line or line == '\n':
            line = f.readline()
            continue
        t = line.split('\t')
        surface.append(t[0])
        pronaunce.append(t[1])
        base.append(t[2])
        pos.append(t[3])
        pos1.append(t[4])
        line = f.readline()
    worddict['surface'] = surface
    worddict['pronaunce'] = pronaunce
    worddict['base'] = base
    worddict['pos'] = pos
    worddict['pos1'] = pos1
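The code above keeps the fields as parallel column lists rather than the structure the problem describes (one mapping per morpheme, one list per sentence). A sketch closer to that shape, assuming the same ChaSen-style tab-separated output where the fourth field is the hyphen-joined part of speech, might look like this:
qiita.rb
# Sketch: a list of sentences, each a list of morpheme dicts with
# 'surface', 'base', 'pos', 'pos1' keys (the shape problem 30 asks for).
sentences = []
sentence = []
with open('./neko.txt.mecab') as f:
    for line in f:
        if line.startswith('EOS'):
            if sentence:
                sentences.append(sentence)
            sentence = []
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue
        pos_parts = fields[3].split('-')  # e.g. '名詞-一般' -> ['名詞', '一般']
        sentence.append({
            'surface': fields[0],
            'base': fields[2],
            'pos': pos_parts[0],
            'pos1': pos_parts[1] if len(pos_parts) > 1 else '',
        })
print(sentences[0][:5])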
The data for problems 30 to 36 disappeared, which was really discouraging. They're easy, so I'll skip them!
Extract all the surface forms of the verb.
Extract all the original forms of the verb.
Extract a noun phrase in which two nouns are connected by "no".
Extract the longest concatenation of nouns (nouns that appear consecutively). A lot of the data has disappeared... I'll do these if I have time.
Find the words that appear in the sentence and their frequency of appearance, and arrange them in descending order of frequency of appearance.
qiita.rb
tango = list()
for i in range(len(worddict['pos'])):
t = worddict['pos'][i]
if t[:2] == 'noun':
tango.append(worddict['surface'][i])
word_d = dict()
for t in set(tango):
word_d[t] = tango.count(t)
wordsort = sorted(word_d.items(), key = lambda x:-x[1])
print (wordsort)
#[('of', 1611), ('Thing', 1207), ('もof', 981), ('You', 973), ('master', 932), ('Hmm', 704), ('Yo', 697), ('Man', 602), ('one', 554), ('what', 539), ('I', 481), ('this', 414),
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart). First, start by installing a Japanese font; otherwise the labels will show up as countless "tofu" (blank boxes) when graphed.
qiita.rb
!apt-get -y install fonts-ipafont-gothic
# Delete the matplotlib font cache.
!rm /root/.cache/matplotlib/fontlist-v310.json  # the cache file to be erased
Erase this fontlist-v310.json. However, the version suffix seems to differ for each matplotlib version, and it was a different value on other sites, so each person needs to check it for themselves.
qiita.rb
!ls /root/.cache/matplotlib/
# The cache directory itself can be found with the following code.
import matplotlib
matplotlib.get_cachedir()
qiita.rb
wordhead = dict(wordsort[:10])
print(type(wordhead))
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font='IPAGothic')
plt.bar(wordhead.keys(),wordhead.values())
plt.show()
I was quite impressed when this worked: Japanese labels really do display. If only I had known this back in my undergraduate days...
Display the 10 words that frequently co-occur with "cat" (high co-occurrence frequency) and their frequencies in a graph (for example, a bar chart).
qiita.rb
cooc = dict()
coocdict = dict()
cooclist = list()
for i in range(len(worddict['surface'])):
    t = worddict['pos'][i]
    # exclude particles ('助詞'), symbols ('記号') and auxiliary verbs ('助動詞')
    if (t[:2] != '助詞' and t[:2] != '記号' and t[:3] != '助動詞'):
        cooclist.append(worddict['surface'][i])
for t in set(cooclist):
    coocdict[t] = 0
# print(cooclist)
for i in range(len(cooclist)):
    if cooclist[i] == '猫':  # '猫' = cat
        coocdict[cooclist[i-1]] = coocdict[cooclist[i-1]] + 1
        coocdict[cooclist[i+1]] = coocdict[cooclist[i+1]] + 1
coocsort = sorted(coocdict.items(), key=lambda x: -x[1])
coochead = dict(coocsort[:10])
plt.bar(coochead.keys(), coochead.values())
Dirty code; there are definitely better ways to do this. Even though I am in a language processing laboratory, I didn't know the term "co-occurrence", so it became a bottleneck. It would be interesting to capture this kind of nuance with word2vec or similar distributed representations.
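As a rough sketch of the word2vec idea mentioned above, one could train embeddings on the MeCab-tokenized novel and look at which words end up distributionally close to '猫' (cat); the parameter names below assume gensim 4.x and the settings are arbitrary:
qiita.rb
# Sketch: train word2vec on the tokenized sentences of neko.txt.mecab
# and query the neighbours of '猫'. Settings here are arbitrary choices.
from gensim.models import Word2Vec

sentences, current = [], []
with open('./neko.txt.mecab') as f:
    for line in f:
        if line.startswith('EOS'):
            if current:
                sentences.append(current)
            current = []
        elif '\t' in line:
            current.append(line.split('\t')[0])  # surface form

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
print(model.wv.most_similar('猫'))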
Plot a log-log graph with the frequency rank of the words on the horizontal axis and their frequency of occurrence on the vertical axis.
qiita.rb
ziplow = dict()
for t in set(worddict['base']):
    ziplow[t] = worddict['base'].count(t)
import math
zipsort = sorted(ziplow.items(), key=lambda x: -x[1])
zipsort[:100]
zipplot = list()
logi = list()
for i in range(len(set(worddict['base']))):
    logi.append(math.log10(i + 1))
    zipplot.append(math.log10(zipsort[i][1]))
print(zipplot[:][0])
plt.scatter(logi, zipplot)
I don't think this came up in the "Natural Language Processing" class I sat in on without permission as an undergraduate. Given that a certain university leads the creation of these exercises, it's reassuring that my figure matches the one on their web page (when base forms are extracted).
First, install CaboCha.
qiita.rb
#Install Mecab
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
qiita.rb
import os
# Download the specified file
filename_crfpp = 'crfpp.tar.gz'
!wget "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ" -O $filename_crfpp
# Extract the archive and move into the extracted directory
!tar zxvf $filename_crfpp
# Move into CRF++-0.58, which now exists in the current directory (check with !ls ./)
%cd CRF++-0.58
# Run the script file named configure inside it (this is not a system command)
!./configure
# configure checks whether the build environment is suitable; if it passes, a Makefile is generated, so run make below without options
!make
!make install
%cd ..
os.environ['LD_LIBRARY_PATH'] += ':/usr/local/lib'
qiita.rb
FILE_ID = "0B4y35FiV1wh7SDd1Q1dUQkZQaUU"
FILE_NAME = "cabocha.tar.bz2"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
!tar -xvf cabocha.tar.bz2
%cd cabocha-0.69
!./configure --with-mecab-config=`which mecab-config` --with-charset=UTF8
!make
!make check
!make install
%cd ..
!cabocha --version
Commands for using CaboCha from Python:
qiita.rb
%cd cabocha-0.69/python
!python setup.py build_ext
!python setup.py install
!ldconfig
%cd ../..
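As a quick check that the Python binding works (the sample sentence is arbitrary; the lattice format is the same one used when writing neko.txt.cabocha below):
qiita.rb
# Minimal check of the CaboCha Python binding built above.
import CaboCha

parser = CaboCha.Parser()
tree = parser.parse('吾輩は猫である')
print(tree.toString(CaboCha.FORMAT_LATTICE))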
Use CaboCha to parse the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Use this file to implement a program that addresses the following questions.
qiita.rb
!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt
import CaboCha
c = CaboCha.Parser()
with open('neko.txt') as f:
    with open('neko.txt.cabocha', mode='w') as ff:
        line = f.readline()
        while line:
            ff.write(c.parse(line).toString(CaboCha.FORMAT_LATTICE))
            line = f.readline()
Implement a class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part of speech subclassification 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), express each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.
In addition to 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of index numbers of source clauses (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, express each sentence as a list of Chunk objects, and display the strings of the clauses of the eighth sentence together with their destinations. For the rest of the problems in Chapter 5, use the program created here.
I did 40 and 41 in one go, because Chunk inherits from Morph.
qiita.rb
class Morph:
    def __init__(self, line):
        line = line.replace('\t', ',').split(',')
        self.surface = line[0]
        # "Nya-nya" and the like have no usable base form, so the surface form is reused as the base
        self.base = line[0]
        self.pos = line[1]
        self.pos1 = line[2]

    def out_morph(self):
        print(type(self.surface))

    def listmaker(self):
        t = [self.surface, self.base, self.pos, self.pos1]
        return t

class Chunk(Morph):
    def __init__(self, line):
        l_sp = line[0].split(' ')
        self.srcs = l_sp[1]
        self.dst = l_sp[2]
        m = []
        for i in range(len(line)):
            if i != 0:
                m.append(Morph(line[i]).listmaker())
        self.morph = m

    def __str__(self):
        c = [self.srcs, self.dst, self.morph]
        return c

with open('neko.txt.cabocha') as f:
    text = f.read()
t = [r for r in text.split('EOS') if r != '\n']
for text_EOS in t:
    line_chunk = list()
    a = 0
    b = 0
    num = 2
    t = [r for r in text_EOS.split('\n') if r != '']
    for t in text_EOS.split('\n'):
        if (a == 1):
            num = 3
        if t != '':
            if a == len(text_EOS.split('\n')) - num:
                line_chunk.append(line)
            elif (t[0] == '*'):
                # a chunk is complete, so store it
                if line_chunk != '':
                    line_chunk.append(line)
                line = list()
                line.append(t)
            elif (t[0] != '*'):
                line.append(t)
        a = a + 1
        b = b + 1
    chunk = [Chunk(c).__str__() for c in line_chunk]
    print(chunk)
Afterwards, I converted the result into a plain list format for easier analysis. Each chunk becomes [srcs, dst, [chunk]], where the [chunk] part is the list of words [surface, base, pos, pos1] that make up the chunk. So one chunk looks like [srcs, dst, [[surface, base, pos, pos1], [surface, base, pos, pos1], ...]], for example as follows.
qiita.rb
[['1', '-1D', [['Thank you', 'Thank you', 'adjective', 'Independence']]]]
[['0', '-1D', [['one', 'one', 'noun', 'number']]], ['0', '2D', [['\u3000', '\u3000', 'symbol', 'Blank']]], ['1', '2D', [['I', 'I', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['2', '-1D', [['Cat', 'Cat', 'noun', 'one般'], ['so', 'so', 'Auxiliary verb', '*'], ['is there', 'is there', 'Auxiliary verb', '*']]]]
[['2', '-1D', [['Cat', 'Cat', 'noun', 'General'], ['so', 'so', 'Auxiliary verb', '*'], ['is there', 'is there', 'Auxiliary verb', '*']]], ['0', '2D', [['name', 'name', 'noun', 'General'], ['Is', 'Is', 'Particle', '係Particle']]], ['1', '2D', [['yet', 'yet', 'adverb', 'Particle類接続']]], ['2', '-1D', [['No', 'No', 'adjective', 'Independence']]]]
[['2', '-1D', [['No', 'No', 'adjective', 'Independence']]], ['0', '1D', [['\u3000', '\u3000', 'symbol', 'Blank'], ['Where', 'Where', 'noun', '代noun'], ['so', 'so', 'Particle', '格Particle']]], ['1', '4D', [['Born', 'Born', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*'], ['Or', 'Or', 'Particle', '副Particle/並立Particle/終Particle']]], ['2', '4D', [['Tonto', 'Tonto', 'adverb', 'General']]], ['3', '4D', [['Register', 'Register', 'noun', 'Change connection'], ['But', 'But', 'Particle', '格Particle']]], ['4', '-1D', [['つOr', 'つOr', 'verb', 'Independence'], ['Nu', 'Nu', '助verb', '*']]]]
[['4', '-1D', [['Tsuka', 'Tsuka', 'verb', 'Independence'], ['Nu', 'Nu', '助verb', '*']]], ['0', '1D', [['what', 'what', 'noun', '代noun'], ['But', 'But', 'Particle', '副Particle']]], ['1', '3D', [['dim', 'dim', 'adjective', 'Independence']]], ['2', '3D', [['Damp', 'Damp', 'adverb', 'General'], ['Shi', 'Shi', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]], ['3', '5D', [['Place', 'Place', 'noun', '非Independence'], ['so', 'so', 'Particle', '格Particle']]], ['4', '5D', [['Meow meow', 'Meow meow', 'noun', 'General']]], ['5', '7D', [['Crying', 'Crying', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle']]], ['6', '7D', [['いTa事', 'いTa事', 'noun', 'General'], ['Only', 'Only', 'Particle', '副Particle'], ['Is', 'Is', 'Particle', '係Particle']]], ['7', '-1D', [['Memory', 'Memory', 'noun', 'Change connection'], ['Shi', 'Shi', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle'], ['Is', 'Is', 'verb', '非Independence']]]]
[['7', '-1D', [['Memory', 'Memory', 'noun', 'Change connection'], ['Shi', 'Shi', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle'], ['Is', 'Is', 'verb', '非Independence']]], ['0', '5D', [['I', 'I', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['1', '2D', [['here', 'here', 'noun', '代noun'], ['so', 'so', 'Particle', '格Particle']]], ['2', '3D', [['start', 'start', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle']]], ['3', '4D', [['Human', 'Human', 'noun', 'General'], ['That', 'That', 'Particle', '格Particle']]], ['4', '5D', [['thing', 'thing', 'noun', '非Independence'], ['To', 'To', 'Particle', '格Particle']]], ['5', '-1D', [['You see', 'You see', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]]]
[['5', '-1D', [['You see', 'You see', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]], ['0', '8D', [['Moreover', 'Moreover', 'conjunction', '*']]], ['1', '2D', [['after', 'after', 'noun', 'General'], ['so', 'so', 'Particle', '格Particle']]], ['2', '8D', [['listen', 'listen', 'verb', 'Independence'], ['When', 'When', 'Particle', '接続Particle']]], ['3', '8D', [['It', 'It', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['4', '5D', [['Student', 'Student', 'noun', 'General'], ['Whenいう', 'Whenいう', 'Particle', '格Particle']]], ['5', '8D', [['Human', 'Human', 'noun', 'General'], ['During ~', 'During ~', 'noun', 'suffix'], ['so', 'so', 'Particle', '格Particle']]], ['6', '7D', [['Ichiban', 'Ichiban', 'noun', 'Adverbs possible']]], ['7', '8D', [['Evil', 'Evil', 'noun', '形容verb語幹'], ['Nana', 'Nana', '助verb', '*']]], ['8', '-1D', [['Race', 'Race', 'noun', 'General'], ['so', 'so', '助verb', '*'], ['Ah', 'Ah', '助verb', '*'], ['Ta', 'Ta', '助verb', '*'], ['so', 'so', 'noun', 'Special'], ['Is', 'Is', '助verb', '*']]]]
It took me several hours to clear this. Well, I didn't know how to write classes, so I'm glad I did it. I still don't really see the significance of classes, though. Can't you usually just define a function inside a function in Python?
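For what it's worth, the main thing a class buys you over a nested function is that the parsed fields stay bundled with the object and can be passed around; a tiny (hypothetical) contrast:
qiita.rb
# A nested function can compute the fields, but they only live inside the call.
def parse_line(line):
    def fields(l):
        return l.replace('\t', ',').split(',')
    return fields(line)

# A class keeps the same fields attached to an object that can be passed around.
class SimpleMorph:
    def __init__(self, line):
        f = line.replace('\t', ',').split(',')
        self.surface, self.pos = f[0], f[1]

m = SimpleMorph('猫\t名詞,一般')
print(m.surface, m.pos)  # the parsed state travels with m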
Extract all pairs of the text of a source clause and the text of its destination clause in tab-delimited format. However, do not output symbols such as punctuation marks.
qiita.rb
for sentence in chunk_transed[1:100]:
    setu = []
    print('¥n')
    for chunk in sentence[1:]:
        surface = str()
        for se in chunk[2]:
            if (se[2] != '記号'):  # '記号' = symbol
                surface += se[0]
        if surface != '':
            setu.append([chunk[0], chunk[1], surface])
    for s in (setu):
        if (s[1] != '-1'):
            saki = s[1]
            for ss in setu:
                if ss[0] == saki:
                    print(s[2] + '\t\t' + ss[2])
#
¥n
I am a cat
¥n
No name
Not yet
¥n
Where were you born
Born or not
I don't get it
I have no idea
¥n
Anything dim
In a dim place
In a damp place
Crying at the place
Meow meow crying
I cry and remember
I remember only what I was
¥n
When a clause containing a noun relates to a clause containing a verb, extract the pair in tab-delimited format. However, do not output symbols such as punctuation marks.
qiita.rb
for sentence in chunk_transed[1:100]:
    setu = []
    print('¥n')
    for chunk in sentence[1:]:
        surface = str()
        hanteiki_meisi = 0
        hanteiki_dousi = 0
        for se in chunk[2]:
            if (se[2] != '記号'):  # '記号' = symbol
                surface += se[0]
                if se[2] == '名詞':  # '名詞' = noun
                    hanteiki_meisi = 1  # a noun exists in this clause
                if se[2] == '動詞':  # '動詞' = verb
                    hanteiki_dousi = 1  # a verb exists in this clause
        if surface != '':
            setu.append([chunk[0], chunk[1], surface, hanteiki_meisi, hanteiki_dousi])
    for s in (setu):
        if (s[1] != '-1') and (s[3] == 1):
            saki = s[1]
            for ss in setu:
                if (ss[0] == saki) and ss[4] == 1:
                    print(s[2] + '\t\t' + ss[2])
Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.
qiita.rb
# coding: utf-8
import numpy as np
from PIL import Image
import pydot_ng as pydot
import matplotlib.pyplot as plt

# chunk_sentence is a tuple of (source clause, destination clause) edges
def graph_maker(chunk_sentence):
    graph = pydot.graph_from_edges(chunk_sentence, directed=True)
    graph.write_png("./result.png")
    # load the image
    im = Image.open("./result.png")
    # convert the image to an array
    im_list = np.asarray(im)
    # paste it into matplotlib
    plt.imshow(im_list)
    # display
    plt.show()

def grapf_maker_dot(chunk_sentence):
    graph = pydot.Dot(graph_type='digraph')
qiita.rb
all_edge = []
for sentence in chunk_transed[1:100]:
    setu = []
    for chunk in sentence[1:]:
        surface = str()
        for se in chunk[2]:
            if (se[2] != '記号'):  # '記号' = symbol
                surface += se[0]
        if surface != '':
            setu.append([chunk[0], chunk[1], surface])
    # setu now holds the clauses of this sentence
    all_edge_sentence = []
    for s in (setu):
        if (s[1] != '-1'):
            saki = s[1]
            for ss in setu:
                if ss[0] == saki:
                    edge_sentense = ((s[2], ss[2]))
                    all_edge_sentence.append(edge_sentense)
    all_edge.append(all_edge_sentence)
graph_maker(tuple(all_edge[22]))
I would like to treat the sentences used this time as a corpus and investigate the cases that Japanese predicates can take. Think of a verb as a predicate and the particles of the clauses related to the verb as cases, and output each predicate and its cases in tab-delimited format. However, make sure the output meets the following specifications.
Consider the example sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, "begin" and "see"; the clause related to "begin" is analyzed as "here", and the clauses related to "see" are analyzed as "I" and "thing", so the program should produce output like the following.
To see
Save the output of this program to a file and check the following items using UNIX commands.
Combinations of predicates and case patterns that frequently appear in the corpus
Case patterns of the verbs "do", "see", and "give" (arranged in order of frequency of appearance in the corpus)
qiita.rb
kaku = dict()
for sentence in chunk_transed[1:]:
    setu = []
    for chunk in sentence[1:]:
        josi = ''
        dousi = ''
        hanteiki_meisi = 0
        hanteiki_dousi = 0
        for se in chunk[2]:
            if (se[2] != '記号'):  # '記号' = symbol
                # word += se[0]
                if se[2] == '助詞':  # '助詞' = particle
                    hanteiki_meisi = 1  # a particle exists in this clause
                    josi = se[0]
                if se[2] == '動詞':  # '動詞' = verb
                    hanteiki_dousi = 1  # a verb exists in this clause
                    dousi = se[1]
        if surface != '':  # note: 'surface' is left over from the previous cell, so this check is effectively always true
            setu.append([chunk[0], chunk[1], josi, dousi, hanteiki_meisi, hanteiki_dousi])
    for s in (setu):
        if (s[1] != '-1') and (s[4] == 1):
            saki = s[1]
            for ss in setu:
                if (ss[0] == saki) and ss[5] == 1:
                    if ss[3] in set(kaku.keys()):
                        kaku[ss[3]].append(s[2])
                    else:
                        kaku[ss[3]] = [s[2]]
text = str()
for keys, values in kaku.items():
    values_sort = sorted(set(values))
    josi = keys + '\t'
    for v in values_sort:
        josi += v + ' '
    josi = josi + '\n'
    text += josi
    print(josi)
with open('./kaku.txt', mode='w') as f:
    f.write(text)
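As a sketch of the check described above, done here in Python on the in-memory kaku dict rather than with UNIX commands (the shell route would be something like sort piped into uniq -c and sort -nr on a per-occurrence file):
qiita.rb
# Count predicate/particle combinations collected in `kaku` and look at the
# patterns of a few specific verbs ('する', '見る', '与える' = do, see, give).
from collections import Counter

pair_counts = Counter()
for verb, particles in kaku.items():
    for p in particles:
        pair_counts[(verb, p)] += 1

# Most frequent predicate / particle combinations
print(pair_counts.most_common(10))

# Particle patterns of the three verbs, in order of frequency
for verb in ['する', '見る', '与える']:
    print(verb, Counter(kaku.get(verb, [])).most_common(5))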
I would like to pay attention only to cases where the を (wo) case of a verb contains a sa-hen connection noun. Modify the program of 46 to meet the following specifications.
When I reply to the letter, my husband
Save the output of this program to a file and check the following items using UNIX commands.
qiita.rb
puts 'code with syntax'
For every clause in the sentence that contains a noun, extract the path from that clause to the root of the syntax tree. The path on the syntax tree shall satisfy the following specifications.
Each clause is represented by its (surface) morpheme sequence. From the start clause to the end clause of the path, concatenate the expression of each clause with "->". From the sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha), the following output should be obtained.
I am->saw
here->Start with->Human->Things->saw
Human->Things->saw
Things->saw
qiita.rb
i = 0
for sentence in chunk_transed[1:10]:
    i = i + 1
    setu = []
    print(i, 'Line')
    for chunk in sentence[1:]:
        surface = str()
        for se in chunk[2]:
            if (se[2] != '記号'):  # '記号' = symbol
                surface += se[0]
        if surface != '':
            setu.append([chunk[0], chunk[1], surface])
    # setu = (source index, destination index, expression)
    koubunki = ''
    for s in (setu):
        if (s[1] != '-1'):
            saki = s[1]
            koubunki = s[2]
            for ss in setu:
                if ss[0] == saki:
                    koubunki += ' --> ' + ss[2]
                    saki = ss[1]
            print(koubunki)
The first line
I am-->Be a cat
2nd line
Name is-->No
yet-->No
3rd line
where-->Was born-->Do not use
Was born-->Do not use
Tonto-->Do not use
I have a clue-->Do not use
Extract the shortest dependency path that connects all noun phrase pairs in a sentence. However, when the phrase number of the noun phrase pair is i and j (i <j), the dependency path shall satisfy the following specifications.
In addition, the shape of the dependency path can be considered in the following two ways.
For example, from the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.
X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news article headlines into the categories of "business," "science and technology," "entertainment," and "health."
Download the News Aggregator Data Set and follow the procedure below to create training data (train.txt), validation data (valid.txt), and evaluation data (test.txt).
qiita.rb
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
import zipfile
with zipfile.ZipFile('./NewsAggregatorDataset.zip') as existing_zip:
    existing_zip.extractall('./')
with open('./readme.txt') as f:
    text = f.read()
print(text)
qiita.rb
from sklearn.model_selection import train_test_split
import random
# Load the data into an array
with open('./newsCorpora.csv') as f:
    text = f.readline()
    allinfo = []
    print([r for r in text.replace('\n', '').split('\t') if r != ''])
    while text:
        allinfo.append([r for r in text.replace('\n', '').split('\t') if r != ''])
        text = f.readline()
# Select only the publishers specified by the problem
selectinfo = []
for info in allinfo:
    if info[3] == 'Reuters' or info[3] == 'Huffington Post' or info[3] == 'Businessweek' or info[3] == 'Contactmusic.com' or info[3] == 'Daily Mail':
        selectinfo.append(info)
# Shuffle and split the data
random.shuffle(selectinfo)
print(selectinfo[:50])
train, testandaccess = train_test_split(selectinfo, train_size=0.8)
valid, test = train_test_split(testandaccess, train_size=0.5)
# Write the data out
with open('./train.txt', mode='w') as f:
    train_txt = str()
    for t in train:
        for i in range(len(t)):
            if i == len(t) - 1:
                train_txt += t[i] + '\n'
            else:
                train_txt += t[i] + '\t'
    f.write(train_txt)
with open('./valid.txt', mode='w') as f:
    valid_txt = str()
    for t in valid:
        for i in range(len(t)):
            if i == len(t) - 1:
                valid_txt += t[i] + '\n'
            else:
                valid_txt += t[i] + '\t'
    f.write(valid_txt)
with open('./test.txt', mode='w') as f:
    test_txt = str()
    for t in test:
        for i in range(len(t)):
            if i == len(t) - 1:
                test_txt += t[i] + '\n'
            else:
                test_txt += t[i] + '\t'
    f.write(test_txt)
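As a quick sanity check on the split (assuming the category field ends up at index 4, as in the original newsCorpora.csv column order), the category balance of train.txt can be counted like this:
qiita.rb
# Count the category labels (b, e, t, m) in the training split written above.
from collections import Counter

with open('./train.txt') as f:
    cats = [line.split('\t')[4] for line in f if line.strip()]
print(Counter(cats))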
Extract features from the training, validation, and evaluation data and save them under the file names train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Feel free to design any features that seem useful for classification; the minimum baseline would be the article headline converted to a sequence of words.
qiita.rb
with open('./train.txt') as f:
    with open('./valid.txt') as ff:
        with open('./test.txt') as fff:
            i = 0
            name = ['train', 'valid', 'test']
            FF = [f, ff, fff]
            data_dict = dict()
            for F in FF:
                text = F.readline()
                a = 'feature_' + name[i]
                array = []
                while text:
                    t = ([r for r in text.replace('\n', '').split('\t') if r != ''])
                    tt = [t[i] for i in range(len(t)) if i == 1 or i == 3 or i == 4]
                    array.append(tt)
                    text = F.readline()
                data_dict[a] = array
                i = i + 1
print(len(valid))
print(len(data_dict['feature_valid']))
print('The numbers match, so the processing above worked correctly')
with open('./train.feature.txt', mode='w') as f:
    with open('./valid.feature.txt', mode='w') as ff:
        with open('./test.feature.txt', mode='w') as fff:
            name = ['feature_train', 'feature_valid', 'feature_test']
            FF = [f, ff, fff]
            for i in range(len(FF)):
                txt = str()
                for t in data_dict[name[i]]:
                    for l in range(len(t)):
                        if l == len(t) - 1:
                            txt += t[l] + '\n'
                        else:
                            txt += t[l] + '\t'
                FF[i].write(txt)
with open('./valid.feature.txt') as f:
    t = f.read()
print('[Part of the valid file]')
print(t[:100])
Learn the logistic regression model using the training data constructed in 51.
A quick review of logistic regression: it is a classification problem, not regression, and this time it reduces to an optimization problem. The model for the positive class is
$p(y = 1 \mid x) = f_{\theta}(x) = \frac{1}{1 + \exp(-\theta^{\mathrm{T}} \mathbf{x})}$
(for multi-class output, the softmax function is defined instead). The parameters are updated by gradient ascent on the log-likelihood $L$:
$\theta_i \leftarrow \theta_i + \eta \frac{\partial L}{\partial \theta_i}$
Here, writing out the first part of the derivative gives
$\frac{\partial L}{\partial \theta_i} = \frac{y - f_{\theta}(x)}{f_{\theta}(x)(1-f_{\theta}(x))}\frac{\partial f_{\theta}(x)}{\partial \theta_i}$
To differentiate the second part, define a new function $u$:
$u = 1 + e^{-\theta^{\mathrm{T}}\mathbf{x}}$, so that $f_{\theta}(x) = \frac{1}{u}$.
As a result, the derivative of $f$ can be expressed as
$\frac{\partial f_{\theta}(x)}{\partial u}\frac{\partial u}{\partial \theta_i} = f_{\theta}(x)(1 - f_{\theta}(x))\, x_i$
In summary, the differential value used for error backpropagation is $\frac{\partial L}{\partial \theta_i} = (y - f_{\theta}(x))\, x_i$.
Written in a slightly more neural-network style, with $f = \frac{1}{1 + \exp(-\mathbf{x})}$:
$\delta = \mathbf{y} - \mathbf{t}$
$\nabla_{\bf W} E = \frac{1}{N}\delta {\bf x}^{\mathrm{T}}$
$\nabla_{\bf b} E = \frac{1}{N}\delta \mathbb{1}_N$
${\bf W} \leftarrow {\bf W} - \epsilon \nabla_{\bf W} E$
${\bf b} \leftarrow {\bf b} - \epsilon \nabla_{\bf b} E$
This time I chose the latter (matrix) formulation.
Editing the data
qiita.rb
with open('./train.feature.txt') as f:
    with open('./valid.feature.txt') as ff:
        with open('./test.feature.txt') as fff:
            i = 0
            name = ['train', 'valid', 'test']
            FF = [f, ff, fff]
            data_dict = dict()
            for F in FF:
                text = F.readline()
                a = 'feature_' + name[i]
                array = []
                while text:
                    tt = ([r for r in text.replace('\n', '').split('\t') if r != ''])
                    # tt = [t[i] for i in range(len(t)) if i == 1 or i == 3 or i == 4]
                    array.append(tt)
                    text = F.readline()
                data_dict[a] = array
                i = i + 1
print(data_dict['feature_train'])
x, t = dict(), dict()  # x holds [headline, publisher], t holds the category label
for n in name:
    a = 'feature_' + n
    data = data_dict[a]
    x[a] = [[r[0], r[1]] for r in data]
    t[a] = [r[2] for r in data]
print(x['feature_train'])
print(t['feature_train'])
The text has to be converted to vectors, so here is the conversion code.
qiita.rb
# Text vectorization
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Extract only the headline text
def summary(data):
    text = list()
    for xx in data:
        text.append((xx[0]))
    return text

# Convert the category label (answer) to a one-hot vector
def ans_to_vector(text):
    vec = list()
    for tt in text:
        if tt == 'b':
            vec.append(0)
        if tt == 'e':
            vec.append(1)
        if tt == 't':
            vec.append(2)
        if tt == 'm':
            vec.append(3)
    return np.eye(len(np.unique(vec)))[vec]

# Convert the publisher to a one-hot vector; the argument is x['feature_*']
def publisher_to_vector(text):
    vec = list()
    for tt in text:
        if tt[1] == 'Reuters':
            vec.append(0)
        if tt[1] == 'Huffington Post':
            vec.append(1)
        if tt[1] == 'Businessweek':
            vec.append(2)
        if tt[1] == 'Contactmusic.com':
            vec.append(3)
        if tt[1] == 'Daily Mail':
            vec.append(4)
    onehotvec = np.eye(len(np.unique(vec)))[vec]
    return np.array(onehotvec)

# TfidfVectorizer
def texttovec(textlist):
    vec = TfidfVectorizer().fit_transform(textlist)
    return vec

# Vector addition (not used)
def vecplusvec(vec1, vec2):
    v = list()
    for v1, v2 in zip(vec1, vec2):
        print(v1[None, :].shape, v2[None, :].shape)
        v.append(v1 + v2[None, :])
        print('A')
    return v

# Build the text vectors
text_train = summary(x['feature_train'])
text_valid = summary(x['feature_valid'])
text = text_train + (text_valid)
vec_text = TfidfVectorizer().fit_transform(text_train).toarray()  # note: fit only on text_train; the combined 'text' above is not actually used
# Vector split (note: vec_valid here ends up reusing the first len(text_valid) rows of the training matrix)
vec_train, vec_valid = list(), list()
for i in range(vec_text.shape[0]):
    if i < len(text_train):
        vec_train.append(vec_text[i])
    if i < len(text_valid):
        vec_valid.append(vec_text[i])
# Convert the lists to numpy arrays
vec_train = np.array(vec_train)
vec_valid = np.array(vec_valid)
# Vectorize the answers and publishers
vec_train_ans = ans_to_vector((t['feature_train']))
vec_valid_ans = ans_to_vector((t['feature_valid']))
vec_train_publisher = publisher_to_vector(x['feature_train'])
vec_valid_publisher = publisher_to_vector(x['feature_valid'])
# Concatenate the input vectors
vec_train = np.concatenate([vec_train, vec_train_publisher], axis=1)
vec_valid = np.concatenate([vec_valid, vec_valid_publisher], 1)
print("Input dimension", vec_train.shape)
print('Label (answer) dimension', vec_train_ans.shape)
print("Publisher name vector", vec_train_publisher.shape)
# Input dimension (10684, 12783)
# Label (answer) dimension (10684, 4)
# Publisher name vector (10684, 5)
Code for training the model
qiita.rb
import numpy as np
from sklearn.metrics import accuracy_score

# Prevent the contents of log from becoming 0
def np_log(x):
    return np.log(np.clip(a=x, a_min=1e-10, a_max=1e+10))

def sigmoid(x):
    # return 1 / (1 + np.exp(-x))
    return np.tanh(x * 0.5) * 0.5 + 0.5  # use numpy's built-in tanh (prevents overflow of exp)

W, b = np.random.uniform(-0.08, 0.08, size=(vec_train.shape[1], 4)), np.zeros(shape=(4,)).astype('float32')

# Training
def train(x, t, eps=1):
    global W, b
    batch_size = x.shape[0]
    y = sigmoid(np.matmul(x, W) + b)  # shape: (batch_size, output dimension); matmul: matrix product
    # Backpropagation
    cost = (- t * np_log(y) - (1 - t) * np_log(1 - y)).mean()
    delta = y - t  # shape: (batch_size, output dimension)
    # Parameter update
    dW = np.matmul(x.T, delta) / batch_size  # shape: (input dimension, output dimension)
    db = np.matmul(np.ones(shape=(batch_size,)), delta) / batch_size  # shape: (output dimension,)
    W -= eps * dW
    b -= eps * db
    return cost

# Validation
def valid(x, t):
    y = sigmoid(np.matmul(x, W) + b)
    cost = (- t * np_log(y) - (1 - t) * np_log(1 - y)).mean()
    return cost, y

# Run
for epoch in range(3):
    for x, t in zip(vec_train, vec_train_ans):
        cost = train(x[None, :], t[None, :])
    cost, y_pred = valid(vec_valid, vec_valid_ans)
    print('EPOCH: {}, Valid Cost: {:.3f}, Valid Accuracy: {:.3f}'.format(
        epoch + 1,
        cost,
        accuracy_score(vec_valid_ans.argmax(axis=1), y_pred.argmax(axis=1))
    ))
qiita.rb
#EPOCH: 1, Valid Cost: 0.477, Valid Accuracy: 0.647
#EPOCH: 2, Valid Cost: 0.573, Valid Accuracy: 0.598
#EPOCH: 3, Valid Cost: 0.638, Valid Accuracy: 0.570
Since there are four labels, chance accuracy would be 25%. Is an accuracy of around 65% generally a good value? (I didn't really understand the text vectorization, and the vectors became quite sparse; you can see that the model is overfitting.) I had just covered logistic regression in a certain deep-learning class in graduate school, so I implemented it by adapting that.
Measure the correct answer rate of the logistic regression model learned in 52 on the training data and evaluation data.
qiita.rb
y = np.round(y_pred, 2)
j = 0
for i in range(y.shape[0]):
    if (y_pred.argmax(axis=1)[i] == vec_valid_ans.argmax(axis=1)[i]):
        j = j + 1
print('The correct answer rate is', j / (y.shape[0]))
print('The correct answer rate (by command) is', accuracy_score(y_pred.argmax(axis=1), vec_valid_ans.argmax(axis=1)))
Create a confusion matrix of the logistic regression model learned in 52 on the training data and evaluation data.
qiita.rb
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_pred.argmax(axis = 1), vec_valid_ans.argmax( axis = 1))
print(cm)
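Since seaborn is already imported above for the bar chart, the same matrix can also be drawn as a heatmap (a small optional sketch; with the argument order used above, the rows correspond to the predictions):
qiita.rb
# Optional: draw the confusion matrix computed above as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('predicted label (first argument above)')
plt.xlabel('true label (second argument above)')
plt.show()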
Measure the precision, recall, and F1 score of the logistic regression model learned in 52 on the evaluation data. Find the precision, recall, and F1 score for each category, and integrate the performance for each category with micro-average and macro-average.
Review
Recall: the proportion of actually positive samples that were answered correctly.
Specificity: the proportion of the data you do not want to detect that was indeed not detected (of the non-cat data, how much was answered correctly?). Precision: the proportion of the detected data that is actually correct (of the images identified as cats, how many really are cat images?).
Negative predictive value: of the data predicted as not being the target, the proportion that really is not the target (of the images judged "not a cat", how many really are not cats?).
F1 score: the F1-measure is the harmonic mean of precision and recall.
$F1 = \frac{2TP}{2TP + FP + FN} $
qiita.rb
from sklearn.metrics import classification_report
print(classification_report(vec_valid_ans.argmax( axis = 1), y_pred.argmax(axis = 1)))
qiita.rb
precision recall f1-score support
0 0.63 0.77 0.69 588
1 0.71 0.63 0.67 518
2 0.13 0.12 0.13 145
3 0.17 0.07 0.10 85
accuracy 0.60 1336
macro avg 0.41 0.40 0.40 1336
weighted avg 0.58 0.60 0.58 1336
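The report above gives the per-class scores and the macro average; the micro and macro averages asked for in the problem can also be pulled out directly from the same predictions (a small sketch):
qiita.rb
# Micro- and macro-averaged precision, recall and F1 on the validation predictions.
from sklearn.metrics import precision_recall_fscore_support

y_true = vec_valid_ans.argmax(axis=1)
y_hat = y_pred.argmax(axis=1)
for avg in ['micro', 'macro']:
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_hat, average=avg, zero_division=0)
    print('{}: precision {:.3f}, recall {:.3f}, f1 {:.3f}'.format(avg, p, r, f1))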
Answer and impression of 100 language processing knocks - Part 1
[Enable MeCab in Colaboratory](https://qiita.com/pytry3g/items/897ae738b8fbd3ae7893)
Deep learning class
Machine learning ~ Text features (CountVectorizer, TfidfVectorizer) ~ Python