This is a late entry for day 17 of the advent calendar.
I worked through the 100 Language Processing Knocks, so I am writing up my answers and impressions one problem at a time (part two). This part took far longer than the previous one.
For the environment and setup, see the previous post (previous link).
Use CaboCha to parse the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Use this file to implement a program that addresses the following questions.
import CaboCha
c = CaboCha.Parser()
with open('./neko.txt') as f:
    text = f.read()
with open('./neko.txt.cabocha', mode='w') as f:
    # Split on the full stop and put it back so CaboCha parses one sentence at a time
    for se in [s + '。' for s in text.split('。')]:
        f.write(c.parse(se).toString(CaboCha.FORMAT_LATTICE))
This was actually my first time doing dependency parsing, so even this step alone took a lot of searching.
Implement the class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), represent each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.
# 40
class Morph:
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface
        self.base = base
        self.pos = pos
        self.pos1 = pos1
doc = []
with open('./neko.txt.cabocha') as f:
    sentence = []
    line = f.readline()
    while(line):
        while('EOS' not in line):
            if not line.startswith('*'):
                cols = line.split('\t')
                m = Morph(
                    surface=cols[0],
                    base=cols[1].split(',')[-3],
                    pos=cols[1].split(',')[0],
                    pos1=cols[1].split(',')[1],
                )
                sentence.append(m)
            line = f.readline()
        doc.append(sentence)
        sentence = []
        line = f.readline()
print([t.surface for t in doc[2]])
I worked on this while checking CaboCha's output format.
In addition to problem 40, implement the class Chunk that represents a clause. This class has a list of morphemes (Morph objects) (morphs), the index of the clause it depends on (dst), and a list of the indices of the clauses that depend on it (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the text and the dependency target of each clause in the eighth sentence. Use the program created here for the remaining problems in Chapter 5.
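To make the format concrete, here is a minimal sketch of splitting a single morpheme line into the four Morph fields. It assumes the MeCab IPA dictionary layout (surface, a tab, then comma-separated features with the base form in the 7th position); `parse_morph_line` and the sample line are hypothetical and written by hand for illustration.

```python
# Sketch: split one morpheme line of CaboCha lattice output into named fields.
# Assumed layout (MeCab IPA dictionary):
#   surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
def parse_morph_line(line):
    surface, feature = line.rstrip('\n').split('\t')
    cols = feature.split(',')
    return {
        'surface': surface,
        'base': cols[6],  # base form is the 7th feature in this layout
        'pos': cols[0],
        'pos1': cols[1],
    }

# Hand-written sample line, not copied from neko.txt.cabocha
sample = '見\t動詞,自立,*,*,一段,連用形,見る,ミ,ミ\n'
morph = parse_morph_line(sample)
print(morph)
```

Lines starting with `*` (chunk headers) and `EOS` do not follow this layout, so they need to be filtered out first, as the code above does.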
# 41
class Chunk:
    def __init__(self, morphs, dst, srcs):
        self.morphs = morphs
        self.dst = dst
        self.srcs = srcs
        
doc = []
with open('./neko.txt.cabocha') as f:
    sentence = []
    line = f.readline()
    while(line):
        if line.startswith('*'):
            cols = line.split(' ')
            # The first chunk of a sentence (index 0) has no previous chunk to store yet
            if cols[1] != '0':
                sentence.append(c)
            c = Chunk(
                morphs=[],
                dst=int(cols[2].split('D')[0]),
                srcs=[]
            )
        elif 'EOS' in line:
            sentence.append(c)
            # For each chunk, collect the indices of the chunks that depend on it
            for i, c in enumerate(sentence):
                c.srcs = [idx for idx, chk in enumerate(sentence) if chk.dst == i]
                
            doc.append(sentence)
            sentence = []
        else:
            cols = line.split('\t')
            # CaboCha (IPA dictionary) emits Japanese tags: 記号 = symbol
            if cols[1].split(',')[0] != "記号":
                m = Morph(
                    surface=cols[0],
                    base=cols[1].split(',')[-3],
                    pos=cols[1].split(',')[0],
                    pos1=cols[1].split(',')[1],
                )
                c.morphs.append(m)
        line = f.readline()
for c in doc[7]:
    print(c.dst, end=', ')
    for m in c.morphs:
        print(m.surface, end="")
    print()
for c in doc[0]:
    print(c.dst, end=', ')
    for m in c.morphs:
        print(m.surface)
        print(m.pos)
    print()
I wondered how best to find the clauses that depend on a given clause, but could not come up with anything clever, so I simply looped over the sentence and scanned it.
Extract the text of every source clause and the clause it depends on in tab-delimited format. However, do not output symbols such as punctuation marks.
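As an alternative to the nested scan, srcs can be filled in a single pass by appending each chunk index to the list of its destination. This is only a sketch under the assumption that the dst values of one sentence are available as a plain list; `build_srcs` and the sample list are hypothetical.

```python
# Sketch: build the srcs lists from dst values in one pass.
def build_srcs(dsts):
    srcs = [[] for _ in dsts]
    for i, dst in enumerate(dsts):
        if dst != -1:  # -1 marks the root, which depends on nothing
            srcs[dst].append(i)
    return srcs

# Hand-written dst values for a six-chunk sentence (the last chunk is the root)
dsts = [5, 2, 3, 4, 5, -1]
srcs = build_srcs(dsts)
print(srcs)
```

For sentences of this size the difference is negligible, so the simple scan is perfectly adequate here.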
# 42
# Printing everything freezes Jupyter (Chrome), so only the first 50 sentences are shown
for i, d in enumerate(doc[:50]):
    for c in d:
        if int(c.dst) == -1:
            continue
        for m in c.morphs:
            if m.pos == '記号':  # 記号 = symbol
                continue
            print(m.surface, end="")
        print('\t', end="")
        for m in d[c.dst].morphs:
            if m.pos == '記号':
                continue
            print(m.surface, end="")
        print()
I worked on this while keeping the sentence, clause, and morpheme levels in mind.
When a clause containing a noun depends on a clause containing a verb, extract the pair in tab-delimited format. However, do not output symbols such as punctuation marks.
# 43
# Printing everything freezes Jupyter (Chrome), so only the first 50 sentences are shown
for i, d in enumerate(doc[:50]):
    for c in d:
        if int(c.dst) == -1:
            continue
        # CaboCha (IPA dictionary) emits Japanese tags: 名詞 = noun, 動詞 = verb
        contain_noun = '名詞' in [m.pos for m in c.morphs]
        contain_verb = '動詞' in [m.pos for m in d[c.dst].morphs]
        if contain_noun and contain_verb:
            for m in c.morphs:
                if m.pos == '記号':  # 記号 = symbol
                    continue
                print(m.surface, end="")
            print('\t', end="")
            for m in d[int(c.dst)].morphs:
                if m.pos == '記号':
                    continue
                print(m.surface, end="")
            print()
I simply filtered with `if`.
Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.
# 44
import random, pathlib
from graphviz import Digraph
f = pathlib.Path('nekocabocha.png')
fmt = f.suffix.lstrip('.')
fname = f.stem
target_doc = random.choice(doc)
target_doc = doc[8]
idx = doc.index(target_doc)
dot = Digraph(format=fmt)
dot.attr("node", shape="circle")
N = len(target_doc)
#Add node
for i in range(N):
    dot.node(str(i), ''.join([m.surface for m in target_doc[i].morphs]))
#Add edge
for i in range(N):
    if target_doc[i].dst >= 0:
        dot.edge(str(i), str(target_doc[i].dst))
# dot.engine = "circo"
dot.filename = fname  # use the stem defined above (filename was undefined)
dot.render()
print(''.join([m.surface for c in target_doc for m in c.morphs]))
print(dot)
from IPython.display import Image, display_png
display_png(Image(str(f)))
I had never really done visualization before, so this was my first attempt. Wikipedia was enough to get a rough idea of the DOT language.
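For reference, the DOT text itself is simple enough to write by hand. The sketch below builds a DOT string from clause labels and dst values without any library; `to_dot` and the sample data are hypothetical, and labels containing double quotes would need escaping, which this sketch does not handle.

```python
# Sketch: emit DOT source for a dependency tree without using a library.
def to_dot(labels, dsts):
    lines = ['digraph G {']
    for i, label in enumerate(labels):
        lines.append(f'  {i} [label="{label}"];')  # one node per clause
    for i, dst in enumerate(dsts):
        if dst >= 0:  # the root (dst == -1) has no outgoing edge
            lines.append(f'  {i} -> {dst};')
    lines.append('}')
    return '\n'.join(lines)

# Hand-written sample: clause texts and the index each clause depends on
labels = ['吾輩は', 'ここで', '始めて', '人間という', 'ものを', '見た']
dsts = [5, 2, 3, 4, 5, -1]
dot_src = to_dot(labels, dsts)
print(dot_src)
```

Saving the printed text to a file and running the `dot` command on it should give much the same picture as rendering through the graphviz package.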
I would like to treat the text used here as a corpus and investigate the cases that Japanese predicates can take. Regard a verb as a predicate and the particles of the clauses that depend on the verb as cases, and output the predicate and cases in tab-delimited format. However, make sure that the output meets the following specifications.
In a clause containing a verb, the base form of the leftmost verb is used as the predicate.
The particles of the clauses that depend on the predicate are the cases.
If multiple particles (clauses) depend on the predicate, arrange all the particles in lexicographic order, separated by spaces.
Consider the example sentence 「吾輩はここで始めて人間というものを見た」 ("I saw a human being for the first time here", the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, 「始める」 (begin) and 「見る」 (see). If the clause depending on 「始める」 is analyzed as 「ここで」 (here) and the clauses depending on 「見る」 as 「吾輩は」 (I) and 「ものを」 (thing), the following output should be produced.
始める	で
見る	は を
Save the output of this program to a file and check the following items using UNIX commands.
The combinations of predicate and case pattern that appear frequently in the corpus
The case patterns of the verbs 「する」 (do), 「見る」 (see), and 「与える」 (give), in order of frequency in the corpus
# 45
with open("neko_verb.txt", mode="w") as f:
    for s in doc:
        for c in s:
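            # CaboCha (IPA dictionary) emits Japanese tags: 動詞 = verb, 助詞 = particle
            if '動詞' in [m.pos for m in c.morphs]:
                # Predicate: base form of the leftmost verb, not simply the first morpheme
                row = next(m.base for m in c.morphs if m.pos == '動詞')
                j_list = []
                for i in c.srcs:
                    if len(s[i].morphs) < 2:
                        continue
                    srclast = s[i].morphs[-1]
                    if srclast.pos == '助詞':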
                        j_list.append(srclast.surface)
                if len(j_list) > 0:
                    j_list.sort()
                    row += "\t" +  " ".join(j_list)
                    f.write(row + "\n")
$ cat neko_verb.txt | sort | uniq -c | sort -rn
$ cat neko_verb.txt | grep "^する" | sort | uniq -c | sort -rn
$ cat neko_verb.txt | grep "^見る" | sort | uniq -c | sort -rn
$ cat neko_verb.txt | grep "^与える" | sort | uniq -c | sort -rn
This combines the part-of-speech and dependency information built so far.
Modify the program from problem 45 to output the predicate and case pattern followed by the terms (the clauses themselves that depend on the predicate) in tab-delimited format. In addition to the specifications of problem 45, meet the following specifications.
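The `sort | uniq -c | sort -rn` pipeline can also be reproduced in Python with `collections.Counter`, which may be handy for checking the shell results. This is a sketch only: `count_patterns` and the sample lines are hypothetical, and in practice the lines would come from neko_verb.txt.

```python
from collections import Counter

# Sketch: count identical "predicate<TAB>cases" lines, most frequent first.
def count_patterns(lines, verb=None):
    counter = Counter()
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        predicate = line.split('\t')[0]
        # With verb=None every line is counted; otherwise only that predicate
        if verb is None or predicate == verb:
            counter[line] += 1
    return counter.most_common()

# Hand-written sample lines in the same format as neko_verb.txt
sample_lines = ['見る\tは を\n', 'する\tを\n', '見る\tは を\n', 'する\tに\n', '見る\tを\n']
top = count_patterns(sample_lines)
miru = count_patterns(sample_lines, verb='見る')
print(top[0], miru)
```

Comparing the predicate field exactly avoids the partial matches that a bare `grep` pattern can pick up.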
The term should be the word sequence of the clause that depends on the predicate (there is no need to remove the trailing particle).
If multiple clauses depend on the predicate, arrange them by the same criterion and in the same order as the particles, separated by spaces.
Consider the example sentence 「吾輩はここで始めて人間というものを見た」 ("I saw a human being for the first time here", the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, 「始める」 (begin) and 「見る」 (see). If the clause depending on 「始める」 is analyzed as 「ここで」 (here) and the clauses depending on 「見る」 as 「吾輩は」 (I) and 「ものを」 (thing), the following output should be produced.
始める	で	ここで
見る	は を	吾輩は ものを
# 46
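# This differs from the sample output, but I believe it still meets the problem's specification
for s in doc:
    for c in s:
        # CaboCha (IPA dictionary) emits Japanese tags: 動詞 = verb, 助詞 = particle
        if '動詞' in [m.pos for m in c.morphs]:
            # Predicate: base form of the leftmost verb, not simply the first morpheme
            row = next(m.base for m in c.morphs if m.pos == '動詞')
            j_list = []
            c_list = []
            for i in c.srcs:
                if len(s[i].morphs) < 2:
                    continue
                srclast = s[i].morphs[-1]
                if srclast.pos == '助詞':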
                    j_list.append(srclast.surface)
                    c_list.append(''.join([m.surface for m in s[i].morphs]))
            if len(j_list) > 0:
                j_list.sort()
                c_list.sort()
                row += "\t" +  " ".join(j_list) + "\t"+  " ".join(c_list)
                print(row)
The specification says: "If multiple clauses depend on the predicate, arrange them by the same criterion and in the same order as the particles, separated by spaces."
So, although the result differs from the sample output, I also sorted the clauses independently.
I would like to focus only on cases where the wo case (を) of a verb contains a sa-hen connection noun (サ変接続名詞). Modify the program from problem 46 to meet the following specifications.
Consider only the cases where a clause consisting of "sa-hen connection noun + を (particle)" depends on a verb.
The predicate is "sa-hen connection noun + を + base form of the verb"; when there are multiple verbs in the clause, use the leftmost verb.
If multiple particles (clauses) depend on the predicate, arrange all the particles in lexicographic order, separated by spaces.
If multiple clauses depend on the predicate, arrange all the terms separated by spaces (align them with the order of the particles).
For example, the following output should be obtained from the sentence 「別段くるにも及ばんさと、主人は手紙に返事をする。」 ("The master will reply to the letter, even if it comes to another place.").
返事をする	と に は	及ばんさと 手紙に 主人は
Save the output of this program to a file and check the following items using UNIX commands.
Predicates that appear frequently in the corpus (sa-hen connection noun + を + verb)
Predicate and particle patterns that appear frequently in the corpus
# 47
with open("neko_func_verb.txt", mode="w") as f:
    for s in doc:
        for c in s:
            # CaboCha (IPA dictionary) emits Japanese tags: 動詞 = verb, 助詞 = particle
            if '動詞' in [m.pos for m in c.morphs]:
                # Base form of the leftmost verb in the clause
                verb = next(m.base for m in c.morphs if m.pos == '動詞')
                for i in c.srcs:
                    v_head = s[i].morphs[-2:]
                    if len(v_head) < 2:
                        continue
                    # サ変接続 = sa-hen connection noun, を = the object particle "wo"
                    if v_head[0].pos1 == 'サ変接続' and v_head[1].surface == 'を':
                        # Use a separate variable so that verb is not changed between iterations
                        predicate = ''.join([m.surface for m in v_head]) + verb
                        joshi_dic = {}
                        for j in c.srcs:
                            if len(s[j].morphs) < 2:
                                continue
                            srclast = s[j].morphs[-1]
                            if srclast.pos == '助詞' and srclast.surface != 'を':
                                joshi_dic[srclast.surface] = ''.join([m.surface for m in s[j].morphs])
                        if len(joshi_dic.keys()) > 0:
                            joshi_list = list(joshi_dic.keys())
                            joshi_list.sort()
                            row = predicate + "\t" + " ".join(joshi_list) + "\t" + " ".join([joshi_dic[joshi] for joshi in joshi_list])
                            f.write(row + "\n")
$ cut -f 1 neko_func_verb.txt | sort | uniq -c | sort -rn
$ cut -f 1,2 neko_func_verb.txt | sort | uniq -c | sort -rn
The specification says: "If multiple clauses depend on the predicate, arrange all the terms separated by spaces (align them with the order of the particles)."
I kept this correspondence with a dictionary keyed by particle. No matter how many times I read the problem statement I could not understand it, and this is where I started to struggle.
For every clause in the sentence that contains a noun, extract the path from that clause to the root of the syntax tree. However, the path on the syntax tree shall satisfy the following specifications.
Each clause is represented by its (surface form) morpheme sequence.
From the start clause to the end clause of the path, concatenate the expressions of the clauses with "->".
From the sentence 「吾輩はここで始めて人間というものを見た」 ("I saw a human being for the first time here", the 8th sentence of neko.txt.cabocha), the following output should be obtained.
吾輩は -> 見た
ここで -> 始めて -> 人間という -> ものを -> 見た
人間という -> ものを -> 見た
ものを -> 見た
# 48
for s in doc:
    for c in s:
        if "名詞" in [m.pos for m in c.morphs]:  # 名詞 = noun (CaboCha emits Japanese tags)
            row = "".join([m.surface for m in c.morphs])
            chunk_to = c.dst
            if chunk_to == -1:
                continue
            while(chunk_to != -1):
                row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                chunk_to = s[chunk_to].dst
            print(row)
Compared with the previous problem, this one went smoothly.
Extract the shortest dependency path that connects every pair of noun phrases in a sentence. However, when the clause numbers of the noun phrase pair are i and j (i < j), the dependency path shall satisfy the following specifications.
As in problem 48, the path is expressed by concatenating the expressions (surface form morpheme sequences) of the clauses from the start clause to the end clause with "->".
Replace the noun phrases contained in clauses i and j with X and Y, respectively.
In addition, the shape of the dependency path can take the following two forms.
If clause j lies on the path from clause i to the root of the syntax tree: show the path from clause i to clause j.
Otherwise, when clause i and clause j meet at a common clause k on their paths to the root of the syntax tree: show the path from clause i up to just before clause k, the path from clause j up to just before clause k, and the contents of clause k, connected by "|".
For example, from the sentence 「吾輩はここで始めて人間というものを見た。」 ("I saw a human being for the first time here", the 8th sentence of neko.txt.cabocha), the following output should be obtained.
Xは | Yで -> 始めて -> 人間という -> ものを | 見た
Xは | Yという -> ものを | 見た
Xは | Yを | 見た
Xで -> 始めて -> Y
Xで -> 始めて -> 人間という -> Y
Xという -> Y
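Stripped of the Chunk objects, the loop above is just a walk along dst until -1 is reached. A minimal sketch, assuming the dst values of one sentence are given as a list (`path_to_root` and the sample data are hypothetical):

```python
# Sketch: follow dst links from a start clause until the root (dst == -1).
def path_to_root(dsts, start):
    path = [start]
    while dsts[path[-1]] != -1:
        path.append(dsts[path[-1]])
    return path

# Hand-written sample: clause texts and the index each clause depends on
labels = ['吾輩は', 'ここで', '始めて', '人間という', 'ものを', '見た']
dsts = [5, 2, 3, 4, 5, -1]
for i in (0, 1, 3, 4):  # indices of the clauses that contain a noun
    print(' -> '.join(labels[k] for k in path_to_root(dsts, i)))
```

Returning indices rather than strings keeps the function reusable for the next problem, where two paths have to be compared.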
# 49
for s in doc:
    # Since i < j, the last chunk never needs to be i
    for i, c in enumerate(s[:-1]):
        # CaboCha (IPA dictionary) emits Japanese tags: 名詞 = noun, 助詞 = particle
        if "名詞" in [m.pos for m in c.morphs] and c.morphs[-1].pos == "助詞":
            # Look for j among the later chunks
            for c_rest in s[i+1:]:
                if "名詞" in [m.pos for m in c_rest.morphs] and c_rest.morphs[-1].pos == "助詞":
                    i_clause =  "".join([m.surface if m.pos != "名詞" else "X" for m in c.morphs])
                    j_clause =  "".join([m.surface if m.pos != "名詞" else "Y" for m in c_rest.morphs])
                    
                    row = i_clause
                    chunk_to = c.dst
                    # Walk i's path to the root to check whether j lies on it
                    kkr_path = [chunk_to]
                    while(kkr_path[-1] != -1):
                        kkr_path.append(s[chunk_to].dst)
                        chunk_to = s[chunk_to].dst
                    
                    if s.index(c_rest) in kkr_path:
                        chunk_to = c.dst
                        while(chunk_to != s.index(c_rest)):
                            row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                            chunk_to = s[chunk_to].dst
                        row += " -> " + j_clause
                    else:
                        row += " | " + j_clause
                        chunk_to = c_rest.dst
                        while(s[chunk_to].dst != -1):
                            row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                            chunk_to = s[chunk_to].dst
                        row += " | " + "".join([m.surface for m in s[chunk_to].morphs])
                        
                    print(row)
The specification says: "Replace the noun phrases contained in clauses i and j with X and Y, respectively."
Here too the specification and the sample output differ slightly; I replaced only the nouns, producing forms such as 「Xが」 ("X ga").
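The two path shapes in the specification can be told apart by comparing the root paths of i and j. The sketch below works on index lists only and leaves the X/Y substitution aside; `pair_path` and the sample dst list are hypothetical, and it assumes every clause eventually reaches the same root.

```python
# Sketch: classify the dependency path between clauses i and j (i < j).
def path_to_root(dsts, start):
    path = [start]
    while dsts[path[-1]] != -1:
        path.append(dsts[path[-1]])
    return path

def pair_path(dsts, i, j):
    path_i = path_to_root(dsts, i)
    path_j = path_to_root(dsts, j)
    if j in path_i:
        # Shape 1: j lies on the way from i to the root
        return ('direct', path_i[:path_i.index(j) + 1])
    # Shape 2: the first clause on i's path that is also on j's path is k
    k = next(n for n in path_i if n in path_j)
    return ('merge', path_i[:path_i.index(k)], path_j[:path_j.index(k)], k)

# Hand-written dst values for a six-clause sentence (the last clause is the root)
dsts = [5, 2, 3, 4, 5, -1]
print(pair_path(dsts, 0, 1))
print(pair_path(dsts, 1, 3))
```

Unlike the code above, this does not assume that the meeting clause k is the root, so it may also cover sentences where the two paths join earlier.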
Perform the following processing on the English text (nlp.txt).
$ wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt
Treat the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence delimiter, and output the input document one sentence per line.
# 50
import re
sentence_sep = re.compile(r'(\.|;|:|\?|!) ([A-Z])')
with open("./nlp.txt") as f:
    txt = f.read()
txt = re.sub(sentence_sep, r'\1\n\2', txt)
print(txt)
I solved this one by thinking it through as I went.
Treat whitespace as the word delimiter, take the output of problem 50 as input, and output one word per line. However, output a blank line at the end of each sentence.
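To see what the pattern does and does not split, here is a small sketch on a hand-written string. The regular expression is the same idea written with a character class; `split_sentences` and the sample text are hypothetical.

```python
import re

# Sketch: sentence end (. ; : ? !) + one space + uppercase letter = boundary.
SENTENCE_SEP = re.compile(r'([.;:?!]) ([A-Z])')

def split_sentences(text):
    # Keep the punctuation and the capital, replace only the space between them
    return SENTENCE_SEP.sub(r'\1\n\2', text).split('\n')

sample = 'It works. Does it? Yes: It does. e.g. this stays'
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Because the rule requires an uppercase letter after the space, "e.g. this" is left alone, while a colon followed by a capitalized word is split even when that may not be a real sentence boundary.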
# 51
def space2return(txt):
    sentence_sep = re.compile(r'(\.|;|:|\?|!)\n([A-Z])')
    txt = re.sub(sentence_sep, r'\1\n\n\2', txt)
    return re.sub(r' ', r'\n', txt)
txt = space2return(txt)
print(txt)
This takes the output of problem 50, but line breaks inside sentences made it behave a little unexpectedly, so I handled that ad hoc.
Take the output of problem 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-delimited format. In Python, the stemming module can be used as an implementation of Porter's stemming algorithm.
# 52
from nltk.stem import PorterStemmer
ps = PorterStemmer()
def stem_text(txt):
    for l in txt.split('\n'):
        yield l + '\t' + ps.stem(l)
    
for line in stem_text(txt):
    print(line)
The stemming module was not available at the link provided, so I substituted NLTK's PorterStemmer. I tried using yield, which I usually avoid, and found that the function then returns an iterator (a generator).
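The point about yield can be checked in isolation. In the sketch below the stemmer is passed in as a function so that the example runs without NLTK; `toy_stem` is a hypothetical stand-in and is not Porter's algorithm.

```python
import types

# Sketch: a function containing yield returns a generator, evaluated lazily.
def stem_lines(words, stem):
    for w in words:
        yield w + '\t' + stem(w)

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: just drops a trailing "s"
    return word[:-1] if word.endswith('s') else word

gen = stem_lines(['cats', 'run'], toy_stem)
is_generator = isinstance(gen, types.GeneratorType)  # nothing computed yet
first = next(gen)   # computes only the first line
rest = list(gen)    # consumes whatever is left
print(is_generator, repr(first), rest)
```

A generator can be consumed only once, so calling `list(gen)` a second time would return an empty list.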
53 Tokenization
Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
$ unzip stanford-corenlp-full-2018-10-05.zip
$ java -cp "./stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse,lemma,ner,coref -file ./nlp.txt
With this one, the XML parsing was not really the hard part.
See the reference and adjust -annotators on the java command line as needed.
# 53
import xml.etree.ElementTree as ET
tree = ET.parse("./nlp.txt.xml")
root = tree.getroot()
for token in root.iter("token"):
    print(token.find("word").text)
I suppose the way to go is to dump the whole file and check which tags it contains.
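One way to check the tags without reading the whole file by eye is to count the element names. The sketch below runs on a tiny hand-written XML string whose structure only imitates the CoreNLP output; the real file would be loaded with `ET.parse("./nlp.txt.xml")` instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written miniature document; an assumption-based imitation, not real output
sample_xml = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Hello</word><lemma>hello</lemma><POS>UH</POS><NER>O</NER></token>
<token id="2"><word>world</word><lemma>world</lemma><POS>NN</POS><NER>O</NER></token>
</tokens></sentence></sentences></document></root>"""

sample_root = ET.fromstring(sample_xml)
# iter() walks every element in document order, including the root itself
tag_counts = Counter(elem.tag for elem in sample_root.iter())
for tag, n in tag_counts.most_common():
    print(tag, n)
```

The list of tag names is usually enough to decide which ones to pass to `iter()` or `find()`.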
Read the analysis result XML of Stanford Core NLP and output words, lemmas, and part of speech in tab-delimited format.
# 54
for token in root.iter("token"):
    print(token.find("word").text + "\t" + token.find("lemma").text + "\t" + token.find("POS").text)
The output is very hard to read.
Extract all personal names in the input text.
# 55
for token in root.iter("token"):
    NERtag = token.find("NER").text
    if NERtag == "PERSON":
        print(token.find("word").text)
I expected to have to implement this myself, but the tokens were already tagged.
Replace the reference expression (mention) in the sentence with the representative reference expression (representative mention) based on the result of the co-reference analysis of Stanford Core NLP. However, when replacing, be careful so that the original reference expression can be understood, such as "representative reference expression (reference expression)".
# 56
rep_dic_list = []
#Making a dictionary
for coreferences in root.findall("document/coreference"):
    for mentions in coreferences:
        for m in mentions:
            if "representative" in m.attrib:
                rep_txt = m.find("text").text
            else:
                tmp_dic = {}
                tmp_dic["sentence"] = m.find("sentence").text
                tmp_dic["start"] = m.find("start").text
                tmp_dic["end"] = m.find("end").text
                tmp_dic["rep_txt"] = rep_txt
                rep_dic_list.append(tmp_dic)
                
#output
for s in root.iter("sentence"):
    rep_sent_list = [rd for rd in rep_dic_list if rd["sentence"] == s.attrib["id"]]
    # Does this sentence contain any mention that needs annotating?
    if len(rep_sent_list) == 0:
        print(" ".join([token.find("word").text for token in s.iter("token")]), end=" ")
    else:
        for token in s.iter("token"):
            tid = token.attrib["id"]
            rep_token_list = [rd for rd in rep_sent_list if rd["start"] == tid or rd["end"] == tid]
            
            if len(rep_token_list) > 0:
                #Since there is only one, take it out
                rep_dic = rep_token_list[0]
                
                #Decoration
                if tid == rep_dic["start"]:
                    print("「" + rep_dic["rep_txt"] + " (", end=" ")
                if tid == rep_dic["end"]:
                    print(")」", end=" ")
                    
            print(token.find("word").text, end=" ")
I could not understand the problem statement and left this one untouched for a long time. I build a dictionary and annotate the mentions rather than replace them.
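The annotation step can be tried on a plain token list. The sketch below assumes that mention positions are 1-based token ids with an exclusive end, as in the code above, and that mentions do not overlap; `annotate_mentions`, the brackets, and the sample data are hypothetical.

```python
# Sketch: wrap each mention as "[representative ( mention )]".
def annotate_mentions(tokens, mentions):
    # mentions: (start, end, representative) with 1-based start, exclusive end
    opens = {start: rep for start, end, rep in mentions}
    closes = {end - 1 for start, end, rep in mentions}  # id of the last token
    out = []
    for tid, word in enumerate(tokens, start=1):
        if tid in opens:
            out.append(f'[{opens[tid]} (')
        out.append(word)
        if tid in closes:
            out.append(')]')
    return ' '.join(out)

# Hand-written sample: "He" (token 1) refers to "John Smith"
tokens = ['He', 'studies', 'language', '.']
result = annotate_mentions(tokens, [(1, 2, 'John Smith')])
print(result)
```

Closing the bracket after the last token of the mention, instead of before the token at the end index, also handles a mention that runs to the end of the sentence.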
Visualize the collapsed-dependencies of Stanford Core NLP as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.
# 57
import random, pathlib
from graphviz import Digraph
f = pathlib.Path('nlp.png')
fmt = f.suffix.lstrip('.')
fname = f.stem
dot = Digraph(format=fmt)
dot.attr("node", shape="circle")
sent_id = 3
for sents in root.findall(f"document/sentences/sentence[@id='{sent_id}']"):
    for deps in sents:
        for dep in deps.findall("[@type='collapsed-dependencies']"):
            for token in dep:
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                dot.node(gvnr.attrib["idx"], gvnr.text)
                dot.node(dpnt.attrib["idx"], dpnt.text)
                dot.edge(gvnr.attrib["idx"], dpnt.attrib["idx"])
dot.filename = fname
dot.render()
# print(dot)
from IPython.display import Image, display_png
display_png(Image(str(f)))
At first I skimmed past governor and dependent, so I did not understand the structure at all.
Output the set of "subject predicate object" in tab-delimited format based on the result of the dependency analysis (collapsed-dependencies) of Stanford Core NLP. However, refer to the following for the definitions of subject, predicate, and object.
Predicate: a word that has children (dependents) in both the nsubj and dobj relations
Subject: the child (dependent) that has an nsubj relation from the predicate
Object: the child (dependent) that has a dobj relation from the predicate
# 58
for sents in root.findall(f"document/sentences/sentence"):
    for deps in sents:
        for dep in deps.findall("[@type='collapsed-dependencies']"):
            nsubj_list = []
            for token in dep.findall("./dep[@type='nsubj']"):
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                nsubj_list.append( {
                    (gvnr.attrib["idx"], gvnr.text): (dpnt.attrib["idx"], dpnt.text)
                })
            for token in dep.findall("./dep[@type='dobj']"):
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                dobj_tuple = (gvnr.attrib["idx"], gvnr.text)
                
                if dobj_tuple in [list(nsubj.keys())[0] for nsubj in nsubj_list]:
                    idx =  [list(nsubj.keys())[0] for nsubj in nsubj_list].index( dobj_tuple )
                    jutugo = gvnr.text
                    shugo = nsubj_list[idx][dobj_tuple][1]
                    mokutekigo = dpnt.text
                    print(shugo + "\t" + jutugo + "\t" + mokutekigo)
The approach is to build a dictionary first and then search it for entries that meet the conditions.
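The same approach becomes shorter with a single dictionary keyed by the governor index. The sketch below takes the dependencies of one sentence as plain tuples instead of XML elements; `extract_svo` and the sample tuples are hypothetical, and it keeps only one subject per predicate.

```python
# Sketch: join nsubj and dobj edges that share the same governor.
def extract_svo(deps):
    # deps: (relation, governor_idx, governor_word, dependent_idx, dependent_word)
    subjects = {}
    for rel, g_idx, g_word, d_idx, d_word in deps:
        if rel == 'nsubj':
            subjects[g_idx] = d_word  # predicate index -> subject word
    triples = []
    for rel, g_idx, g_word, d_idx, d_word in deps:
        if rel == 'dobj' and g_idx in subjects:
            triples.append((subjects[g_idx], g_word, d_word))
    return triples

# Hand-written dependencies for one sentence
deps = [
    ('nsubj', '2', 'understands', '1', 'computer'),
    ('dobj', '2', 'understands', '4', 'language'),
    ('det', '4', 'language', '3', 'the'),
]
triples = extract_svo(deps)
for subj, pred, obj in triples:
    print(subj + '\t' + pred + '\t' + obj)
```

Keying on the index rather than the word keeps two occurrences of the same verb in one sentence apart.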
Read the result of phrase structure analysis (S-expression) of Stanford Core NLP and display all noun phrases (NP) in the sentence. Display all nested noun phrases as well.
# 59 
import xml.etree.ElementTree as ET
import re
def search_nest(t):
    if isinstance(t[0], str):
        if isinstance(t[1], str):
            if t[0] == "NP":
                print(t[1])
            return t[1]
        else:
            if t[0] == "NP":
                np_list = []
                for i in t[1:]:
                    res = search_nest(i)
                    if isinstance(res, str):
                        np_list.append(res)  # reuse the result; calling search_nest(i) again would print nested NPs twice
                if len(np_list) > 0:
                    print(' '.join(np_list))
            else:
                for i in t[1:]:
                    search_nest(i)
    else:
        for i in t:
            search_nest(i)
tree = ET.parse("./nlp.txt.xml")
root = tree.getroot()
sent_id = 30
for parse in root.findall(f"document/sentences/sentence[@id='{sent_id}']/parse"):
    S_str = parse.text
    S_str = S_str.replace("(", "('")
    S_str = S_str.replace(")", "')")
    S_str = S_str.replace(" ", "', '")
    S_str = S_str.replace("'(", "(")
    S_str = S_str.replace(")'", ")")
    exec(f"S_tuple = {S_str[:-2]}")
    search_nest(S_tuple)
    
I could not work out how to recognize the nesting of the parentheses, so I ended up with the dirtiest implementation of my life: I forcibly rewrote the S-expression into a tuple literal and extracted the noun phrases recursively.
I tried regular expressions too, but could not solve it that way on my own, so I prioritized getting an answer.
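For comparison, the nesting can be handled without `exec` by tokenizing the S-expression and reading it recursively. This is a sketch under the assumption that every node has the form `(LABEL child ...)` and that words never contain raw parentheses; `parse_sexp`, `leaves`, `noun_phrases`, and the sample string are hypothetical.

```python
import re

# Sketch: parse "(LABEL child ...)" into nested lists, then collect NPs.
def parse_sexp(text):
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != '(':
            return tok  # a label or a word
        node = []
        while tokens[pos] != ')':
            node.append(read())
        pos += 1  # skip the closing parenthesis
        return node

    return read()

def leaves(node):
    # Words under a node, left to right; node[0] is the label and is skipped
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def noun_phrases(node, found=None):
    if found is None:
        found = []
    if isinstance(node, list):
        if node[0] == 'NP':
            found.append(' '.join(leaves(node)))
        for child in node[1:]:
            noun_phrases(child, found)  # nested NPs are collected as well
    return found

# Hand-written parse string in the same style as the <parse> element
sample = ('(ROOT (S (NP (NP (DT the) (NN study)) (PP (IN of) (NP (NN language))))'
          ' (VP (VBZ is) (ADJP (JJ hard)))))')
nps = noun_phrases(parse_sexp(sample))
print(nps)
```

Because the text is never executed, words containing quotes do not break the parse, which the tuple-literal approach cannot guarantee.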
These two chapters took about twice as long as chapters 1 to 4, because the libraries in both were new to me and the amount of language processing terminology increased suddenly. However, the remaining chapters cover DB, ML, and vectors, so I am not too worried.