100 amateur language processing knocks: 41

This is a record of my attempts at the 2015 edition of the Language Processing 100 Knocks. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 5: Dependency Analysis

Use CaboCha to perform dependency parsing on the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result to a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

41. Reading the dependency analysis result (phrase / dependency)

In addition to problem 40, implement the clause (chunk) class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of source clause index numbers (srcs) as member variables. Also, read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the surface string and destination of each chunk of the eighth sentence. Use the program created here for the remaining problems in Chapter 5.

The finished code:

main.py


# coding: utf-8
import CaboCha
import re

fname = 'neko.txt'
fname_parsed = 'neko.txt.cabocha'


def parse_neko():
	'''Parse "I am a cat"

	Parses "I am a cat" (neko.txt) and saves the result to neko.txt.cabocha.
	'''
	with open(fname) as data_file, \
			open(fname_parsed, mode='w') as out_file:

		cabocha = CaboCha.Parser()
		for line in data_file:
			out_file.write(
				cabocha.parse(line).toString(CaboCha.FORMAT_LATTICE)
			)


class Morph:
	'''Morpheme class

	Holds the surface form (surface), base form (base), part of speech (pos),
	and part-of-speech subdivision 1 (pos1) as member variables.
	'''
	def __init__(self, surface, base, pos, pos1):
		'''Initialization'''
		self.surface = surface
		self.base = base
		self.pos = pos
		self.pos1 = pos1

	def __str__(self):
		'''String representation of the object'''
		return 'surface[{}]\tbase[{}]\tpos[{}]\tpos1[{}]'\
			.format(self.surface, self.base, self.pos, self.pos1)


class Chunk:
	'''Phrase (chunk) class

	Holds a list of morphemes (Morph objects) (morphs), the index number of the
	destination chunk (dst), and a list of source chunk index numbers (srcs)
	as member variables.
	'''

	def __init__(self):
		'''Initialization'''
		self.morphs = []
		self.srcs = []
		self.dst = -1

	def __str__(self):
		'''String representation of the object'''
		surface = ''
		for morph in self.morphs:
			surface += morph.surface
		return '{}\tsrcs{}\tdst[{}]'.format(surface, self.srcs, self.dst)


def neco_lines():
	'''Generator for the dependency analysis results of "I am a cat"

	Reads the dependency analysis results of "I am a cat" sequentially
	and yields the Chunk list of one sentence at a time.

	Yields:
	List of Chunk objects for one sentence
	'''
	with open(fname_parsed) as file_parsed:

		chunks = dict()		# Store Chunks keyed by chunk index number
		idx = -1

		for line in file_parsed:

			# Detect the end of a sentence
			if line == 'EOS\n':

				# Yield the list of Chunks
				if len(chunks) > 0:

					# Sort chunks by key and extract only the values
					sorted_tuple = sorted(chunks.items(), key=lambda x: x[0])
					yield list(zip(*sorted_tuple))[1]
					chunks.clear()

				else:
					yield []

			# A line starting with '*' holds a dependency analysis result, so create a Chunk
			elif line[0] == '*':

				# Get the chunk index number and the destination index number
				cols = line.split(' ')
				idx = int(cols[1])
				dst = int(re.search(r'(.*?)D', cols[2]).group(1))

				# Create the Chunk if it does not exist yet, then set its destination index
				if idx not in chunks:
					chunks[idx] = Chunk()
				chunks[idx].dst = dst

				# Create the destination Chunk if needed, then record this chunk as one of its sources
				if dst != -1:
					if dst not in chunks:
						chunks[dst] = Chunk()
					chunks[dst].srcs.append(idx)

			# All other lines are morphological analysis results: create a Morph and add it to the Chunk
			else:

				# The surface form is tab-delimited; the remaining fields are comma-delimited
				cols = line.split('\t')
				res_cols = cols[1].split(',')

				# Create a Morph and add it to the list
				chunks[idx].morphs.append(
					Morph(
						cols[0],		# surface
						res_cols[6],	# base
						res_cols[0],	# pos
						res_cols[1]		# pos1
					)
				)



# Run dependency analysis
parse_neko()

# Build the Chunk list one sentence at a time
for i, chunks in enumerate(neco_lines(), 1):

	# Display the 8th sentence
	if i == 8:
		for j, chunk in enumerate(chunks):
			print('[{}]{}'.format(j, chunk))
		break

Execution result:

The problem only asks to display the destination of each chunk, but the sources are also shown to confirm the implementation of the Chunk class.

Terminal


[0]I'm srcs[]	dst[5]
[1]Here srcs[]	dst[2]
[2]For the first time srcs[1]	dst[3]
[3]Human srcs[2]	dst[4]
[4]Things srcs[3]	dst[5]
[5]saw. srcs[0, 4]	dst[-1]

CaboCha analysis result format

In CaboCha's output, lines starting with * are inserted among the morphological analysis results; these lines carry the dependency analysis results.

Example of dependency analysis results


* 3 5D 1/2 0.656580

This line is delimited by whitespace and has the following content:

column meaning
1 The first column is*.. Indicates that this is a dependency analysis result.
2 Phrase number (integer starting from 0)
3 Contact number +D
4 Head/Function word positions and any number of feature sequences
5 Engagement score. In general, the larger the value, the easier it is to engage.

Only columns 2 and 3 are used in this problem. For details of the output format, see the official site, CaboCha: Yet Another Japanese Dependency Structure Analyzer.
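To illustrate, here is a minimal sketch that parses the example line above with the same split and regex used in main.py, extracting only columns 2 and 3:

```python
import re

# Example dependency-analysis line from CaboCha's lattice output
line = '* 3 5D 1/2 0.656580'

cols = line.split(' ')
idx = int(cols[1])                                  # column 2: chunk number
dst = int(re.search(r'(.*?)D', cols[2]).group(1))   # column 3: number before 'D'

print(idx, dst)  # → 3 5
```

The non-greedy group `(.*?)` stops at the first 'D', so it also handles a destination of "-1D" (no destination) correctly.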

Chunk object creation order

The tricky part of this problem was the order in which Chunk objects are created. For now, I read neko.txt.cabocha line by line, create the corresponding Chunk object as soon as any of the information it should hold becomes available, and add information to it if it has already been created. Since Chunk objects are therefore not created in order of appearance, and a dict keeps them in no particular order, they are sorted by chunk number and extracted at the end. In hindsight, it might have been simpler to first create the Chunk objects in chunk-number order without any dependency information, and then fill in the dependency information afterwards.

That's all for the 41st knock. If you find any mistakes, I would appreciate it if you could point them out.
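As a rough illustration of that alternative, here is a hypothetical two-pass sketch (not part of the finished code above). It assumes the '*' lines of each sentence appear in chunk-number order, and represents chunks as plain dicts to keep the example self-contained:

```python
import re

def chunks_from_sentence(lines):
	'''Hypothetical two-pass builder for the chunks of one sentence.

	lines: the CaboCha lattice lines of one sentence, excluding 'EOS'.
	'''
	# Pass 1: create chunks in order of appearance, recording only dst
	chunks = []
	for line in lines:
		if line.startswith('*'):
			cols = line.split(' ')
			dst = int(re.search(r'(.*?)D', cols[2]).group(1))
			chunks.append({'surface': '', 'dst': dst, 'srcs': []})
		else:
			# Morpheme line: the first tab-separated field is the surface form
			chunks[-1]['surface'] += line.split('\t')[0]

	# Pass 2: all chunks now exist, so the source lists can be filled in safely
	for i, chunk in enumerate(chunks):
		if chunk['dst'] != -1:
			chunks[chunk['dst']]['srcs'].append(i)

	return chunks
```

Because every chunk already exists before pass 2 runs, there is no need for the "create if not yet present" checks or the final sort that the dict-based version requires.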

