This article is day 3 of Kronos Co., Ltd.'s "~ Spring 2020 ~ I will do it on my own Advent Calendar"!
I wanted to try the COTOHA API (a natural language processing / speech processing API platform) that cutting-edge engineers are using. About the COTOHA API, which already has a lot of interesting articles, I intended to do something fun and flashy, but this time I ended up just writing an article about the API itself (contradiction). Think of it as preparation for something flashier later.
**What this article covers**

- Detailed usage of anaphora (coreference) resolution in the COTOHA API
- Behavior I noticed while experimenting that is not documented in the COTOHA API reference
- How to map the JSON returned in the API response onto classes

**What it doesn't cover**

- Information for Python versions before 3.7
- Smart coding (I wrote this as I went)
First, here is a quote from the official page on what anaphora resolution is.

It is a RESTful API that extracts the antecedents (including antecedents consisting of multiple words) corresponding to demonstratives such as "there" and "it", pronouns such as "he" and "she", and anaphoric expressions such as "the same 〇〇", and outputs them all as the same entity.

Hmm, for example? (quoting further)

For example, when analyzing dialogue logs between a dialogue engine and a user, extracting the word a pronoun points to from the sentence containing the pronoun and the surrounding context makes it possible to replace words such as "he" and "she", which carry little meaning for log analysis, with the words they refer to, enabling more precise log analysis.

In other words (this is also an official example), when the sentence "Taro is a friend. He ate yakiniku." is analyzed, "**Taro**" and "**he**" are returned as referring to the same thing.
Having checked this much, I thought:

**"If you preprocess with anaphora resolution before doing something flashy, wouldn't the results of other natural language processing change (become more accurate)?"**

So I decided to do what the title says: "resolve anaphora in sentences with the COTOHA API and save them to a file" (I got sidetracked from the flashy thing I originally meant to do first).
Maybe there is even demand for this; that was also a factor.
This time, I consider an implementation that:

- scrapes an arbitrary text from Aozora Bunko;
- sends the sentences to the COTOHA API anaphora resolution endpoint;
- maps the response from JSON onto classes;
- unifies each coreference chain into a single expression and saves the result to a file.

For example, with the earlier sample, "Taro is a friend. He ate grilled meat." would be saved to a text file as "Taro is a friend. Taro ate grilled meat."
See here for the entire source (it also includes some processing unrelated to anaphora resolution). The folder structure looks like this:
```
├── aozora_scraping.py
├── config.ini
├── cotoha_function.py
├── json_to_obj.py
├── main.py
├── respobj
│   ├── __init__.py
│   └── coreference.py
└── result
```
`__pycache__` and README.md are omitted. The resulting text files are assumed to be stored under the `result` folder.

aozora_scraping.py
```python
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup


def get_aocora_sentence(aozora_url):
    res = requests.get(aozora_url)
    # Initialize BeautifulSoup
    soup = BeautifulSoup(res.content, 'lxml')
    # Get the main text of the Aozora Bunko page
    main_text = soup.find("div", class_="main_text")
    # Remove ruby annotations
    for script in main_text(["rp", "rt", "h4"]):
        script.decompose()
    sentences = [line.strip() for line in main_text.text.splitlines()]
    # Remove empty lines
    sentences = [line for line in sentences if line != '']
    return sentences
```
If you pass an Aozora Bunko URL to `get_aocora_sentence`, it returns the text of that page as a list split at line breaks, with ruby annotations and surrounding whitespace removed.
`main_text = soup.find("div", class_="main_text")` works because the body text of an Aozora Bunko page is wrapped in `<div class="main_text">`.
For how to process Aozora Bunko text, I referred to "I tried to extract and illustrate the stage of the story using COTOHA".
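As a quick sanity check, here is a minimal usage sketch of `get_aocora_sentence` (the URL is the same work used later in main.py; any Aozora Bunko page with a `main_text` div should work):

```python
# Minimal usage sketch: fetch one work and peek at the result.
from aozora_scraping import get_aocora_sentence

sentences = get_aocora_sentence('https://www.aozora.gr.jp/cards/000155/files/832_16016.html')
print(len(sentences))   # number of non-empty lines
print(sentences[:3])    # the first three lines of the body text
```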
cotoha_function.py
```python
# -*- coding:utf-8 -*-
import os
import urllib.request
import json
import configparser
import codecs


# COTOHA API wrapper class
class CotohaApi:
    # Initialization
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.getAccessToken()

    # Get an access token
    def getAccessToken(self):
        # Access token publish URL
        url = self.access_token_publish_url
        # Headers
        headers = {
            "Content-Type": "application/json;charset=UTF-8"
        }
        # Request body
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        res = urllib.request.urlopen(req)
        # Read the response body
        res_body = res.read()
        # Decode the response body from JSON
        res_body = json.loads(res_body)
        # Extract the access token from the response body
        self.access_token = res_body["access_token"]

    # Anaphora resolution (coreference) API
    def coreference(self, document):
        # Coreference API endpoint
        url = self.developer_api_base_url + "v1/coreference"
        # Headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle HTTP errors from the request
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, refresh the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # Print the cause for errors other than 401
            else:
                print("<Error> " + e.reason)
        # Read the response body
        res_body = res.read()
        # Decode the response body from JSON
        res_body = json.loads(res_body)
        # Return the analysis result
        return res_body
```
The COTOHA API wrapper functions are borrowed from "I tried using the COTOHA API, rumored to make natural language processing easy, in Python".
Note, however, that the URL for anaphora resolution has changed from `beta/coreference` to `v1/coreference`. (The endpoints that are currently beta will presumably change someday as well.)
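For reference, here is a sketch of the config.ini that main.py below expects. The section and key names come from the `config.get` calls in main.py; the URL values are the ones COTOHA's developer portal provided at the time, so check your own account page:

```ini
[COTOHA API]
Developer Client id = YOUR_CLIENT_ID
Developer Client secret = YOUR_CLIENT_SECRET
Developer API Base URL = https://api.ce-cotoha.com/api/dev/
Access Token Publish URL = https://api.ce-cotoha.com/v1/oauth/accesstokens
```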
The first half of main.py, which feeds the sentences to anaphora resolution, is as follows. I pasted it as-is because there are parts that need explaining.
main.py
```python
# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    # Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath(__file__)) + "/"
    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")
    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    # URL of the Aozora Bunko work (set to any work's URL)
    aozora_html = 'Any'
    # Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    # Path of the file the original text is saved to
    origin_txt_path = './result/origin_' + now_date + '.txt'
    # Path of the file the result is saved to
    result_txt_path = './result/converted_' + now_date + '.txt'
    # Instantiate the COTOHA API wrapper
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)
    # Get the text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    # Save the original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of lines: " + str(limit_index))
    while end_index <= limit_index and call_api_count <= max_call_api_count:
        length_sentences = len(''.join(temp_sentences))
        if length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index:
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index
```
First of all, you cannot send the entire text in one request (perhaps that goes without saying).
I could not find this mentioned anywhere in the reference, but:

- the maximum text length is about 1800 characters (per "I tried to extract and illustrate the stage of the story using COTOHA");
- the list must have fewer than 20 elements.

(Unverified, but I suspect the limits of the other COTOHA APIs are the same or close.)
Inside the while loop, the if statement implements "pack as many of the line-broken sentences into the list as will fit, then throw them at anaphora resolution".
I insisted on a list of lines rather than a single string of a given length because I suspected accuracy would suffer if anaphora resolution ran across unnatural sentence breaks. (Expected, unverified.)
As for `call_api_count <= max_call_api_count`: the free plan allows 1000 calls per API per day, so this is a crude guard to keep the number of API calls under control.
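The index bookkeeping in the while loop is admittedly hard to follow. Here is a minimal sketch of the same batching rule as a standalone generator (`chunk_sentences` is a hypothetical helper, not part of the repository; the 1800-character and sub-20-element limits are the empirical values above, not documented guarantees):

```python
def chunk_sentences(sentences, max_chars=1800, max_elements=20):
    """Yield consecutive batches of lines that stay under both observed limits."""
    batch = []
    length = 0
    for sentence in sentences:
        # Flush the current batch if adding this line would reach either limit.
        if batch and (length + len(sentence) >= max_chars or len(batch) + 1 >= max_elements):
            yield batch
            batch = []
            length = 0
        batch.append(sentence)
        length += len(sentence)
    if batch:
        yield batch
```

With this, the loop body would reduce to something like `for batch in chunk_sentences(sentences): result.append(json_to_coreference(cotoha_api.coreference(batch)))`, plus the daily call-count guard.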
This may be a matter of taste, but isn't the API response easier to work with when mapped onto classes than when used as a raw dictionary? That is my proposal here.
For the COTOHA API, the dictionary approach seems to be the majority in Qiita articles, so I will post a class-based reference for coreference analysis.
First, let's look at the official example of the JSON format the anaphora resolution response arrives in. (As usual, the result of sending "Taro is a friend. He ate yakiniku."):
coreference.json
```json
{
  "result" : {
    "coreference" : [ {
      "representative_id" : 0,
      "referents" : [ {
        "referent_id" : 0,
        "sentence_id" : 0,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "Taro"
      }, {
        "referent_id" : 1,
        "sentence_id" : 1,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "he"
      } ]
    } ],
    "tokens" : [ [ "Taro", "Is", "friend", "is" ], [ "he", "Is", "Roasted meat", "To", "eat", "Ta" ] ]
  },
  "status" : 0,
  "message" : "OK"
}
```
The class definitions to map this onto are as follows. Stare at the official reference and it becomes clear. Let's define them starting from the deepest level of the JSON hierarchy.

respobj/coreference.py
```python
# -*- coding: utf-8; -*-
from dataclasses import dataclass, field
from typing import List


# Entity (referent) object
@dataclass
class Referent:
    # Entity ID
    referent_id: int
    # Index of the sentence that contains the entity
    sentence_id: int
    # Morpheme number at which the entity starts
    token_id_from: int
    # Morpheme number at which the entity ends
    token_id_to: int
    # Surface form of the mention
    form: str


# Coreference information object
@dataclass
class Representative:
    # Coreference information ID
    representative_id: int
    # Array of entity objects
    referents: List[Referent] = field(default_factory=list)


# Coreference result object
@dataclass
class Result:
    # Array of coreference information objects
    coreference: List[Representative] = field(default_factory=list)
    # Tokens from morphological analysis of each input sentence
    tokens: List[List[str]] = field(default_factory=list)


# Response object
@dataclass
class Coreference:
    # Coreference result object
    result: Result
    # Status code 0: OK, >0: error
    status: int
    # Error message
    message: str
```
Where I got stuck: for some reason the `form` field of the `Referent` class is not explained in the official reference, and it took me a while to notice that `tokens` in the `Result` class is a `List[List[str]]`.
The method that maps the JSON onto the classes (`json_to_coreference` is the one referenced in main.py):
json_to_obj.py
```python
# -*- coding:utf-8 -*-
import json
import codecs
import marshmallow_dataclass
from respobj.coreference import Coreference


def json_to_coreference(jsonstr):
    json_formated = codecs.decode(json.dumps(jsonstr), 'unicode-escape')
    result = marshmallow_dataclass.class_schema(Coreference)().loads(json_formated)
    return result
```
The implementation relies on `dataclasses` and `marshmallow_dataclass`. `marshmallow_dataclass` is often not installed yet (PyPI page).
This is the main reason Python 3.7 is required here. I recommend this approach because when the API specification changes, the affected parts are easy to spot and quick to fix. (It may just be that I am not used to Python dictionaries, so take this as one reference opinion.)
Reference site: JSONize python class
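As a usage sketch (assuming the `cotoha_api` setup from main.py, and the official "Taro" example shown translated, as elsewhere in this article), the parsed object can then be traversed with plain attribute access:

```python
# Sketch: send the official example and read the parsed dataclasses.
response = cotoha_api.coreference(["Taro is a friend.", "He ate yakiniku."])
obj = json_to_coreference(response)

print(obj.status)            # 0 on success
print(obj.result.tokens[0])  # morphemes of the first sentence
for representative in obj.result.coreference:
    # e.g. ['Taro', 'he'] for the official example
    print([referent.form for referent in representative.referents])
```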
The question here is which expression in each coreference chain to unify on. This time, on the prediction that whatever appears first in the text is the main form ~~and because it's easy~~, I unify each chain onto its earliest mention.
main.py
```python
# Second half of main.py
for obj in result:
    coreferences = obj.result.coreference
    tokens = obj.result.tokens
    for coreference in coreferences:
        anaphor = []
        # Unify on the first mention in the coreference chain.
        anaphor.append(coreference.referents[0].form)
        for referent in coreference.referents:
            sentence_id = referent.sentence_id
            token_id_from = referent.token_id_from
            token_id_to = referent.token_id_to
            # Rewrite so that the number of elements in the list does not change for subsequent processing.
            anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
            tokens[sentence_id][token_id_from:(token_id_to + 1)] = anaphor_and_empty
    # Save the converted text to a file
    with open(result_txt_path, mode='a') as f:
        for token in tokens:
            line = ''.join(token)
            f.write(line + '\n')
```
`sentence_id` indicates which sentence in `tokens` the mention belongs to, and `token_id_from` and `token_id_to` mean that the `token_id_from`-th through `token_id_to`-th morphemes of that sentence make up the mention.
The expression to rewrite to is found with `coreference.referents[0].form`, and when rewriting,

```python
# Rewrite so that the number of elements in the list does not change for subsequent processing.
anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
tokens[sentence_id][token_id_from:(token_id_to + 1)] = anaphor_and_empty
```

I do a little extra work: the number of elements removed and the number inserted are forcibly matched.
Without this, the `token_id_from` and `token_id_to` of later mentions would point at the wrong positions. (Please tell me if there is a better way.)
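To see why this index-preserving trick works, here is the replacement applied by hand to the official sample tokens (a toy walk-through, not part of the script):

```python
# Toy walk-through of the index-preserving replacement.
tokens = [["Taro", "Is", "friend", "is"],
          ["he", "Is", "Roasted meat", "To", "eat", "Ta"]]
anaphor = ["Taro"]  # the first mention in the chain

# The mention "he" lives at sentence_id=1, token_id_from=0, token_id_to=0.
token_id_from, token_id_to = 0, 0
anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
tokens[1][token_id_from:(token_id_to + 1)] = anaphor_and_empty

# A multi-morpheme mention (token_id_from < token_id_to) would be padded
# with '' so the sentence keeps the same number of elements.
print(''.join(tokens[1]))  # "Taro" replaces "he"; len(tokens[1]) is still 6
```

For completeness, here is the whole of main.py: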
main.py
```python
# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    # Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath(__file__)) + "/"
    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")
    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    # URL of the Aozora Bunko work
    aozora_html = 'https://www.aozora.gr.jp/cards/000155/files/832_16016.html'
    # Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    # Path of the file the original text is saved to
    origin_txt_path = './result/origin_' + now_date + '.txt'
    # Path of the file the result is saved to
    result_txt_path = './result/converted_' + now_date + '.txt'
    # Instantiate the COTOHA API wrapper
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)
    # Get the text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    # Save the original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of lines: " + str(limit_index))
    while end_index <= limit_index and call_api_count <= max_call_api_count:
        length_sentences = len(''.join(temp_sentences))
        if length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index:
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index
    for obj in result:
        coreferences = obj.result.coreference
        tokens = obj.result.tokens
        for coreference in coreferences:
            anaphor = []
            # Unify on the first mention in the coreference chain.
            anaphor.append(coreference.referents[0].form)
            for referent in coreference.referents:
                sentence_id = referent.sentence_id
                token_id_from = referent.token_id_from
                token_id_to = referent.token_id_to
                # Rewrite so that the number of elements in the list does not change for subsequent processing.
                anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
                tokens[sentence_id][token_id_from:(token_id_to + 1)] = anaphor_and_empty
        # Save the converted text to a file
        with open(result_txt_path, mode='a') as f:
            for token in tokens:
                line = ''.join(token)
                f.write(line + '\n')
```
Here is a FileMerge comparison of part of the original text and the converted text (left: original, right: after conversion).

before:

I had a friend from Kagoshima and learned naturally while imitating that person, so I was good at playing this turf flute.
As I kept blowing it, the teacher looked away and walked away.

↓ after:

Having a friend from Kagoshima, I learned naturally while imitating the person from Kagoshima, so I was good at playing this turf flute.
As I kept blowing this turf flute, the teacher looked away and walked away.

"That person" has become "the person from Kagoshima", and "as I kept blowing it" has become "as I kept blowing this turf flute".
before:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~ about 6 sentences omitted ~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of the illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by it, but he died like a lie.
~

↓ after:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~ about 6 sentences omitted ~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by the shiitake mushrooms, but he died like a lie.
~

The shiitake mushrooms are a failure example: "killed by it" (the illness) was unified onto "the shiitake mushrooms". There are plenty of other painful cases, and symbols are not handled well either.
I suspect texts this old are not really what the API targets; this part is half a joke (results like this are entertaining in their own way).
In both comparisons, the left is the original and the right is the converted text.
In the end, whether the assumption **"if you preprocess with anaphora resolution before doing something flashy, the results of other natural language processing will change (become more accurate)"** holds is hard to say; the result was inconclusive. I think more verification is needed.
To improve conversion accuracy, it seems better to think more carefully about which expression each chain should be unified on, and taking the structure of the sentences into account may also be important.
Japanese is a parade of demonstratives and pronouns, so perfection looks difficult, but some passages convert quite smoothly, and I feel the COTOHA API has considerable potential. That's all!