The COTOHA API is running a collaboration campaign with Qiita. I want a PS4 for the upcoming FF7 remake... (-p-)
https://zine.qiita.com/event/collaboration-cotoha-api/
My motives are entirely impure, but I gave natural language processing a try using the COTOHA API. Today is the posting deadline, so this was cutting it close, but I somehow managed to finish...
I tried summarizing news articles using only the APIs provided by COTOHA. The theme is **Yahoo! News headline generation**.
As you may know, every Yahoo! News article comes with a headline. For example, they look like the following.
These headlines that we skim over so casually are actually crafted according to various rules, and there is more to them than meets the eye.
First of all, to communicate simply and consistently in a limited space, the length is capped at **13 characters** (to be exact, 13.5 characters, with half-width characters such as spaces counting as 0.5).
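As an aside, this counting rule is easy to express in code. Here is a minimal sketch of my own (not part of Yahoo!'s or COTOHA's tooling; the Japanese headline below is just an illustrative back-translation):

```python
import unicodedata

def headline_length(s: str) -> float:
    # Full-width characters count as 1; half-width ones (ASCII letters, spaces) as 0.5
    return sum(0.5 if unicodedata.east_asian_width(c) in ('Na', 'H') else 1
               for c in s)

print(headline_length('ゲイツ氏 MS取締役会を退任'))  # 12.5 -- within the limit
```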
The headline also contains **location information**. For incidents and accidents, the importance of the news and the degree of reader interest vary greatly depending on where it happened.
And the words used in the headline are, as a rule, **words taken from the article itself**. Since articles are distributed by various media outlets, the editors apparently avoid twisting an article's content, except where needed to fit the character limit.
If headlines are built from words in the article, I figured the COTOHA API should be able to manage this to some extent.
There are other rules as well, but the ones covered this time are summarized below.
- **[Rule 1] The headline is at most 13 characters**
- **[Rule 2] Include location information in the headline**
- **[Rule 3] Use words from the article in the headline**
[Reference] The secret of Yahoo! News topics "13-character headlines" https://news.yahoo.co.jp/newshack/inside/yahoonews_topics_heading.html
The COTOHA API is a set of **natural language processing and speech recognition APIs** provided by NTT Communications. It offers 14 APIs for natural language and speech processing, such as parsing and speech recognition. https://api.ce-cotoha.com/contents/index.html
This time, I used the **Developers version** of the COTOHA API. It has some restrictions compared to the Enterprise version, but it can be used for free.
I targeted the following article, about Bill Gates retiring from Microsoft. https://news.yahoo.co.jp/pickup/6354056
The headline attached to it was this:
Bill Gates retires from MS board
Hmm. It is indeed concise and easy to understand.
The COTOHA API provides a **summary API**. It is still in beta, but it can **extract the sentences it judges important from a text**.
First, I decided to extract a single sentence using this API.
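For reference, the call looks roughly like this. This is a minimal sketch (the full script is at the end of this post); `ACCESS_TOKEN` and `article_text` are placeholders you would fill in yourself:

```python
import json
import requests

ACCESS_TOKEN = 'your access token'  # obtained from the OAuth endpoint (see full script)
article_text = '...'                # the article body scraped from Yahoo! News

url = 'https://api.ce-cotoha.com/api/dev/nlp/beta/summary'
headers = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Authorization': 'Bearer {}'.format(ACCESS_TOKEN)
}
req_data = {'document': article_text, 'sent_len': '1'}  # extract a single sentence
response = requests.post(url, json.dumps(req_data), headers=headers)
print(response.json()['result'])
```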
{
"result": "Gates retired from management in 2008 and retired as chairman in 2014, but remained on the board.",
"status": 0
}
I extracted it without trouble, but as-is it obviously exceeds 13 characters, so it has to be **shortened**. After some deliberation over how, I decided to proceed by **keeping only the keywords of high importance**.
I previously wrote in a Qiita article that you can extract high-importance keywords using termextract.
[Reference] Qiita tag automatic generator https://qiita.com/fukumasa/items/7f6f69d4f6336aff3d90
The COTOHA API also provides a **keyword extraction API**, which extracts characteristic phrases and words contained in a text as keywords.
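The request is almost identical to the summary call. Again a sketch under the same assumptions as above (`headers` carries the same bearer token; `summary` is the sentence extracted in step 1):

```python
url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/keyword'
req_data = {'document': summary, 'type': 'default', 'do_segment': True}
response = requests.post(url, json.dumps(req_data), headers=headers)
keywords = [item['form'] for item in response.json()['result']]  # in score order
```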
Let's extract keywords from the sentence extracted earlier.
{
"result": [
{
"form": "Chairman",
"score": 14.48722
},
{
"form": "Line",
"score": 11.3583
},
{
"form": "Retired",
"score": 11.2471
},
{
"form": "board of directors",
"score": 10.0
}
],
"status": 0,
"message": ""
}
At this point I was already getting suspicious... the essential **"who"** (Mr. Gates) was not extracted. Well, let's press on for now.
As written in the rules at the beginning, the headline must include location information. COTOHA provides a convenient API for obtaining it: the **named entity extraction API**. Using it, you can extract named entities such as personal names and place names.
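Again a sketch under the same assumptions, keeping only entities whose class is `LOC` (location):

```python
url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/ne'
req_data = {'sentence': summary}
response = requests.post(url, json.dumps(req_data), headers=headers)
ne_loc = [item['form'] for item in response.json()['result']
          if item['class'] == 'LOC']
```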
I tried it on the sentence extracted earlier, but it contained no location information.
When location information is present, the plan is simple: prepend **the extracted location followed by the particle "で" (de, roughly "in/at")** to the beginning of the headline.
Generating a headline (a sentence) just by lining up the extracted keywords seemed difficult. I had no way to do anything as advanced as automatically generating sentences from keywords, and I was quite stuck.
Since I had imposed the restriction of using only the COTOHA API, I went back over the API list, and something clicked: using the **parsing API**, I could **attach particles such as "が" (ga) and "を" (o) to each extracted keyword to connect them**.
This API decomposes text into chunks and morphemes, and annotates them with dependency relations between chunks, dependency relations between morphemes, and semantic information such as part of speech.
In other words, it seemed I could extract which particle belongs with each extracted keyword (I'm not sure how best to put it...). For example, for "the air is delicious" (空気がおいしい), the particle "が" (ga) would be extracted for the keyword "air".
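Concretely, the idea is to parse the summary once, then, for each keyword, find its token and follow the `dependency_labels` entry labeled `case` to pick up its particle. A sketch under the same assumptions as above (`keywords` comes from step 2):

```python
url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/parse'
response = requests.post(url, json.dumps({'sentence': summary}), headers=headers)
# Flatten the per-chunk token lists; token ids are sequential,
# so the list can be indexed directly by token_id
tokens = [t for chunk in response.json()['result'] for t in chunk['tokens']]

keywords_with_particles = []
for keyword in keywords:
    for tok in tokens:
        if tok['form'] == keyword:
            for dep in tok.get('dependency_labels', []):
                if dep['label'] == 'case':  # the particle attached to this token
                    keyword += tokens[dep['token_id']]['form']
                    break
            break
    keywords_with_particles.append(keyword)
```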
Let's use this API to attach particles to the keywords from before.
{
"result": [
{
"chunk_info": {
"id": 0,
"head": 7,
"dep": "D",
"chunk_head": 1,
"chunk_func": 2,
"links": []
},
"tokens": [
{
"id": 0,
"form": "Gates",
"kana": "Gates",
"lemma": "Gates",
"pos": "noun",
"features": [
"Unique",
"Surname"
],
"attributes": {}
},
{
"id": 1,
"form": "Mr",
"kana": "Shi",
"lemma": "Mr",
"pos": "Noun suffix",
"features": [
"noun"
],
"dependency_labels": [
{
"token_id": 0,
"label": "name"
},
{
"token_id": 2,
"label": "case"
}
],
"attributes": {}
},
{
"id": 2,
"form": "Is",
"kana": "C",
"lemma": "Is",
"pos": "Conjunctive particles",
"features": [],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 1,
"head": 4,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": []
},
"tokens": [
{
"id": 3,
"form": "2008",
"kana": "Nisen Hachinen",
"lemma": "2008",
"pos": "noun",
"features": [
"Date and time"
],
"dependency_labels": [
{
"token_id": 4,
"label": "case"
}
],
"attributes": {}
},
{
"id": 4,
"form": "To",
"kana": "D",
"lemma": "To",
"pos": "Case particles",
"features": [
"Continuous use"
],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 2,
"head": 3,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": []
},
"tokens": [
{
"id": 5,
"form": "management",
"kana": "Keiei",
"lemma": "management",
"pos": "noun",
"features": [
"motion"
],
"dependency_labels": [
{
"token_id": 6,
"label": "case"
}
],
"attributes": {}
},
{
"id": 6,
"form": "of",
"kana": "No",
"lemma": "of",
"pos": "Case particles",
"features": [
"Attributive form"
],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 3,
"head": 4,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": [
{
"link": 2,
"label": "adjectivals"
}
]
},
"tokens": [
{
"id": 7,
"form": "Line",
"kana": "Issen",
"lemma": "Line",
"pos": "noun",
"features": [],
"dependency_labels": [
{
"token_id": 5,
"label": "nmod"
},
{
"token_id": 8,
"label": "case"
}
],
"attributes": {}
},
{
"id": 8,
"form": "From",
"kana": "Kara",
"lemma": "From",
"pos": "Case particles",
"features": [
"Continuous use"
],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 4,
"head": 7,
"dep": "P",
"chunk_head": 0,
"chunk_func": 1,
"links": [
{
"link": 1,
"label": "goal"
},
{
"link": 3,
"label": "object"
}
],
"predicate": []
},
"tokens": [
{
"id": 9,
"form": "Retire",
"kana": "Sirizo",
"lemma": "Retire",
"pos": "Verb stem",
"features": [
"K"
],
"dependency_labels": [
{
"token_id": 3,
"label": "nmod"
},
{
"token_id": 7,
"label": "dobj"
},
{
"token_id": 10,
"label": "aux"
},
{
"token_id": 11,
"label": "punct"
}
],
"attributes": {}
},
{
"id": 10,
"form": "Ki",
"kana": "Ki",
"lemma": "Ki",
"pos": "Verb suffix",
"features": [
"Continuous use"
],
"attributes": {}
},
{
"id": 11,
"form": "、",
"kana": "",
"lemma": "、",
"pos": "Comma",
"features": [],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 5,
"head": 7,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": []
},
"tokens": [
{
"id": 12,
"form": "14 years",
"kana": "Juyonen",
"lemma": "14 years",
"pos": "noun",
"features": [
"Date and time"
],
"dependency_labels": [
{
"token_id": 13,
"label": "case"
}
],
"attributes": {}
},
{
"id": 13,
"form": "To",
"kana": "Niha",
"lemma": "To",
"pos": "Conjunctive particles",
"features": [],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 6,
"head": 7,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": []
},
"tokens": [
{
"id": 14,
"form": "Chairman",
"kana": "Kaicho",
"lemma": "Chairman",
"pos": "noun",
"features": [],
"dependency_labels": [
{
"token_id": 15,
"label": "case"
}
],
"attributes": {}
},
{
"id": 15,
"form": "To",
"kana": "Wo",
"lemma": "To",
"pos": "Case particles",
"features": [
"Continuous use"
],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 7,
"head": 9,
"dep": "D",
"chunk_head": 0,
"chunk_func": 3,
"links": [
{
"link": 0,
"label": "agent"
},
{
"link": 4,
"label": "manner"
},
{
"link": 5,
"label": "time"
},
{
"link": 6,
"label": "agent"
}
],
"predicate": [
"past"
]
},
"tokens": [
{
"id": 16,
"form": "Retired",
"kana": "Tay Ninh",
"lemma": "Retired",
"pos": "noun",
"features": [
"motion"
],
"dependency_labels": [
{
"token_id": 1,
"label": "nsubj"
},
{
"token_id": 9,
"label": "advcl"
},
{
"token_id": 12,
"label": "nmod"
},
{
"token_id": 14,
"label": "nsubj"
},
{
"token_id": 17,
"label": "aux"
},
{
"token_id": 18,
"label": "aux"
},
{
"token_id": 19,
"label": "mark"
},
{
"token_id": 20,
"label": "punct"
}
],
"attributes": {}
},
{
"id": 17,
"form": "Shi",
"kana": "Shi",
"lemma": "Shi",
"pos": "Verb conjugation ending",
"features": [],
"attributes": {}
},
{
"id": 18,
"form": "Ta",
"kana": "Ta",
"lemma": "Ta",
"pos": "Verb suffix",
"features": [
"Connect"
],
"attributes": {}
},
{
"id": 19,
"form": "But",
"kana": "Moth",
"lemma": "But",
"pos": "Connection suffix",
"features": [
"Continuous use"
],
"attributes": {}
},
{
"id": 20,
"form": "、",
"kana": "",
"lemma": "、",
"pos": "Comma",
"features": [],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 8,
"head": 9,
"dep": "D",
"chunk_head": 0,
"chunk_func": 1,
"links": []
},
"tokens": [
{
"id": 21,
"form": "board of directors",
"kana": "Torishima Yakkai",
"lemma": "board of directors",
"pos": "noun",
"features": [],
"dependency_labels": [
{
"token_id": 22,
"label": "case"
}
],
"attributes": {}
},
{
"id": 22,
"form": "To",
"kana": "Niha",
"lemma": "To",
"pos": "Conjunctive particles",
"features": [],
"attributes": {}
}
]
},
{
"chunk_info": {
"id": 9,
"head": -1,
"dep": "O",
"chunk_head": 0,
"chunk_func": 4,
"links": [
{
"link": 7,
"label": "manner"
},
{
"link": 8,
"label": "place"
}
],
"predicate": [
"past",
"past"
]
},
"tokens": [
{
"id": 23,
"form": "Remaining",
"kana": "Saw",
"lemma": "Remain",
"pos": "Verb stem",
"features": [
"R"
],
"dependency_labels": [
{
"token_id": 16,
"label": "advcl"
},
{
"token_id": 21,
"label": "nmod"
},
{
"token_id": 24,
"label": "aux"
},
{
"token_id": 25,
"label": "aux"
},
{
"token_id": 26,
"label": "aux"
},
{
"token_id": 27,
"label": "aux"
},
{
"token_id": 28,
"label": "punct"
}
],
"attributes": {}
},
{
"id": 24,
"form": "Tsu",
"kana": "Tsu",
"lemma": "Tsu",
"pos": "Verb conjugation ending",
"features": [],
"attributes": {}
},
{
"id": 25,
"form": "hand",
"kana": "Te",
"lemma": "hand",
"pos": "Verb suffix",
"features": [
"Connect",
"Continuous use"
],
"attributes": {}
},
{
"id": 26,
"form": "I",
"kana": "I",
"lemma": "Is",
"pos": "Verb stem",
"features": [
"A",
"L for continuous use"
],
"attributes": {}
},
{
"id": 27,
"form": "Ta",
"kana": "Ta",
"lemma": "Ta",
"pos": "Verb suffix",
"features": [
"stop"
],
"attributes": {}
},
{
"id": 28,
"form": "。",
"kana": "",
"lemma": "。",
"pos": "Kuten",
"features": [],
"attributes": {}
}
]
}
],
"status": 0,
"message": ""
}
['Chairman', 'From the line', 'Retired', 'To the board']
Finally, let's join the keywords, with the particles attached earlier, into a string of at most 13 characters. Using what we obtained in step 4, this is the result.
Chairman retired from the line
It's not exactly an intriguing headline; it mostly leaves you asking, **"Who retired?"**
However, as I wrote in step 2, there is no "who", nor any company name such as "Microsoft" or "MS", so it feels underwhelming. I therefore decided to **objectively measure how good the generated headline is**.
Here too the COTOHA API can help check the quality of the generated headline: the **similarity calculation API**. Using it, you can **calculate the semantic similarity between two sentences**. The similarity is output in the range 0 to 1, and the closer it is to 1, the more similar the two texts are.
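The call, again as a sketch under the same assumptions (`'kuzure'` is simply the request type used by the full script at the end; `generated_headline` is a placeholder name):

```python
url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/similarity'
req_data = {'s1': title, 's2': generated_headline, 'type': 'kuzure'}
response = requests.post(url, json.dumps(req_data), headers=headers)
print(response.json()['result']['score'])
```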
I calculated the similarity between the headline attached to the article, "Bill Gates retires from MS board", and the generated headline, "Chairman retired from the line".
{
"result": {
"score": 0.9716939
},
"status": 0,
"message": "OK"
}
Oh, isn't 0.97 pretty high...!? (puzzled) Well, if COTOHA says so...
For reference, I also tried other articles.
The article is 4 pages long, but for now I only ran it on page 1. https://news.yahoo.co.jp/pickup/6353834
**● Generated headline**
In the symbolic attention range (similarity: 0.45899978)
Even looking at the headline, it's gibberish... and the similarity is very low too. Since all we do is **string the extracted keywords together in descending order of score**, I suppose this is what you get. Incidentally, some of you may be learning here for the first time that heating something in the microwave ("chin") is abbreviated as **"renchin"**. Or rather, what exactly is the "Chin Revolution"?
This is an article about the game-regulation ordinance in Kagawa Prefecture, which has stirred controversy. https://news.yahoo.co.jp/pickup/6353894
**● Generated headline**
Measures Ordinance Kagawa Prefectural Assembly Enforcement (similarity: 0.2842004)
From the generated headline you can tell it involves the Kagawa Prefectural Assembly, but not that it is about games. What this article most wants to convey, though, is probably that **80% were in favor**. The similarity is very low here as well; neither the extracted sentence nor the generated headline contained that numeric information. Note also that while the article gives the precise figure of 84%, the human headline rounds it to the easier-to-grasp "80%". A rough number conveys the gist better than an exact one; perhaps that is still a uniquely human skill.
This one is from yesterday: apparently it snowed in Tokyo. The cold days continue... https://news.yahoo.co.jp/pickup/6354091
**● Generated headline**
Observed temperature in central Tokyo (similarity: 0.99335754)
This is a case where location information was included: "central Tokyo" plus the particle already consumes 4 characters, so the headline cannot carry much information. The extracted keywords also lean too heavily on numeric information, I feel. And yet the similarity is extremely high, at 0.99...
Calling the headlines generated this time a great success would be a stretch, but it was fun to build. When I looked into summarization to begin with, I found it is broadly divided into the following categories.
- **Extractive**: pick out the sentences judged important from the source text and use them as the summary
- **Abstractive**: grasp the meaning of the text and produce an appropriate summary, using words that need not appear in the original
The COTOHA summary API used this time is the former, the **extractive** type.
However, when you try to create summaries under various rules and constraints, as Yahoo! News does, the extractive type alone is not enough: I felt it would be necessary to **combine it with other services** or to use an **abstractive** summarization service.
Abbreviating words to save characters also seems easy for things like country names, where there are established conventions, but overall I felt the hurdle remains high even with natural language processing technology.
I felt that the day when AI replaces Yahoo! News headline writing (a real craft) is not coming any time soon.
I personally love natural language processing because it's fascinating, but I don't get many chances to use it at work, so I'd like to keep enjoying it on my own time. PS4, please.
The following Qiita article was very helpful regarding summarization.
- Sentence summarization for the era of natural language (Qiita)
import requests
import pprint
import json
import re
from bs4 import BeautifulSoup
base_url = 'https://api.ce-cotoha.com/api/dev/nlp'  # no trailing slash; each endpoint path below starts with '/'
'''
Get access token for COTOHA API
'''
def get_access_token():
url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
req_data = {
'grantType' : 'client_credentials',
        'clientId' : 'Client ID',         # replace with your own client ID
        'clientSecret' : 'Client secret'  # replace with your own client secret
}
headers = {
'Content-Type' : 'application/json'
}
response = requests.post(url, json.dumps(req_data), headers=headers)
token = response.json()['access_token']
return token
'''
Call the summary API
'''
def get_summary(token, document) :
url = base_url + '/beta/summary'
req_data = {
'document' : document,
'sent_len' : '1'
}
headers = {
'Content-Type' : 'application/json;charset=UTF-8',
'Authorization' : 'Bearer {}'.format(token)
}
response = requests.post(url, json.dumps(req_data), headers=headers)
summary = response.json()['result']
return summary
'''
Call the keyword extraction API
'''
def get_keywords(token, document):
url = base_url + '/v1/keyword'
req_data = {
'document' : document,
'type' : 'default',
'do_segment' : True
}
headers = {
'Content-Type' : 'application/json;charset=UTF-8',
'Authorization' : 'Bearer {}'.format(token)
}
response = requests.post(url, json.dumps(req_data), headers=headers)
keywords = [item.get('form') for item in response.json()['result']]
return keywords
'''
Call the named entity extraction API to get information about the location
'''
def get_ne_loc(token,sentence):
url = base_url + '/v1/ne'
req_data = {
'sentence' : sentence
}
headers = {
'Content-Type' : 'application/json;charset=UTF-8',
'Authorization' : 'Bearer {}'.format(token)
}
response = requests.post(url, json.dumps(req_data), headers=headers)
ne = response.json()['result']
ne_loc = []
for item in ne:
if item['class'] == 'LOC':
ne_loc.append(item['form'])
    # The same place name can appear more than once, so deduplicate
if ne_loc:
ne_loc = list(set(ne_loc))
return ne_loc
'''
Call the parsing API
'''
def parse_doc(token, sentence) :
url = base_url + '/v1/parse'
req_data = {
'sentence':sentence
}
headers = {
'Content-Type' : 'application/json;charset=UTF-8',
'Authorization' : 'Bearer {}'.format(token)
}
response = requests.post(url, json.dumps(req_data), headers=headers)
parsed_result = response.json()['result']
    tokens = []
    for chunk in parsed_result:
        for tok in chunk['tokens']:  # 'tok', to avoid shadowing the access token
            if tok:
                tokens.append(tok)
    return tokens
'''
Call the similarity calculation API
'''
def get_similarity(token, doc1, doc2):
url = base_url + '/v1/similarity'
req_data = {
's1' : doc1,
's2' : doc2,
'type' : 'kuzure'
}
headers = {
'Content-Type' : 'application/json;charset=UTF-8',
'Authorization' : 'Bearer {}'.format(token)
}
response = requests.post(url, json.dumps(req_data), headers=headers)
similarity = response.json()['result']
return similarity
'''
Extract the article body from a Yahoo! News article URL
(supports only a single page; multi-page and special article formats are not handled...)
'''
def get_contents(url):
top_page = requests.get(url)
soup = BeautifulSoup(top_page.text, 'lxml')
article_url = soup.find('div',class_=re.compile('pickupMain_articleInfo')).find('a').get('href')
article_page = requests.get(article_url)
soup = BeautifulSoup(article_page.text, "lxml")
for tag in soup.find_all('p',{'class':'photoOffer'}):
tag.decompose()
for tag in soup.find_all('a'):
tag.decompose()
contents = re.sub('\n|\u3000','',soup.find('div',class_=re.compile('articleMain')).getText());
return contents
'''
Extract the title from a Yahoo! News article URL
(this serves as the reference "correct" headline)
'''
def get_title(url):
top_page = requests.get(url)
soup = BeautifulSoup(top_page.text, "lxml")
title = soup.find("title").getText().split(' - ')[0]
return title
'''
Generate a headline (topic) for a Yahoo! News article
'''
def create_news_topic(token, contents):
    # Summarize the article into a single sentence
summary = get_summary(token, contents)
print(summary)
print("------------")
    # If the summary is already 13 characters or less, return it as the headline
    if len(summary) <= 13:
        return summary[:-1]  # drop the trailing "。"
    # Extract keywords and place names from the summary
keywords = get_keywords(token, summary)
print(keywords)
print("------------")
ne_loc = get_ne_loc(token, summary)
print(ne_loc)
print("------------")
topic = ''
    # If location information was found, put it at the head of the headline
    # (if there are several, just use the first one for now)
    if ne_loc:
        topic += ne_loc[0] + 'で'  # attach the particle "de" ("in/at")
        # If the location also appears among the keywords, remove it there
        if ne_loc[0] in keywords:
            keywords.remove(ne_loc[0])
    # Parse the summary
tokens = parse_doc(token, summary)
    # Build the headline while looking up each keyword's particle
for keyword in keywords:
for token in tokens:
if token['form'] == keyword:
print(token)
for dependency_label in token['dependency_labels']:
if dependency_label['label'] == 'case':
keyword += tokens[int(dependency_label['token_id'])]['form']
break
break
if len(topic) + len(keyword) <= 13:
topic += keyword
else:
return topic
return topic
'''
Main
'''
if __name__ == '__main__':
    # URL of the Yahoo! News article you want to generate a headline for
url = 'https://news.yahoo.co.jp/pickup/6354056'
#Extract article content and title
contents = get_contents(url)
title = get_title(url)
print("------------")
print(contents)
print("------------")
print(title)
print("------------")
#Get a token for COTOHA API
token = get_access_token()
#Generate article headlines
topic = create_news_topic(token, contents)
print(topic)
print("------------")
#Calculate the similarity between the original heading and the generated heading
similarity = get_similarity(token, title, topic)['score']
print(similarity)
print("------------")