I'm the type who will read a story just because it's set somewhere local. So I tried using COTOHA's named entity extraction to pull out and visualize the settings of stories.
Google Colaboratory
- WordCloud is the only library that needs to be installed separately. (Google Colaboratory is convenient because many libraries come preinstalled.) Running the following commands completes the setup; the last one clears matplotlib's font cache so that the newly installed Japanese font is picked up.
Install wordcloud & download fonts
!pip install wordcloud
!apt-get -y install fonts-ipafont-gothic
!rm /root/.cache/matplotlib/fontlist-v300.json
Clone Aozora Bunko
!git clone --branch master --depth 1 https://github.com/aozorabunko/aozorabunko.git
1. The part that fetches a novel from Aozora Bunko
from bs4 import BeautifulSoup

def get_word():
    # Path to the cloned HTML (the sample is Osamu Dazai's "Good Bye")
    path_to_html = 'aozorabunko/cards/000035/files/258_20179.html'
    # Parse the HTML with BeautifulSoup
    with open(path_to_html, 'rb') as html:
        soup = BeautifulSoup(html, 'lxml')
    main_text = soup.find("div", class_='main_text')
    # Strip ruby annotations (yomigana) and headings
    for yomigana in main_text.find_all(["rp", "h4", "rt"]):
        yomigana.decompose()
    sentences = [line.strip() for line in main_text.text.strip().splitlines()]
    aozora_text = ','.join(sentences)
    # Split into 1800-character chunks for the COTOHA API calls
    aozora_text_list = [aozora_text[i: i+1800] for i in range(0, len(aozora_text), 1800)]
    return aozora_text_list
The string is also split into 1800-character chunks so that COTOHA can process it. (I haven't checked properly, but 2000 characters failed while 1800 characters ran fine... ~~I should look into it~~)
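To illustrate just the chunking, here is the same slicing applied to a short made-up string with a chunk size of 5 instead of 1800:

```python
# Hypothetical sample: the same slicing logic as in get_word(), chunk size 5
text = "Twenty characters here"
chunks = [text[i: i+5] for i in range(0, len(text), 5)]
print(chunks)  # ['Twent', 'y cha', 'racte', 'rs he', 're']
```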
2. The part that calls the COTOHA API
import urllib.request
import json
import time

client_id = "Your client ID"
client_secret = "Your client secret"
developer_api_base_url = "https://api.ce-cotoha.com/api/dev/nlp/"
access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"

def cotoha_call(sentence):
    # Get an access token
    def getAccessToken():
        url = access_token_publish_url
        headers = {
            "Content-Type": "application/json;charset=UTF-8"
        }
        data = {
            "grantType": "client_credentials",
            "clientId": client_id,
            "clientSecret": client_secret
        }
        data = json.dumps(data).encode()
        req = urllib.request.Request(url, data, headers)
        res = urllib.request.urlopen(req)
        res_body = json.loads(res.read())
        return res_body["access_token"]

    # API URL (named entity recognition)
    base_url_footer = "v1/ne"
    url = developer_api_base_url + base_url_footer
    headers = {
        "Authorization": "Bearer " + getAccessToken(),
        "Content-Type": "application/json;charset=UTF-8",
    }
    data = {
        "sentence": sentence
    }
    data = json.dumps(data).encode()
    time.sleep(0.5)
    req = urllib.request.Request(url, data, headers)
    try:
        res = urllib.request.urlopen(req)
    # Handle request errors
    except urllib.request.HTTPError as e:
        # On 401 Unauthorized or 500 Internal Server Error, reacquire the
        # access token and retry once
        if e.code in (401, 500):
            headers["Authorization"] = "Bearer " + getAccessToken()
            time.sleep(0.5)
            req = urllib.request.Request(url, data, headers)
            res = urllib.request.urlopen(req)
        # Re-raise anything else after showing the reason
        else:
            print("<Error> " + e.reason)
            raise
    res_body = json.loads(res.read())
    return res_body
This is the part that calls COTOHA's named entity recognition API. It retries, after reacquiring the access token, only on 401 and 500 errors. (Note that the original condition `if e.code == 401 or 500` was always true; it is fixed above to check both codes properly.)
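For reference, a minimal usage sketch, assuming the credentials above are filled in (the sentence is a made-up example; it relies only on the `result`, `class`, and `form` response fields that this article already uses):

```python
# Hypothetical example: extract location entities from a single sentence
res = cotoha_call("私は東京から京都へ旅行した。")
for ne in res['result']:
    if ne['class'] == 'LOC':
        print(ne['form'])  # should print place names such as 東京 and 京都
```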
3. The part that visualizes with WordCloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_wordcrowd_mask(text):
    # Specify a Japanese font
    f_path = '/usr/share/fonts/opentype/ipafont-gothic/ipagp.ttf'
    # WordCloud parameters
    wc = WordCloud(background_color="white",
                   width=500,
                   height=500,
                   font_path=f_path,
                   collocations=False,
                   ).generate(text)
    # Draw the result
    plt.figure(figsize=(5, 5), dpi=200)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
This is the part that draws a word cloud from the given text.
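A quick usage sketch (the comma-separated place names are a made-up input; the real input string is built in the next part):

```python
# Hypothetical input: a comma-separated string of place names
get_wordcrowd_mask("東京,京都,大阪,東京,札幌")
```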
The part that runs the whole process
aozora_text_list = get_word()
json_list = []
loc_str = ''
cnt = 0
# Call COTOHA on each chunk, printing progress as it goes
for i in aozora_text_list:
    cnt += 1
    print(str(cnt) + '/' + str(len(aozora_text_list)))
    json_list.append(cotoha_call(i))
# Collect every named entity classified as a location (LOC)
for i in json_list:
    for j in i['result']:
        if j['class'] == 'LOC':
            loc_str = loc_str + j['form'] + ","
get_wordcrowd_mask(loc_str)
This simply runs parts 1 through 3 in order. The progress of the API calls is printed as shown below (the current chunk number out of the n chunks the text was split into).
1/9
2/9
3/9
4/9
5/9
6/9
7/9
8/9
9/9
Some of the extracted words have nothing to do with place names, but on the whole the locations come through well. Below are the results for a few other works.
It's fun...
Both COTOHA and Colab can be used for free, and together they give you an environment where it's easy to try out language processing. It's great!
That's all, thank you for reading!
Finally, can you guess which work the word cloud below is from? (I'll stop here before this gets annoying...)