I'm the type who will read a story just because it's set somewhere local. So I tried using COTOHA's named entity extraction to pull out and visualize the settings of stories.
Google Colaboratory
- WordCloud is the only library that needs to be installed separately. (Google Colaboratory is convenient because many libraries come preinstalled.) Running the following commands completes the setup; the last one clears matplotlib's font cache so that the newly installed Japanese font is picked up.
Install wordcloud & download fonts
!pip install wordcloud
!apt-get -y install fonts-ipafont-gothic
!rm /root/.cache/matplotlib/fontlist-v300.json
Clone Aozora Bunko
!git clone --branch master --depth 1 https://github.com/aozorabunko/aozorabunko.git
1. The part that fetches a novel from Aozora Bunko
from bs4 import BeautifulSoup

def get_word():
    # Path to the cloned HTML (the sample is Osamu Dazai's "Good Bye")
    path_to_html = 'aozorabunko/cards/000035/files/258_20179.html'
    # Parse the HTML with BeautifulSoup
    with open(path_to_html, 'rb') as html:
        soup = BeautifulSoup(html, 'lxml')
    main_text = soup.find("div", class_='main_text')
    # Strip ruby annotations (yomigana) and headings
    for yomigana in main_text.find_all(["rp", "h4", "rt"]):
        yomigana.decompose()
    sentences = [line.strip() for line in main_text.text.strip().splitlines()]
    aozora_text = ','.join(sentences)
    # Split into 1800-character chunks for the COTOHA API calls
    aozora_text_list = [aozora_text[i: i+1800] for i in range(0, len(aozora_text), 1800)]
    return aozora_text_list
The string is also split into 1800-character chunks so that COTOHA can process it. (I haven't checked properly, but 2000 characters failed while 1800 characters ran fine... ~~I should look into it~~)
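To illustrate just the chunking, here is the same slicing applied to a short made-up string with a chunk size of 5 instead of 1800:

```python
# Hypothetical sample: the same slicing logic as in get_word(), chunk size 5
text = "Twenty characters here"
chunks = [text[i: i+5] for i in range(0, len(text), 5)]
print(chunks)  # ['Twent', 'y cha', 'racte', 'rs he', 're']
```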
2. The part that calls the COTOHA API
import urllib.request
import json
import time

client_id = "Your client ID"
client_secret = "Your client secret"
developer_api_base_url = "https://api.ce-cotoha.com/api/dev/nlp/"
access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"

def cotoha_call(sentence):
    # Get an access token
    def getAccessToken():
        url = access_token_publish_url
        headers = {
            "Content-Type": "application/json;charset=UTF-8"
        }
        data = {
            "grantType": "client_credentials",
            "clientId": client_id,
            "clientSecret": client_secret
        }
        data = json.dumps(data).encode()
        req = urllib.request.Request(url, data, headers)
        res = urllib.request.urlopen(req)
        res_body = json.loads(res.read())
        return res_body["access_token"]

    # API URL (named entity recognition)
    base_url_footer = "v1/ne"
    url = developer_api_base_url + base_url_footer
    headers = {
        "Authorization": "Bearer " + getAccessToken(),
        "Content-Type": "application/json;charset=UTF-8",
    }
    data = {
        "sentence": sentence
    }
    data = json.dumps(data).encode()
    time.sleep(0.5)
    req = urllib.request.Request(url, data, headers)
    try:
        res = urllib.request.urlopen(req)
    # Handle request errors
    except urllib.request.HTTPError as e:
        # On 401 Unauthorized or 500 Internal Server Error, reacquire the
        # access token and retry once
        if e.code in (401, 500):
            headers["Authorization"] = "Bearer " + getAccessToken()
            time.sleep(0.5)
            req = urllib.request.Request(url, data, headers)
            res = urllib.request.urlopen(req)
        # Re-raise anything else after showing the reason
        else:
            print("<Error> " + e.reason)
            raise
    res_body = json.loads(res.read())
    return res_body
This is the part that calls COTOHA's named entity recognition API. It retries, after reacquiring the access token, only on 401 and 500 errors. (Note that the original condition `if e.code == 401 or 500` was always true; it is fixed above to check both codes properly.)
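For reference, a minimal usage sketch, assuming the credentials above are filled in (the sentence is a made-up example; it relies only on the `result`, `class`, and `form` response fields that this article already uses):

```python
# Hypothetical example: extract location entities from a single sentence
res = cotoha_call("私は東京から京都へ旅行した。")
for ne in res['result']:
    if ne['class'] == 'LOC':
        print(ne['form'])  # should print place names such as 東京 and 京都
```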
3. The part that visualizes with WordCloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_wordcrowd_mask(text):
    # Specify a Japanese font
    f_path = '/usr/share/fonts/opentype/ipafont-gothic/ipagp.ttf'
    # WordCloud parameters
    wc = WordCloud(background_color="white",
                   width=500,
                   height=500,
                   font_path=f_path,
                   collocations=False,
                   ).generate(text)
    # Draw the result
    plt.figure(figsize=(5, 5), dpi=200)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
This is the part that draws a word cloud from the given text.
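A quick usage sketch (the comma-separated place names are a made-up input; the real input string is built in the next part):

```python
# Hypothetical input: a comma-separated string of place names
get_wordcrowd_mask("東京,京都,大阪,東京,札幌")
```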
The part that runs the whole process
aozora_text_list = get_word()
json_list = []
loc_str = ''
cnt = 0
# Call COTOHA on each chunk, printing progress as it goes
for i in aozora_text_list:
    cnt += 1
    print(str(cnt) + '/' + str(len(aozora_text_list)))
    json_list.append(cotoha_call(i))
# Collect every named entity classified as a location (LOC)
for i in json_list:
    for j in i['result']:
        if j['class'] == 'LOC':
            loc_str = loc_str + j['form'] + ","
get_wordcrowd_mask(loc_str)
This simply runs parts 1 through 3 in order. The progress of the API calls is printed as shown below (the current chunk number out of the n chunks the text was split into).
1/9
2/9
3/9
4/9
5/9
6/9
7/9
8/9
9/9
Some of the extracted words have nothing to do with place names, but on the whole the locations come through well. Below are the results for a few other works.
It's fun...
Both COTOHA and Colab can be used for free, and together they give you an environment where it's easy to try out language processing. It's great!
That's all, thank you for reading!
Finally, can you guess which work the word cloud below is from? (I'll stop here before this gets annoying...)