A story of reading a picture book by synthesizing voice with COTOHA API and Cloud Vision API

What is COTOHA API?

It is a set of language-analysis APIs provided by NTT. It covers not only syntactic parsing but also speech recognition and speech synthesis (both paid), so with this alone you can build most of a conversational robot or a speech analysis pipeline. It is full of features that scratch exactly where it itches, such as keyword extraction and filler removal, which until now had to be implemented by hand, and it can also use deep learning to score how well user responses match. If the accuracy holds up, it may well be the strongest option for handling Japanese.

COTOHA API page

It is a service that provides various natural language processing / speech processing APIs such as parsing, anaphora resolution, keyword extraction, speech recognition, and summarization. Utilizing the NTT Group's 40-year research results, such as Japanese dictionaries and technology for classifying the meanings of more than 3000 types of words, advanced analysis can be easily used with APIs.

What I made

I thought it would be very interesting to extract the text from photos of a picture book, analyze the sentences, add some direction, and output the result as a little theater, so I built a prototype. Because of the coronavirus I cannot see my daughter, who is staying at my wife's parents' house, so I made this hoping to play with her once things settle down. If you simply connect character recognition to speech synthesis, you get a flat, monotone reading of hiragana, which is no good. It would not be good for my daughter's education either. That is where the COTOHA API comes in: I use it to read picture books with emotion.

The rough flow is as follows:

  1. Extract text from images with Cloud Vision OCR
  2. Convert hiragana to kanji with Google Transliterate
  3. Correct conversion errors with voice recognition error detection (β) of COTOHA API
  4. Recognize the emotions of sentences by sentiment analysis of COTOHA API
  5. Analyze character personas with COTOHA API user attribute estimation (β)
  6. Use the HOYA VoiceText API to select the best speaker and speaking style, and synthesize speech

Each phase saves its result as a txt or json file, which the next phase then reads.

1. Extract text from images with Cloud Vision OCR

I won't go into detail here because this is not the main topic this time. If you want to know more, please refer to the separate article I wrote. The picture book used for this test is Garth Williams's "Shiroi Usagi to Kuroi Usagi" (The Rabbits' Wedding).

I chose it because the text looked easy to recognize, and because it was the first picture book I bought and I have read it countless times. The source code is below. English text is dropped on the assumption that only Japanese appears in the book.

Source code
import copy
from google.cloud import vision
from pathlib import Path
import re

def is_japanese(text):
    # True if the text contains at least one hiragana character
    return re.search(r'[ぁ-ん]', text) is not None

client = vision.ImageAnnotatorClient()
row_list = []
res_list = []
text_path = "./ehon_text/text.txt"

with open(text_path, 'w') as f:
    for x in range(1, 15):
        p = Path(__file__).parent / "ehon_image/{}.png".format(x)
        with p.open('rb') as image_file:
            content = image_file.read()
        image = vision.types.Image(content=content)
        response = client.text_detection(image=image)
        if len(response.text_annotations) == 0:
            row_list.append("-")
        for lines in response.text_annotations:
            if lines.locale != "ja":
                for text in str(lines.description).split("\n"):
                    if is_japanese(text):
                        print(text)
                        f.write(text + '\n')
            else:
                print(lines.description)
                f.write(lines.description)
            break
        f.write("\n")

The execution result looks like the following (partial excerpt). It is not 100% as expected, but it is reasonably accurate. Most of the text is hiragana, which may make it easier to recognize. "ki" and "sa", and "po" and "bo", seem hard to tell apart and are often confused. The characters in this picture book are small relative to the pictures, so resolution is a big factor; when I photographed only the text at a larger size, it was recognized correctly. Fortunately the resulting text is not displayed this time but synthesized as speech, so even if "tanpopo" (dandelion) becomes "tanbobo", it just sounds like a momentary slip of the tongue and does not feel very wrong. My daughter shouldn't notice.

After a while, the black rabbit sat down.
And I did something that looked very good.
"What's wrong with you?"
I heard a white sword.
"Yeah, I was thinking for a moment."
The black rabbit answered.
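
The hiragana filter from step 1 can be tried on its own, without calling the Vision API. The following is a minimal sketch; the helper `keep_japanese_lines` and the sample text are made up for illustration and are not part of the original script.

```python
import re

# Matches any character in the hiragana range used by the OCR script
HIRAGANA = re.compile(r'[ぁ-ん]')

def is_japanese(text):
    """Return True if the text contains at least one hiragana character."""
    return HIRAGANA.search(text) is not None

def keep_japanese_lines(ocr_text):
    """Drop lines without hiragana, e.g. English credits or page numbers.

    Hypothetical helper: it mirrors the per-line filtering in the script.
    """
    return [line for line in ocr_text.split("\n") if is_japanese(line)]

# Made-up OCR output: one Japanese sentence, one English line, one page number
sample = "しばらくすると、くろいうさぎはすわりこみました。\nPrinted in Japan\n12"
print(keep_japanese_lines(sample))
```

Note that a line written only in katakana or kanji would also be dropped by this check, which is acceptable for a picture book that is mostly hiragana.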

2. Convert hiragana to kanji with Google Transliterate

Hiragana is easier to recognize from images, but the later text-based steps should give better results on sentences that mix kanji and kana. Japanese is a troublesome language: how hard a sentence is for a program to understand changes greatly depending on whether it mixes kanji and kana or is written in hiragana only. Hiragana carries only sound information, so its meaning is hard to analyze. The intonation of the synthesized reading also changes, and the analysis APIs should be more accurate on text that contains kanji.

For now, I used Google Transliterate. The source code is below.

Source code
import urllib.parse
import urllib.request
import json

kanji_text_path = "./ehon_text/kanji_text.txt"

with open('./ehon_text/text.txt', 'r') as f:
    lines = f.readlines()

url = "http://www.google.com/transliterate?"
kanji_text = ""

with open('./ehon_text/kanji_text.txt', 'w') as f:
    for line in lines:
        if line == "\n":
            f.write(line)
        else:
            param = {'langpair':'ja-Hira|ja','text':line.strip().replace(' ','').replace('\u3000','')}
            paramStr = urllib.parse.urlencode(param)
            readObj = urllib.request.urlopen(url + paramStr)
            response = readObj.read()
            data = json.loads(response)
            for text in data:
                kanji_text += text[1][0]
            print(kanji_text)
            f.write(kanji_text)
            kanji_text = ""

The execution result looks like this. It is painful that "What's wrong?" comes out as "Assimilated?". "Kinpouge" (buttercup) was also converted to kanji properly, but that may not be a good thing, because it is doubtful whether the speech synthesis API will read the kanji aloud correctly.

After a while, the black rabbit sat down.
And he made a very sad face.
"Did you assimilate?" I heard a white rabbit song.
"Yeah, I was thinking for a moment," replied the black rabbit.
Then, Nibiki played hide-and-seek in the field where daisies and buttercups were in bloom.
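
The script in step 2 takes the first conversion candidate of every segment in the response and joins them. A small sketch of just that part, run on a hand-written sample, is shown below. The sample JSON is an assumption made up to match the shape the script indexes into (`text[1][0]`), not a captured response.

```python
import json

def first_candidates(segments):
    """Join the first conversion candidate of every segment.

    Each segment is expected to look like [reading, [candidate, ...]],
    which is the shape the original loop relies on.
    """
    return "".join(candidates[0] for _reading, candidates in segments)

# Hand-written sample (assumed shape, not real API output)
sample_response = '[["しろい", ["白い", "しろい"]], ["うさぎ", ["うさぎ", "兎", "ウサギ"]]]'
segments = json.loads(sample_response)
print(first_candidates(segments))
```

Always taking the first candidate is what produces errors like "Assimilated?" above: the top candidate is chosen per segment without looking at the rest of the sentence.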

3. Correct conversion errors with voice recognition error detection (β) of COTOHA API

At this point I got curious: could the conversion errors above be corrected by applying speech recognition error detection (β)? So I tried it. In speech recognition too, short utterances give the parser too little to work with and lead to wrong conversions. This API corrects those, so using it for this purpose should be a reasonable fit. The source code is below. For now, only detections with a score above 0.9 are replaced, using the first correction candidate.

Source code


import requests
import json

access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
api_base_url = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientid = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientsecret = "XXXXXXXXXXXXXXXXXXXXX"

headers = {'Content-Type': 'application/json',}
data = json.dumps({"grantType": "client_credentials","clientId": clientid,"clientSecret": clientsecret})
response = requests.post(access_token_publish_url, headers=headers, data=data)
print(response)
access_token = json.loads(response.text)["access_token"]

api_url = api_base_url + "nlp/beta/detect_misrecognition"
headers = {"Authorization": "Bearer " + access_token, "Content-Type": "application/json;charset=UTF-8"}

with open('./ehon_text/kanji_text.txt', 'r') as f:
    lines = f.readlines()

with open('./ehon_text/kanji_text2.txt', 'w') as f:
    for line in lines:
        print(line)
        data = json.dumps({"sentence": line})
        response = requests.post(api_url, headers=headers, data=data)
        result = json.loads(response.text)
        if result["result"]["score"] > 0.9:
            for candidate in result["result"]["candidates"]:
                if candidate["detect_score"] > 0.9:
                    line = line.replace(candidate["form"], candidate["correction"][0]["form"])
        # print(response)
        # print(json.loads(response.text))
        print(line)
        f.write(line)


The result is as follows. Google Transliterate had mis-converted every occurrence of "the two (rabbits)", and some (but not all) of these were corrected. Nothing was made worse, so I think using it was the right call. (I do wonder whether rabbits should be counted with the counter for animals, though.)

before

Every morning, Nibiki jumped up from his bed and jumped into the morning light. And we enjoyed playing together all day long.

after

Every morning, the two jumped up from their beds and jumped into the morning light. And we enjoyed playing together all day long.
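
The replacement rule from step 3 (apply the first correction only when both the overall score and the per-candidate detection score exceed 0.9) can be isolated into a function and checked offline. This is a sketch: the field names follow the ones the script reads, while the function name `apply_corrections` and the sample response are made up for illustration.

```python
def apply_corrections(line, response, threshold=0.9):
    """Replace detected errors in `line` with their first correction candidate.

    `response` is the parsed JSON of the error detection call. Only the
    fields the original script reads are used here.
    """
    result = response["result"]
    if result["score"] <= threshold:
        return line  # overall confidence too low: leave the line untouched
    for candidate in result["candidates"]:
        if candidate["detect_score"] > threshold and candidate["correction"]:
            line = line.replace(candidate["form"], candidate["correction"][0]["form"])
    return line

# Made-up response: one confident detection and one that is below the threshold
sample_response = {
    "result": {
        "score": 0.95,
        "candidates": [
            {"form": "にびき", "detect_score": 0.97, "correction": [{"form": "二匹"}]},
            {"form": "飛び", "detect_score": 0.40, "correction": [{"form": "跳び"}]},
        ],
    }
}
print(apply_corrections("まいあさ、にびきは飛び起きました。", sample_response))
```

Keeping the threshold as a parameter makes it easy to experiment with how aggressive the correction should be.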

4. Recognize the emotions of sentences by sentiment analysis of COTOHA API

With the COTOHA API, you can extract emotional words from text and get a negative/positive judgment for the sentence as a whole. Some speech synthesis services accept an emotion as a parameter, so if this result is passed along at synthesis time, the reading should come out more emotional. I am not using speech recognition this time, but depending on how it is used, it may also be possible to pick up fine-grained emotions from a user's speech.

Emotion analysis comes in several flavors: some simply return negative or positive, and others return multiple emotions such as happy, sad, and angry as percentages. The COTOHA API gives the former for the sentence as a whole, and something closer to the latter for the individual emotional phrases it finds.

This time, I planned to use separate voices for the white rabbit, the black rabbit, and the narrator.

"What's wrong (sad)" said Rabbit (happy)

If it came out like this, with the three speakers showing different emotions inside one sentence, it would feel strange. I also thought a longer sample would simply give better results, so the text is sent to the API one sentence at a time.

The source code is below.

Source code


import requests
import json
import copy

access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
api_base_url = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientid = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientsecret = "XXXXXXXXXXXXXXXXXXXXX"

headers = {'Content-Type': 'application/json',}
data = json.dumps({"grantType": "client_credentials","clientId": clientid,"clientSecret": clientsecret})
response = requests.post(access_token_publish_url, headers=headers, data=data)
access_token = json.loads(response.text)["access_token"]

api_url = api_base_url + "nlp/v1/sentiment"
headers = {"Authorization": "Bearer " + access_token, "Content-Type": "application/json;charset=UTF-8"}

with open('./ehon_text/kanji_text2.txt', 'r') as f:
    lines = f.readlines()

story = []
text_list = []
page_sentenses = []
aa = {"sentiment": "", "text": ""}
with open('./ehon_json/ehon.json', 'w') as f:
    for line in lines:
        for text in line.split("。"):
            if text.strip():  # skip empty fragments and bare newlines left by split
                data = json.dumps({"sentence": text})
                response = requests.post(api_url, headers=headers, data=data)
                result = json.loads(response.text)
                # print(text)
                # print(result["result"]["sentiment"])
                text_list.append({"sentiment": result["result"]["sentiment"], "text": text})
        story.append(copy.deepcopy(text_list))
        text_list = []
    json.dump(story, f, indent=4, ensure_ascii=False)

The result (an example response) looks like the following. I expected written text to come back as nothing but Neutral, but the emotions rise and fall more than I expected. Both Negative and Positive show up properly, so I think it does its part for an emotional reading.

Every morning, they jumped out of their beds and jumped into the morning light.
{'result': {'sentiment': 'Neutral', 'score': 0.3747452771403413, 'emotional_phrase': []}, 'status': 0, 'message': 'OK'}
And I made a very sad face
{'result': {'sentiment': 'Negative', 'score': 0.6020340536995118, 'emotional_phrase': [{'form': 'Looks very sad', 'emotion': 'N'}]}, 'status': 0, 'message': 'OK'}
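
The per-sentence handling in step 4 comes down to two small operations: splitting a page on the Japanese full stop and keeping only the sentiment label from each response. A minimal offline sketch follows; the function names are made up for illustration, and the sample response reuses the fields shown in the output above.

```python
def split_sentences(page_text):
    """Split one page of text on the Japanese full stop, dropping blank pieces."""
    return [s.strip() for s in page_text.split("。") if s.strip()]

def to_record(text, response):
    """Keep only what the later phases need: the sentiment label and the text."""
    return {"sentiment": response["result"]["sentiment"], "text": text}

# Response with the same fields as the example output above
sample_response = {
    "result": {"sentiment": "Negative", "score": 0.6020340536995118, "emotional_phrase": []},
    "status": 0,
    "message": "OK",
}

sentences = split_sentences("そして、とても悲しそうな顔をしました。「どうしたの?」\n")
print(sentences)
print(to_record(sentences[0], sample_response))
```

Unlike the original loop, this sketch strips surrounding whitespace from each sentence, which is a small deviation chosen to avoid sending stray newlines to the API.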

5. Analyze character personas with COTOHA API user attribute estimation (β)

The COTOHA API has a user attribute estimation (β) function that returns a fairly detailed persona. Speech synthesis services also offer a growing number of speakers, so I wondered whether speakers could be matched automatically from this information.

I really wanted the program to do everything automatically, but I could not come up with logic for deciding which utterance belongs to whom, so this time that part is done by hand. In Japanese picture books the lines are usually enclosed in 「」, so the script first asks how many characters there are, extracts the contents of 「」 with a regular expression, and asks the user to assign an id to each utterance. The narrator's id is 0. The source code is below.

Source code

import copy
import requests
import re
import json

char0 = []
char_num = int(input("Please input number of characters =>"))
for i in range(1, char_num+1):
    exec('char{} = []'.format(i))

with open('./ehon_json/ehon.json', 'r') as f:
    story = json.load(f)

story_list = []
for page in story:
    page_list = []
    for sentense in page:
        # try:
        speech_list = re.split("(?<=」)|(?=「)", sentense["text"])
        for speech in speech_list:
            if speech != "":
                if speech.find("「") > -1:
                    while True:
                        try:
                            print(sentense)
                            print(speech)
                            id = int(input("Please input char ID =>"))
                            if id <= char_num and id > 0:
                                break
                        except ValueError:
                            # Non-numeric input: ask again
                            print("once again")
                    exec('char{}.append(speech)'.format(id))
                    page_list.append({"sentiment": sentense["sentiment"], "text": speech, "char": id})
                else:
                    char0.append(speech)
                    page_list.append({"sentiment": sentense["sentiment"], "text": speech, "char": 0})
    story_list.append(copy.deepcopy(page_list))
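print(story_list)
# Assumption: step 6 loads ./ehon_json/story.json, so save the annotated story here
with open('./ehon_json/story.json', 'w') as f:
    json.dump(story_list, f, indent=4, ensure_ascii=False)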

access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
api_base_url = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientid = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
clientsecret = "XXXXXXXXXXXXXXXXXXXXX"

headers = {'Content-Type': 'application/json',}
data = json.dumps({"grantType": "client_credentials","clientId": clientid,"clientSecret": clientsecret})
response = requests.post(access_token_publish_url, headers=headers, data=data)
access_token = json.loads(response.text)["access_token"]

api_url = api_base_url + "nlp/beta/user_attribute"
headers = {"Authorization": "Bearer " + access_token, "Content-Type": "application/json;charset=UTF-8"}

char_list = []
for i in range(0, char_num+1):
    exec('l = char{}'.format(i))
    data = json.dumps({"document": l})
    response = requests.post(api_url, headers=headers, data=data)
    result = json.loads(response.text)
    char_list.append(result)
    print(result)

with open('./ehon_json/char.json', 'w') as f:
    json.dump(char_list, f, indent=4, ensure_ascii=False)


In this way, I made a list of utterances for each speaker (the narrator, the white rabbit, and the black rabbit) and sent each list to the API. The results are below.

・ Narrator

    {
        "result": {
            "age": "40-49 years old",
            "civilstatus": "married",
            "habit": [
                "SMOKING"
            ],
            "hobby": [
                "COLLECTION",
                "COOKING",
                "FORTUNE",
                "GOURMET",
                "INTERNET",
                "SHOPPING",
                "STUDY",
                "TVGAME"
            ],
            "location": "Kinki",
            "occupation": "employee"
        },
        "status": 0,
        "message": "OK"
    }

・ White rabbit

    {
        "result": {
            "age": "40-49 years old",
            "civilstatus": "married",
            "earnings": "-1M",
            "hobby": [
                "COOKING",
                "GOURMET",
                "INTERNET",
                "TVDRAMA"
            ],
            "location": "Kanto",
            "occupation": "employee"
        },
        "status": 0,
        "message": "OK"
    }

・ Black rabbit

    {
        "result": {
            "age": "40-49 years old",
            "earnings": "-1M",
            "hobby": [
                "INTERNET"
            ],
            "location": "Kanto",
            "occupation": "employee"
        },
        "status": 0,
        "message": "OK"
    }

From the top: the narrator, the white rabbit, and the black rabbit. Hmm? This result is a little disappointing. What's more, the documentation says attributes such as "gender" are returned, but it was not included in the results. Is that because it is still in beta? Then again, this is a story about getting married, so the characters are presumably adults, and the result may be unexpectedly accurate. To get real accuracy, I suspect you would have to send a huge conversation log.

If the accuracy here improves, it may become possible to build a template for each character to some extent and use speech recognition to talk with the characters in the picture book. For now, I picked plausible voices by hand based on this result.
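
The quote splitting and per-speaker grouping in step 5 can also be written without `exec`, using a dictionary keyed by speaker id. The following is a sketch under a few assumptions: it needs Python 3.7 or later (where `re.split` splits on empty matches), the names `split_speech` and `group_by_speaker` are made up, and the `ask_speaker` callback stands in for the interactive `input()` prompt.

```python
import re
from collections import defaultdict

# Zero-width boundaries: just after a closing 」 or just before an opening 「
QUOTE_BOUNDARY = re.compile("(?<=」)|(?=「)")

def split_speech(text):
    """Split a sentence into quoted lines and narration, dropping empty pieces."""
    return [part for part in QUOTE_BOUNDARY.split(text) if part]

def group_by_speaker(sentences, ask_speaker):
    """Collect utterances per speaker id; narration always goes to id 0.

    `ask_speaker` is a hypothetical callback that returns the character id
    for a quoted line (the original script asks the user via input()).
    """
    utterances = defaultdict(list)
    for text in sentences:
        for part in split_speech(text):
            speaker_id = ask_speaker(part) if part.startswith("「") else 0
            utterances[speaker_id].append(part)
    return dict(utterances)

sample = ["「どうしたの?」としろいうさぎがききました"]
print(split_speech(sample[0]))
print(group_by_speaker(sample, ask_speaker=lambda part: 1))
```

Each value of the resulting dictionary can then be sent as the `document` list for one speaker, which is what the original loop does with `char0`, `char1`, and so on.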

6. Use the HOYA VoiceText API to select the best speaker and speaking style, and synthesize speech.

Finally, all of this information is combined for speech synthesis. The COTOHA API's speech synthesis did not let me specify emotions, and I had only registered for the free plan, so this time I used HOYA's VoiceText. What I really wanted was to synthesize my own voice with Coestation and make an app in which Dad reads aloud at any time, but that was not possible for an individual.

By the way, please note that HOYA's synthetic voice is also licensed in a way that prohibits secondary distribution and the like.

VoiceText Web API

Commercial use, secondary use and distribution of audio data created with the free version is prohibited. Please check the terms of use before using this service.

This time, the speakers are assigned as follows:

narrator:"hikari"
White Rabbit:"haruka"
Black Rabbit:"takeru"

The emotions are mapped as follows:

    "Neutral":""
    "Positive":"happiness"
    "Negative":"sadness"

Also, when synthesizing normally the end of the audio gets cut off because the trailing buffer is too short, so the SSML tag <vt_pause=1000/> is appended after every utterance to lengthen the file.
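
The mapping from a story entry to synthesis settings can be sketched without calling the VoiceText API. The function `voice_settings` below is hypothetical; it only shows how the speaker table, the emotion table, and the trailing pause tag fit together, with `None` standing in for "no emotion" on Neutral sentences.

```python
# Speaker per character id, as assigned above (0 is the narrator)
SPEAKER = {0: "hikari", 1: "haruka", 2: "takeru"}

# Sentiment label to emotion name; None means no emotion is requested
EMOTION = {"Neutral": None, "Positive": "happiness", "Negative": "sadness"}

# Pause appended after every utterance so the end of the audio is not cut off
PAUSE_TAG = "<vt_pause=1000/>"

def voice_settings(speech):
    """Turn one story entry into the settings needed for synthesis."""
    return {
        "speaker": SPEAKER[speech["char"]],
        "emotion": EMOTION[speech["sentiment"]],
        "text": speech["text"] + PAUSE_TAG,
    }

entry = {"sentiment": "Negative", "text": "「うん、ちょっと考えてたんだ」", "char": 2}
print(voice_settings(entry))
```

Separating this mapping from the API call makes it easy to check the speaker and emotion choices before spending any synthesis requests.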

Source code

from voicetext import VoiceText
import copy
import json

speaker = {
    0:"hikari",
    1:"haruka",
    2:"takeru"
}

emotion = {
    "Neutral":"",
    "Positive":"happiness",
    "Negative":"sadness"
}

play_list = []
vt = VoiceText('XXXXXXXXXXXXXXXXX')
with open('./ehon_json/story.json', 'r') as f:
    story = json.load(f)
    for i, page in enumerate(story):
        play = {"image": "./ehon_image/{}.png".format(i+1), "voice":[]}
        voice_list = []
        for j, speech in enumerate(page):
            print(speech)
            if speech["sentiment"] == "Neutral":
                vt.speaker(speaker[speech["char"]])
            else:
                vt.speaker(speaker[speech["char"]]).emotion(emotion[speech["sentiment"]])
            with open('./ehon_speech/{}_{}.wav'.format(i+1, j+1), 'wb') as f:
                print(speech["text"])
                f.write(vt.to_wave(speech["text"] + '<vt_pause=1000/>'))
            voice_list.append('./ehon_speech/{}_{}.wav'.format(i+1, j+1))
        play["voice"] = copy.deepcopy(voice_list)
        play_list.append(copy.deepcopy(play))
        voice_list = []


with open('./play_json/play.json', 'w') as f:
    json.dump(play_list, f, indent=4, ensure_ascii=False)


Finally

The audio generated this time by these methods is played back in synchronization with the loaded image.
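
The playback code itself is not shown, but the structure of play.json suggests a simple ordering: show a page image, then play that page's audio files in sequence. The sketch below only produces that ordering as a list of events; `playback_events` is a made-up name, and how the image is displayed and the audio is played is left open.

```python
def playback_events(play_list):
    """Yield ("show", image) followed by ("play", wav) for each page, in order.

    `play_list` has the structure written to play.json:
    [{"image": <path>, "voice": [<wav path>, ...]}, ...]
    """
    for page in play_list:
        yield ("show", page["image"])
        for wav in page["voice"]:
            yield ("play", wav)

# Small sample in the same structure as play.json
sample_play_list = [
    {"image": "./ehon_image/1.png", "voice": ["./ehon_speech/1_1.wav", "./ehon_speech/1_2.wav"]},
    {"image": "./ehon_image/2.png", "voice": []},
]
for event in playback_events(sample_play_list):
    print(event)
```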

For now, I am only posting part of it. This was quite fun to make. Looking ahead, it would be interesting to turn it into a "picture book reading camera" device with something like a Raspberry Pi, and connecting it to a projector to make a theater also seems good. Tying it in more closely with the Vision API could create an interesting experience that links words and pictures. It might also be fun to have it read a picture book you drew yourself, or to have it narrate doodles added to a picture book. Emotions can be obtained in fairly small units, so background music and sound effects could be added with a little more work.

There seems to be a lot more to play with in the COTOHA API, so I'd like to write another article if I keep building on this.

For the time being, of course, I will read picture books to my daughter myself. By the way, she is now 1.5 months old.
