[Introduction to AWS] I tried playing with voice-text conversion ♪

It looks like it should be the same kind of task, but there is almost no material on it. I managed to get it working by following the references. Last night I posted about voice generation; tonight I converted that audio to text.

【Reference】
① Getting Started (AWS SDK for Python (Boto))
② Transcribe the voice with Amazon Transcribe
③ Create a transcription pipeline with S3 → Lambda → Transcribe → S3

Following Reference ①, I wrote the code below. It is almost identical to Reference ①, with one addition taken from Reference ③: the output destination is specified with OutputBucketName='bucket name'. Without this, I could not tell where the result was written.

Even with the bucket specified, ~~the file seemed to be a hidden file~~ ⇒ when I finally looked again, the file was there.

from __future__ import print_function
import time
import boto3

transcribe = boto3.client('transcribe')
job_name = "test_tran3"  # must not collide with an existing job name
job_uri = "https://Bucket name.s3.amazonaws.com/speech.mp3"  # replace "Bucket name" with your bucket

# Start the transcription job
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='mp3',  # wav, mp4, mp3
    LanguageCode='ja-JP',  # or 'en-US'
    OutputBucketName='muauanmp3'  # bucket that receives <job_name>.json
)

# Poll every 5 seconds until the job has finished (successfully or not)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)

Running the code above gives the following output. "Not ready yet..." is printed once every 5 seconds, and since it appears six or more times, the job seems to take about 30 seconds. The resulting JSON is then printed, but it is hard to make much sense of.

$ python3 boto_transcribe.py
Not ready yet...
．．．
Not ready yet...
{'TranscriptionJob': {'TranscriptionJobName':..., 'content-length': '506', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}
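
The printed dictionary is hard to read, but only a few of its fields matter in practice. The sketch below pulls them out. `summarize_job` is a hypothetical helper name, and the sample dictionary is made up for illustration; the field names follow the `get_transcription_job` response shape as I understand it, so treat them as assumptions to check against your own output.

```python
def summarize_job(status):
    """Pick the commonly useful fields out of a get_transcription_job response."""
    job = status["TranscriptionJob"]
    return {
        "name": job["TranscriptionJobName"],
        "state": job["TranscriptionJobStatus"],
        # Expected only once the job has completed.
        "transcript_uri": job.get("Transcript", {}).get("TranscriptFileUri"),
        # Expected only when the job has failed.
        "failure_reason": job.get("FailureReason"),
    }


# Made-up sample shaped like a response; this is not real output.
sample = {
    "TranscriptionJob": {
        "TranscriptionJobName": "test_tran3",
        "TranscriptionJobStatus": "COMPLETED",
        "LanguageCode": "ja-JP",
        "MediaFormat": "mp3",
        "Transcript": {"TranscriptFileUri": "https://example.invalid/test_tran3.json"},
    }
}

print(summarize_job(sample))
```

In the script above, `print(summarize_job(status))` could replace `print(status)` to get a shorter, more readable result.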

So, as in Reference ②, check the files in the S3 bucket. You can see that files such as test_tran3.json have been written alongside the audio file speech.mp3.

$ aws s3 ls s3://Bucket name
2020-06-19 04:52:36          2 .write_access_check_file.temp
．．．
2020-06-18 23:44:17      35467 speech.mp3
．．．
2020-06-19 04:45:47       1472 test_tran2.json
2020-06-19 04:54:09       1663 test_tran3.json

Then copy s3://Bucket name/test_tran3.json to your EC2 server.

$ aws s3 cp s3://Bucket name/test_tran3.json ./
download: s3://Bucket name/test_tran3.json to ./test_tran3.json

Finally, print the transcript from the JSON with the following command. When the language is specified correctly, the output is correct, as in the first result below. When the same audio file is transcribed with English specified, the result comes out in the alphabet, as in the second result, and it is nonsense! Still, this is voice-text conversion of a sort.

$ cat test_tran3.json |jq .results[][0].transcript
"Hello also Yokohama Tokyo cloudy little voice is Mizuki's"

$ cat test_tran2.json |jq .results[][0].transcript
"Tokyo, Yokohama, Moscow See Commodities, Cueva Mitic Sundays."

However, in actual use I would rather do this in Python code than by hand. After some investigation, I found that the following references can be used.

【Reference】
④ Upload and download files to S3 using boto3
⑤ Read JSON string / file with pandas (read_json)
⑥ Explanation of array nesting structure and value acquisition method in JSON using Python!

Putting these methods into code gives the following. In other words, the method is:
① Download the JSON file
② Read it with pandas
③ Output the required part

import boto3
import pandas as pd

s3 = boto3.resource('s3')  # Get the S3 resource

bucket = s3.Bucket('Bucket name')  # Bucket that holds the transcript
bucket.download_file('test_tran3.json', 'test_tran3.json')  # Download to EC2: (key in bucket, local file name)
df = pd.read_json('test_tran3.json')  # Read the JSON file with pandas

# Extract the transcript string; the label 'transcripts' is used instead of
# position [1] so that the lookup does not depend on the row order
print(df['results']['transcripts'][0]['transcript'])

After this series of steps, the following sentence was obtained successfully.

If you look closely at the result, you can see that the text is split into separate words.

Hello also Yokohama Tokyo cloudy little voice is Mizuki's
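
pandas is convenient, but the same value can also be reached with the standard `json` module alone. This is a minimal sketch, assuming the output file has a `results.transcripts[0].transcript` entry and a `results.items` list of recognized tokens. The function names are my own, and the sample below is a made-up stand-in, not the real test_tran3.json.

```python
import json


def extract_transcript(doc):
    """Return the full transcript string from a parsed Transcribe output document."""
    return doc["results"]["transcripts"][0]["transcript"]


def extract_words(doc):
    """Return the recognized tokens in order, taking the first alternative of each item."""
    return [item["alternatives"][0]["content"] for item in doc["results"]["items"]]


# Made-up stand-in for the downloaded file (assumed layout).
sample_json = """
{
  "jobName": "test_tran3",
  "results": {
    "transcripts": [{"transcript": "hello world"}],
    "items": [
      {"start_time": "0.0", "end_time": "0.5", "type": "pronunciation",
       "alternatives": [{"confidence": "0.99", "content": "hello"}]},
      {"start_time": "0.5", "end_time": "1.0", "type": "pronunciation",
       "alternatives": [{"confidence": "0.98", "content": "world"}]}
    ]
  },
  "status": "COMPLETED"
}
"""

doc = json.loads(sample_json)
print(extract_transcript(doc))
print(extract_words(doc))
```

For the real file, the document would be loaded with `json.load` on the opened test_tran3.json instead of `json.loads(sample_json)`.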

・Variations

Application to apps

Even on its own, this seems usable as-is for things like meeting minutes and translation. In addition, combined with last night's text-to-speech, the following sequence can be built.

text-voice-．．．-voice-text

Various kinds of processing are possible for the "．．．" part, for example reading papers and materials aloud and recording the resulting sequence of questions as text. In other words, the initial text / voice and the processed voice / text may be different. Other sequences are also possible. In the case of a conversation app, the above arrangement is reversed.

voice-Text-Conversation App-Text-voice

This is a sequence like Alexa's. In this case the processing in the middle is text-based, so it seems that translation could be done in the usual way.
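
The voice-text-conversation app-text-voice sequence can be sketched as three stages chained together. Everything here is hypothetical: `make_voice_app` and the three stand-in stages are names I made up, and the stand-ins only pass bytes and strings around. In a real app the first stage would be the transcription code from this post and the last stage would be last night's text-to-speech.

```python
from typing import Callable


def make_voice_app(
    speech_to_text: Callable[[bytes], str],
    converse: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain voice -> text -> conversation app -> text -> voice."""

    def run(audio_in: bytes) -> bytes:
        question = speech_to_text(audio_in)  # voice -> text
        answer = converse(question)          # text -> text (the conversation app)
        return text_to_speech(answer)        # text -> voice

    return run


# Stand-in stages for illustration only; they do no real audio processing.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_bot(text: str) -> str:
    return "You said: " + text


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


app = make_voice_app(fake_stt, fake_bot, fake_tts)
print(app(b"hello"))
```

Because the middle stage only sees text, a translation step could be swapped in for `fake_bot` without touching the voice stages.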

Voice QA

I think an Alexa-like voice QA app can be built. If questions are accepted by voice from a smartphone or similar device and the above application runs behind it, real-time voice QA also seems feasible.

Twitter assistance

This is not limited to Twitter; the point is that both input and output can be done by voice. However, building these apps will take real effort.

Summary

・I played with voice-text conversion
・I was able to create the whole series of steps in Python

・If the JSON file from a previous run exists, the same job cannot be run a second time, so to run it under the same job name every time, the earlier results need to be deleted as part of the sequence
・Let's make some kind of application
・Let's do text translation
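
On the rerun problem: as far as I know, Transcribe rejects a start request whose job name already exists, so there are two ways around it. The sketch below shows both: deleting the old job and its JSON output, or simply generating a new name each time (a different technique from the deletion described above). `fresh_job_name` and `reset_job` are hypothetical helper names. `reset_job` expects a boto3 Transcribe client and a boto3 S3 Bucket resource to be passed in, assumes the output key is `<job_name>.json`, and does not handle the case where the job does not exist yet.

```python
import uuid


def fresh_job_name(prefix="test_tran"):
    """Build a job name with a random suffix so it does not clash with earlier jobs."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def reset_job(transcribe_client, s3_bucket, job_name):
    """Delete an earlier job and its JSON output so the same name can be reused.

    transcribe_client: a boto3 Transcribe client, e.g. boto3.client('transcribe')
    s3_bucket: a boto3 S3 Bucket resource, e.g. boto3.resource('s3').Bucket(...)
    """
    # Remove the job record itself.
    transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)
    # Remove the transcript file written to the output bucket (assumed key).
    s3_bucket.Object(f"{job_name}.json").delete()


print(fresh_job_name())
```

Calling `reset_job` before `start_transcription_job` keeps a fixed name like test_tran3 reusable, while `fresh_job_name` avoids the deletion altogether at the cost of accumulating jobs and JSON files.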