In August 2017 the Cloud Speech API was updated to accept audio up to 3 hours long, so I tried converting voice recordings to a text file. I used GCP's Cloud Shell console as the environment because it can be used on the go, which means an interview can be transcribed automatically as soon as it is recorded.
Reference: http://jp.techcrunch.com/2017/08/15/20170814google-updates-its-cloud-speech-api-with-support-for-more-languages-word-level-timestamps/
Enable the Speech API by following the URL below. The first 60 minutes of audio are free; after that you are charged 0.6 cents per 15 seconds. If you are using Google Cloud Platform for the first time, you also receive a $300 credit that is valid for one year (as of August 2017). https://cloud.google.com/speech/docs/getting-started
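As a rough worked example of the pricing above, the sketch below estimates the charge for a single recording. The function name is hypothetical, and it assumes that the whole 60-minute free allowance is still available and that partial 15-second blocks are rounded up; check the pricing page for the actual billing rules.

```python
import math


def estimate_cost_usd(audio_seconds, free_seconds=60 * 60,
                      block_seconds=15, cents_per_block=0.6):
    """Rough cost estimate; assumes partial blocks are rounded up."""
    billable = max(0, audio_seconds - free_seconds)
    blocks = int(math.ceil(billable / float(block_seconds)))
    return round(blocks * cents_per_block / 100.0, 4)


# A 3-hour interview: 2 billable hours = 480 blocks of 15 seconds
print(estimate_cost_usd(3 * 60 * 60))
```

Under these assumptions a 3-hour file comes to 480 blocks at 0.6 cents each, or roughly $2.88.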
Create the credentials as a service account key file (JSON format).
Launch Google Cloud Shell and upload the JSON file for authentication from the upper right corner.
After uploading, authenticate with the JSON file.
python
$ export GOOGLE_APPLICATION_CREDENTIALS=hogehoge.json
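Before running anything, it can help to confirm that the variable really points at a service account key. The following is a minimal sketch; the function names and the list of fields checked are my own choices, based on the fields a service account key file normally contains.

```python
import json
import os

# Fields normally present in a service account key file
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(path):
    """Return the expected fields that are absent from a key file."""
    with open(path) as f:
        data = json.load(f)
    return [name for name in REQUIRED_FIELDS if name not in data]


def check_credentials_env():
    """Return (ok, message) describing the state of the variable."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return False, "file not found: {}".format(path)
    missing = missing_key_fields(path)
    if missing:
        return False, "missing fields: {}".format(", ".join(missing))
    return True, "looks like a service account key: {}".format(path)


if __name__ == "__main__":
    print(check_credentials_env()[1])
```

Note that `export` only applies to the current shell session, so the check has to run in the same session as the export.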
MP3, AAC, and similar formats cannot be used as they are; you need to convert them to a supported format first. After trying various options, I recommend the following settings.
(Reference: Online conversion service) https://audio.online-convert.com/convert-to-flac
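If you would rather convert locally than use an online service, ffmpeg can also produce FLAC. The sketch below only builds the command line. The mono, 16 kHz settings and the file names are my assumptions (chosen to match the `sample_rate_hertz=16000` setting that appears later in this post), not necessarily the settings recommended above, and ffmpeg has to be installed separately.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg command that converts src to the format of dst."""
    return [
        "ffmpeg",
        "-i", src,                # input file (mp3, AAC, ...)
        "-ac", str(channels),     # number of audio channels
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,                      # a .flac extension selects FLAC output
    ]


def convert_to_flac(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.check_call(build_ffmpeg_command(src, dst))


print(" ".join(build_ffmpeg_command("interview.mp3", "interview.flac")))
```

Calling `convert_to_flac("interview.mp3", "interview.flac")` would then run the command that the last line prints.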
Upload the FLAC file to Google Cloud Storage. See here for how to set up Google Cloud Storage: https://cloud.google.com/storage/docs/quickstart-console?hl=ja
I uploaded the Python file directly to Cloud Shell. I'm not primarily an engineer, so I put it together while following the tutorial ...
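The upload can also be done from Python instead of the console. This is only a sketch: it assumes the separate `google-cloud-storage` package is installed and that the bucket already exists, and the helper names are hypothetical.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI used to refer to an object in a bucket."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def upload_flac(local_path, bucket_name, blob_name):
    """Upload a local file to an existing bucket and return its gs:// URI."""
    # Imported here so gcs_uri() works even without the library installed.
    from google.cloud import storage

    client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)


print(gcs_uri("my-bucket", "testmusic.flac"))
```

The returned URI has the `gs://bucket/object` form that is passed to the Speech API.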
transcribe.py
#!/usr/bin/env python
# coding: utf-8
"""Transcribe an audio file in Google Cloud Storage to a text file."""
import argparse
import io
import sys
import codecs
import datetime
import locale


def transcribe_gcs(gcs_uri):
    """Transcribe the FLAC file at gcs_uri and write the text to a file."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        language_code='ja-JP')
    # Start asynchronous recognition and block until it finishes
    operation = client.long_running_recognize(config, audio)
    print('Waiting for operation to complete...')
    operationResult = operation.result()
    # Name the output file after the current date and time
    d = datetime.datetime.today()
    today = d.strftime("%Y%m%d-%H%M%S")
    fout = codecs.open('output{}.txt'.format(today), 'a', 'shift_jis')
    # Write every alternative of every result, one per line
    for result in operationResult.results:
        for alternative in result.alternatives:
            fout.write(u'{}\n'.format(alternative.transcript))
    fout.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'path', help='GCS path for audio file to be recognized')
    args = parser.parse_args()
    transcribe_gcs(args.path)
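One thing to watch in the script above: the output file is opened with the `shift_jis` codec, so a transcript containing a character that Shift_JIS cannot represent would raise `UnicodeEncodeError` partway through writing. A minimal sketch of one way around this, using a hypothetical helper name, is to pass `errors='replace'` so that such characters become `?`:

```python
import codecs


def write_transcripts(path, lines, encoding="shift_jis"):
    """Write lines to path, substituting '?' for unencodable characters."""
    with codecs.open(path, "w", encoding, errors="replace") as fout:
        for line in lines:
            fout.write(u"{}\n".format(line))
```

Passing `encoding="utf-8"` instead avoids the substitution entirely, at the cost of needing an editor that opens UTF-8 files correctly.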
Finally, run the following and wait a while for the transcription to finish.
python
$ python transcribe.py gs://Bucket name/testmusic.flac
python
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,  # Add this line
    language_code='ja-JP')
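If you set `sample_rate_hertz` explicitly, it should match the actual sample rate of the file. As a sketch, the value can be read from the STREAMINFO block at the start of a FLAC file with nothing but the standard library. The function name is hypothetical, and the code assumes a plain FLAC file with no extra tag data in front of the `fLaC` marker.

```python
def read_flac_info(path):
    """Return (sample_rate_hz, channels) from a FLAC file's STREAMINFO."""
    with open(path, "rb") as f:
        header = bytearray(f.read(26))
    if len(header) < 26 or bytes(header[:4]) != b"fLaC":
        raise ValueError("not a plain FLAC file: {}".format(path))
    # Bytes 0-3: "fLaC" marker, 4-7: metadata block header,
    # 8-17: block and frame sizes, then a 20-bit sample rate
    # followed by a 3-bit (channels - 1) field.
    sample_rate = (header[18] << 12) | (header[19] << 4) | (header[20] >> 4)
    channels = ((header[20] >> 1) & 0x07) + 1
    return sample_rate, channels
```

For example, `read_flac_info("testmusic.flac")` would return the pair to compare against the value in the config.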
python
sudo pip install --upgrade google-cloud-speech
Surprisingly, room echo affects the accuracy considerably. Noise such as the sound of air conditioning did not affect the accuracy even when it was fairly loud; that kind of noise may simply be easy to separate from speech.