The dream of automatically transcribing meeting and interview recordings

The API was updated in August 2017 to accept audio up to 3 hours long, so I tried converting audio data into a text file. I work in GCP's Cloud Console, which can be used on the go, so that a recording can be transcribed automatically as soon as an interview is over.

Reference: http://jp.techcrunch.com/2017/08/15/20170814google-updates-its-cloud-speech-api-with-support-for-more-languages-word-level-timestamps/

Environment, language, etc.

Google Cloud Speech API
Google Cloud Storage
Python

Enable Speech API

Enable the Speech API by following the URL below. The first 60 minutes of audio are free; after that you are charged 0.6 cents per 15 seconds. If you are using Google Cloud Platform for the first time, you also receive a $300 credit that is valid for one year (as of August 2017). https://cloud.google.com/speech/docs/getting-started
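To get a feel for the pricing, here is a rough sketch of the arithmetic. It uses only the numbers quoted above (60 free minutes, 0.6 cents per 15 seconds). Rounding up to whole 15-second blocks and the function name `estimate_cost_usd` are my own assumptions, so check the official pricing page before relying on the result.

```python
import math

FREE_SECONDS = 60 * 60       # first 60 minutes are free (figure quoted in this article)
BLOCK_SECONDS = 15           # billing unit quoted in this article
PRICE_PER_BLOCK_USD = 0.006  # 0.6 cents per block


def estimate_cost_usd(total_seconds, already_used_seconds=0):
    """Rough cost estimate; assumes partial blocks are rounded up."""
    free_left = max(0, FREE_SECONDS - already_used_seconds)
    billable = max(0, total_seconds - free_left)
    blocks = math.ceil(billable / BLOCK_SECONDS)
    return round(blocks * PRICE_PER_BLOCK_USD, 3)


if __name__ == "__main__":
    # A single 3-hour recording with none of the free minutes used yet.
    print(estimate_cost_usd(3 * 60 * 60))
```

Under these assumptions, a 3-hour recording leaves 2 billable hours, which is 480 blocks of 15 seconds.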

Create the credentials as a service account key file (JSON format).

API authentication with Google Cloud Shell

Launch Google Cloud Shell and upload the JSON key file from the menu in the upper-right corner.

After uploading, authenticate with the JSON file.

`shell`


$ export GOOGLE_APPLICATION_CREDENTIALS=hogehoge.json
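If the environment variable is wrong, the failure only shows up later when the client is created, so a quick sanity check can save time. The following is a minimal sketch using only the standard library. The function name `check_credentials` is hypothetical, and the check only confirms that the file exists and looks like a service account key, not that the key is actually accepted by Google.

```python
import json
import os


def check_credentials(env=os.environ):
    """Return a description of the problem, or None if the key file looks usable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return "file not found: {}".format(path)
    try:
        with open(path, encoding="utf-8") as f:
            info = json.load(f)
    except ValueError:
        return "not valid JSON: {}".format(path)
    # Service account key files carry "type": "service_account".
    if not isinstance(info, dict) or info.get("type") != "service_account":
        return "not a service account key file: {}".format(path)
    return None


if __name__ == "__main__":
    print(check_credentials() or "credentials file looks usable")
```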

Create audio file

mp3, AAC, and similar formats cannot be used as they are; they must first be converted to a supported format. I tried various combinations, and I recommend the following settings.

FLAC
Mono (1 channel)
16000 Hz
16-bit

(Reference: Online conversion service) https://audio.online-convert.com/convert-to-flac
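I used the online service above, but the same conversion can also be done locally. The following sketch assumes that the `ffmpeg` command-line tool is installed, which is not something this article otherwise uses. The helper names `build_flac_command` and `convert_to_flac` are hypothetical. It simply builds an ffmpeg command with the recommended settings (mono, 16000 Hz, 16-bit; the `.flac` extension selects the FLAC encoder).

```python
import subprocess


def build_flac_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command line for mono, 16-bit FLAC output."""
    return [
        "ffmpeg",
        "-i", src,                # input file, e.g. an mp3 or AAC recording
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # sampling rate in Hz
        "-sample_fmt", "s16",     # 16-bit samples
        dst,                      # output path ending in .flac
    ]


def convert_to_flac(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_flac_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_flac_command("interview.mp3", "interview.flac")))
```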

Conversion

Upload the FLAC file to Google Cloud Storage. See here for how to set up Google Cloud Storage: https://cloud.google.com/storage/docs/quickstart-console?hl=ja

I uploaded the Python file directly to the shell as well. I am not primarily an engineer, so I wrote it while following the tutorial.

`transcribe.py`


#!/usr/bin/env python
# coding: utf-8
"""Transcribe a FLAC file stored on Google Cloud Storage into a text file."""
import argparse
import codecs
import datetime


def transcribe_gcs(gcs_uri):
    # Note: the enums and types modules exist only in google-cloud-speech
    # releases before 2.0.
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()

    # The audio is referenced by its gs:// URI instead of being sent inline.
    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        language_code='ja-JP')

    # Start asynchronous recognition, which is required for long audio.
    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    # Block until the transcription has finished.
    operation_result = operation.result()

    # Name the output file after the current date and time.
    today = datetime.datetime.today().strftime('%Y%m%d-%H%M%S')
    with codecs.open('output{}.txt'.format(today), 'a', 'shift_jis') as fout:
        for result in operation_result.results:
            for alternative in result.alternatives:
                fout.write(u'{}\n'.format(alternative.transcript))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'path', help='GCS path for audio file to be recognized')
    args = parser.parse_args()
    transcribe_gcs(args.path)
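One thing to be careful about in the output step is that writing as shift_jis fails if the transcript contains a character that shift_jis cannot represent. The following is a minimal sketch of a more forgiving writer, not part of the original script. The function name `write_top_transcripts` is hypothetical, and it only assumes objects shaped like the API response (`results` with `alternatives`, each with a `transcript`). It writes the first alternative of each result and replaces characters that cannot be encoded.

```python
import io


def write_top_transcripts(results, path, encoding="shift_jis"):
    """Write the first alternative of each result, one per line.

    Characters that cannot be encoded are replaced instead of raising.
    Returns the number of lines written.
    """
    count = 0
    with io.open(path, "w", encoding=encoding, errors="replace") as fout:
        for result in results:
            if not result.alternatives:
                continue  # nothing was recognized for this chunk
            fout.write(u"{}\n".format(result.alternatives[0].transcript))
            count += 1
    return count
```

Passing `encoding="utf-8"` avoids the replacement entirely if the file does not need to be opened in a shift_jis-only editor.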

Finally, run the following and wait a while for the conversion to finish. Replace BUCKET_NAME with the name of your own bucket.

`shell`


$ python transcribe.py gs://BUCKET_NAME/testmusic.flac
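A mistyped path is only reported after the request has been sent, so it can help to check the argument first. This is a small sketch with a hypothetical helper `parse_gcs_uri`. The pattern is a simplified version of the bucket naming rules (lowercase letters, digits, dots, hyphens and underscores, 3 to 63 characters), so treat it as an approximation rather than the official validation.

```python
import re

# Simplified pattern: gs://<bucket>/<object path>
_GCS_URI = re.compile(r"^gs://([a-z0-9][a-z0-9._-]{1,61}[a-z0-9])/(.+)$")


def parse_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object_name) or raise ValueError."""
    match = _GCS_URI.match(uri)
    if match is None:
        raise ValueError("not a usable gs:// URI: {!r}".format(uri))
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(parse_gcs_uri("gs://my-bucket/testmusic.flac"))
```

Here `my-bucket` is only an example name. A path with spaces or uppercase letters in the bucket part is rejected before any API call is made.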

Caution

Files can be up to 3 hours long.
It takes about 15 minutes to transcribe 1 hour of audio.
Because the input is spoken language, the output contains no punctuation at all (the English version apparently can add punctuation automatically, so I am waiting for the same feature in Japanese).
Occasionally an error saying that the sample rate (hertz) setting does not match appears. In that case, set the sampling rate explicitly in the Python file.

`python`


config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,  # Add this line
    language_code='ja-JP')
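To find out which value to put there, the sampling rate can be read from the FLAC file itself. The sketch below parses the STREAMINFO block at the start of a FLAC stream using only the standard library. The function names are hypothetical, and it assumes the file starts directly with the `fLaC` marker (a file with an ID3 tag prepended would be rejected).

```python
def parse_flac_streaminfo(data):
    """Extract basic audio properties from the first bytes of a FLAC file."""
    if len(data) < 26 or data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(data[18:26], "big")
    return {
        "sample_rate_hertz": packed >> 44,
        "channels": ((packed >> 41) & 0x07) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def read_flac_info(path):
    """Read the header of a FLAC file on disk."""
    with open(path, "rb") as f:
        return parse_flac_streaminfo(f.read(42))


def check_for_speech_api(info):
    """List differences from the settings recommended in this article."""
    problems = []
    if info["sample_rate_hertz"] != 16000:
        problems.append("sample rate is {} Hz".format(info["sample_rate_hertz"]))
    if info["channels"] != 1:
        problems.append("{} channels (not mono)".format(info["channels"]))
    if info["bits_per_sample"] != 16:
        problems.append("{}-bit samples".format(info["bits_per_sample"]))
    return problems
```

The `sample_rate_hertz` value returned by `read_flac_info` is the number to pass in the config if the file was not converted to 16000 Hz.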

When I ran the script again after not using it for a long time, I got "ImportError: cannot import name speech", so I upgraded the library.

`shell`


sudo pip install --upgrade google-cloud-speech
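Before running a long job, it is easy to confirm whether the library can be imported at all. This is a minimal sketch using the standard library's `importlib`. The helper name `is_importable` is hypothetical, and it only checks that the module can be found, not which version is installed.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the module can be located, without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "google.cloud") is missing.
        return False


if __name__ == "__main__":
    if is_importable("google.cloud.speech"):
        print("google-cloud-speech is available")
    else:
        print("google-cloud-speech is missing; install or upgrade it with pip")
```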

Accuracy (impressions)

Things that did not affect accuracy

Microphone sensitivity
Speaking speed
Noise

Things that affected accuracy

The speaker's way of speaking (whether it is clear or not)
Room acoustics (echo)

Surprisingly, room echo affected accuracy considerably. Noise such as the sound of air conditioning did not affect accuracy even when it was quite loud; perhaps it is easy to separate from speech.

Automatic voice transcription with Google Cloud Speech API
