Speech recognition of wav files with Google Cloud Speech API Beta

*** Information as of August 2016 ***

Notes from trying out speech recognition of wav files with the Google Cloud Speech API Beta.

## CLOUD SPEECH API

As the Google Cloud Speech API Beta page describes, this is an API for speech recognition. Its listed features:

- Supports 80 languages
- Resistant to noise
- Contextual recognition
- Device independent
- Supports both real-time audio and recorded files

It appears to be an easy-to-use, high-performance ASR (automatic speech recognition) service.

### Documentation

- Official documentation
- Python sample code

## How to use from CLI (Google Cloud SDK + curl)

Following the Quickstart:

1. Create a Google Cloud Platform account.
2. Create a project and enable the Speech API.
3. Generate a Service Account key file and download it to your local machine.
4. Install the Google Cloud SDK command-line tool and use the Service Account key file above to get an authentication token.
5. Using that authentication token, send audio data such as a wav file prepared in advance to the API and get the recognition result.

In short: generate a Service Account key file (JSON) that contains a private key, then use it to obtain an authentication token whenever you need one.

### From project creation to Service Account key file acquisition

Follow the Set Up Your Project section of the Quickstart.

However, in step 6 (creating a service account), the "New service account" form has a Role field that the documentation does not mention, which confused me.

After registering the Service Account you can download the JSON file; save it wherever you like. It contains a private key, so do not expose it publicly.

### Get an authentication token with the Google Cloud SDK

Install the Google Cloud SDK so that you can run the `gcloud` command.
Then obtain an authentication token using the Service Account key file obtained above:

$ gcloud auth activate-service-account --key-file=<path to Service Account key file>
$ gcloud auth print-access-token

Make a note of the authentication token that is returned.
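
If you would rather not copy the token by hand, the command can be run from a script. This is a minimal sketch in Python, assuming `gcloud` is on the PATH and already authenticated. `get_access_token` and its `run` parameter are hypothetical names introduced here; the parameter exists only so the command runner can be swapped out when testing.

```python
import subprocess


def get_access_token(run=subprocess.run):
    """Return the token printed by `gcloud auth print-access-token`."""
    result = run(
        ["gcloud", "auth", "print-access-token"],
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as str
    )
    return result.stdout.strip()
```

Calling `get_access_token()` with no arguments runs the real command.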

### API call with curl

Create `sync-request.json` as described in Make a Speech API Request in the Quickstart:

`sync-request.json`


{
  "config": {
      "encoding":"FLAC",
      "sample_rate": 16000
  },
  "audio": {
      "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}

In the directory containing sync-request.json, run:

$ curl -s -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer <authentication token obtained above>" \
    https://speech.googleapis.com/v1beta1/speech:syncrecognize \
    -d @sync-request.json

If all goes well, the recognition result is returned as JSON.
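
The same request can be made without curl. This is a minimal sketch in Python using only the standard library, assuming the v1beta1 endpoint and request body from the example (as of August 2016). `build_sync_request` and `sync_recognize` are hypothetical helper names, not part of any Google library.

```python
import json
import urllib.request

# Endpoint taken from the curl example (v1beta1, as of August 2016).
ENDPOINT = "https://speech.googleapis.com/v1beta1/speech:syncrecognize"


def build_sync_request(body, access_token):
    """Build (but do not send) the POST request that the curl command makes."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + access_token,
        },
        method="POST",
    )


def sync_recognize(body, access_token):
    """Send the request and return the decoded JSON response."""
    request = build_sync_request(body, access_token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Unlike the curl command with `-k`, `urlopen` verifies the server certificate by default. Loading `sync-request.json` with `json.load` and passing it as `body` reproduces the request above.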

### How to specify the audio data and recognition settings

The location and format of the input file are specified as JSON in the request body (`sync-request.json` in the example above). The example `sync-request.json` uses a sample FLAC file already stored in Google Cloud Storage, but you can of course also send audio data you have locally, and encodings other than FLAC are supported.

#### Send a local audio file

As described under SyncRecognize in the REST API reference, specify the audio format and recognition settings in `config` of the request body, and the audio data in `audio`.

The audio is specified as a [RecognitionAudio](https://cloud.google.com/speech/reference/rest/v1beta1/RecognitionAudio), which takes either `uri` or `content`. To send a local audio file, Base64-encode it into a string and send it as `content`.

The sample request sets the encoding to FLAC and the sampling rate to 16000 (16 kHz), so change these to match the audio data you send.
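
For a local wav file, the request body can be built in code. This is a minimal sketch in Python, assuming a mono, 16-bit PCM wav file and the `LINEAR16` encoding value from the v1beta1 reference. `build_body_from_wav` is a hypothetical helper name, and the `sample_rate` key simply mirrors the sample request.

```python
import base64
import wave


def build_body_from_wav(path):
    """Build a syncrecognize request body from a local 16-bit PCM wav file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples for LINEAR16")
        sample_rate = wav.getframerate()
        # Raw PCM frames only; the wav header is not included.
        frames = wav.readframes(wav.getnframes())
    return {
        "config": {
            "encoding": "LINEAR16",      # assumption: uncompressed 16-bit PCM
            "sample_rate": sample_rate,  # read from the file, not hard-coded
        },
        "audio": {
            # Base64 text, as required for the `content` field.
            "content": base64.b64encode(frames).decode("ascii"),
        },
    }
```

Reading the sampling rate from the file header avoids a mismatch between `config` and the audio actually sent.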

## Use Speech API with python

As shown in the [Tutorial](https://cloud.google.com/speech/docs/rest-tutorial), you can also call the Speech API from Python instead of using the `gcloud` command + curl (a Node.js sample is available too).
This procedure does not require the Google Cloud SDK, but it does require the [Google API Client Library](https://developers.google.com/api-client-library/python/start/installation). I had assumed no library would be needed since curl works, but the sample uses the [API Discovery Service](https://developers.google.com/discovery/) and the Google API Client Library to obtain the authentication token. If you do not need these, you can do without the library by making the same request as the curl example above.

### Get Service Account key file

Same as steps 1-3 of the CLI procedure above.

### Application Default Credential settings

The procedure follows the [Sample Code](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api/speech_rest.py), except that the Service Account key file used to obtain the authentication token must be set in the environment variable `GOOGLE_APPLICATION_CREDENTIALS` beforehand:

`$ export GOOGLE_APPLICATION_CREDENTIALS=<path to Service Account key file>`

Apparently the sample references this as the [Application Default Credential](https://cloud.google.com/speech/docs/common/auth#authenticating_with_application_default_credentials) and obtains the authentication token through the `GoogleCredentials.get_application_default().create_scoped()` method.
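
Before running the sample, it can help to confirm that the variable really points at a usable key file. This is a minimal sketch in Python using only the standard library, assuming the key file is the JSON downloaded for the Service Account, which contains `client_email` and `private_key` fields. `check_credentials_env` is a hypothetical helper, not part of the sample code.

```python
import json
import os


def check_credentials_env(environ=os.environ):
    """Validate GOOGLE_APPLICATION_CREDENTIALS and return the account email."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path, encoding="utf-8") as key_file:
        info = json.load(key_file)
    missing = [key for key in ("client_email", "private_key") if key not in info]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    # Return only the non-secret account name; never print the private key.
    return info["client_email"]
```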

### API call

Following the [Sample Code](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api/speech_rest.py), run:

`$ python speech_rest.py <audio file>.wav`

The recognition result is then displayed.
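
The result comes back as nested JSON. This is a minimal sketch in Python for pulling out just the recognized text, assuming the response has the `results` / `alternatives` / `transcript` shape described in the v1beta1 REST reference. `extract_transcripts` is a hypothetical helper name.

```python
def extract_transcripts(response):
    """Return the first transcript of each result in a syncrecognize response."""
    transcripts = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Assumption: the first alternative is the most likely one.
            transcripts.append(alternatives[0].get("transcript", ""))
    return transcripts
```

An empty or missing `results` list yields an empty list rather than an error, which is convenient when the audio contains no recognizable speech.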

### Caution

- When recognizing Japanese speech, change `languageCode` in the body from `en-US` to `ja-JP`.
- If you want to send FLAC-encoded data, set `encoding` in the body to `FLAC`.
- The sample only calls `json.dumps()` on the recognition result, so you need to take steps to display recognized Japanese text correctly.
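
On the display issue: by default `json.dumps()` escapes non-ASCII characters as `\uXXXX`. This is a minimal sketch of one way around it, assuming Python 3. `format_response` is a hypothetical helper name.

```python
import json


def format_response(response):
    """Serialize a response so that Japanese text stays human-readable."""
    # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
    return json.dumps(response, ensure_ascii=False, indent=2)
```

Printing `format_response(response)` instead of `json.dumps(response)` shows the transcript as Japanese text.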

This sample processes a single input file, so if you want to recognize multiple files it seems better not to repeat the API Discovery and token acquisition for each one.
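
The idea can be sketched as building the service object once and reusing it for every file. In this Python sketch, `recognize_files`, `build_service`, and `recognize` are all hypothetical names: `build_service` stands for whatever performs discovery and token acquisition, and `recognize` for the per-file API call.

```python
def recognize_files(paths, build_service, recognize):
    """Recognize several files while doing the expensive setup only once."""
    # Discovery and token acquisition happen here, a single time.
    service = build_service()
    # The same service object is reused for every file.
    return {path: recognize(service, path) for path in paths}
```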

The authentication token seems to be updated fairly frequently, so you also need to handle token reacquisition. During testing a 401 suddenly started coming back (after roughly 15-30 minutes, in my experience). When I looked into it, the token had been updated.
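
One simple way to cope is to fetch a new token and retry once whenever a 401 comes back. This is a minimal sketch in Python. `call_with_token_refresh`, `send`, and `get_token` are hypothetical names: `send(token)` is assumed to return a `(status, body)` pair, and `get_token()` to return a fresh token.

```python
def call_with_token_refresh(send, get_token, token=None):
    """Call send(token); on HTTP 401, fetch a new token and retry once."""
    if token is None:
        token = get_token()
    status, body = send(token)
    if status == 401:
        # The token is no longer accepted: reacquire it and try again.
        token = get_token()
        status, body = send(token)
    # Return the token too, so the caller can reuse it for the next request.
    return status, body, token
```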

## Usability

Sorry that this is not quantitative:

- Recognition takes some time (about 2-4 seconds?).
- The recognition accuracy is quite high. Even with fairly loud noise (music playing near the microphone), speech is picked up properly. Getting this accuracy without configuring anything is amazing.
  - I want to try what happens when the noise is a human voice.
- I haven't tried the context-related options, so I'd like to use them in the future.
- The QuickStart says **Learn in 5 minutes**, but 5 minutes was completely impossible for me, which made me sad.