I recently wanted to build a program that uses voice recognition, and after hearing from several people that the Google Speech API was "accurate", I decided to try it. I ran into a few snags getting it to work, so here are my notes.
The Google Speech API is an API for using Google's speech recognition technology.
It takes an audio file as input and outputs plausible natural-language transcriptions of the audio, each with a confidence score. Version 2 seems to be the one currently running.
Registration on the Google Developer Console is required to use the Google Speech API.
Google Developer Console https://console.developers.google.com/project
The interface here seems to differ quite a bit depending on whether you have used Google APIs before, so please treat the steps below as a rough guide only.
Click "Use Google API".
Create a project by entering an appropriate project name and project ID. Click the created project name to move to the dashboard.
I assumed the Speech API would be listed under the "Google APIs" tab and searched for it there, but couldn't find it (this is where I got stuck).
http://eyepodtouch.net/?p=81
According to this article, you need to join the chromium-dev group to enable the Speech API from Japan.
https://groups.google.com/a/chromium.org/forum/?fromgroups#!forum/chromium-dev
Click "Join group to post" to join (note that you may start receiving mailing-list notifications).
Once you have joined the group, the Google Speech API can be found by searching, so click "Enable API" to enable it (in the second screenshot it has already been enabled, so the button shows the option to disable it instead).
From the "Credentials" tab, select "New Credentials" -> "API Key" -> "Server Key".
Enter an appropriate server key name and click "Create"; you will be given the API key (a character string), so make a note of this value.
Now you are ready to use the API.
The environment:
OS: Mac OS X 10.11
Microphone: MacBook Air (Mid 2013) built-in microphone
Here, let's try the API once, following the usage example in a certain GitHub repository (a little unnerving, since it describes itself as "reverse engineering"): https://github.com/gillesdemey/google-speech-v2 This page also contains what appears to be the API specification, so it is worth referring to.
$ cd
$ mkdir src/
$ cd src/
$ git clone https://github.com/gillesdemey/google-speech-v2.git
$ cd google-speech-v2/
$ curl -X POST \
--data-binary @'audio/hello (16bit PCM).wav' \
--header 'Content-Type: audio/l16; rate=16000;' \
'https://www.google.com/speech-api/v2/recognize?output=json&lang=en-us&key=<your api key>'
# {"result":[]}
# {"result":[{"alternative":[{"transcript":"hello Google","confidence":0.98762906},{"transcript":"hello Google Now"},{"transcript":"hello Google I"}],"final":true}],"result_index":0}
For WAV files, as shown on the above page,
# In an environment where Homebrew is available
$ brew install sox
$ rec --encoding signed-integer --bits 16 --channels 1 --rate 16000 test.wav
rec WARN formats: can't set sample rate 16000; using 44100
rec WARN formats: can't set 1 channels; using 2
# ↑ Despite the warnings above, the file actually created had a sampling rate of 16,000 Hz and 1 channel...
Input File : 'default' (coreaudio)
Channels : 2
Sample Rate : 44100
Precision : 32-bit
Sample Encoding: 32-bit Signed Integer PCM
In:0.00% 00:00:11.15 [00:00:00.00] Out:177k [ | ] Clip:0
# Recording from the built-in microphone; stop recording with Ctrl+C
It seems you can record using SoX ( http://sox.sourceforge.net/ ). (Of course, you don't have to use SoX; anything is fine as long as you can prepare a 16-bit PCM sound source.)
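If you want to double-check a recorded file's actual parameters without relying on sox's console output, Python's standard-library `wave` module can read the WAV header. A minimal sketch (the helper name is my own):

```python
import wave

def wav_params(path):
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(path, 'rb') as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()

# The API expects 16-bit PCM, i.e. a sample width of 2 bytes, and the
# sample rate must match the rate= value in the request's Content-Type header.
```

For example, `wav_params('test.wav')` lets you confirm what sox actually wrote, regardless of what its warnings claimed.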
Here, I recorded myself saying "Hello" as a test.
| Item | Value |
|---|---|
| File name | test.wav |
| Format | WAVE |
| Bit depth | 16 |
| Encoding | PCM |
| Channels | 2 |
For this test.wav
$ curl -X POST \
--data-binary @'test.wav' \
--header 'Content-Type: audio/l16; rate=16000;' \
'https://www.google.com/speech-api/v2/recognize?output=json&lang=ja-JP&key=<your api key>'
{"result":[]}
{"result":[{"alternative":[{"transcript":"Hello","confidence":0.95324326},{"transcript":"Hello"},{"transcript":"Hello"},{"transcript":"Hello"}],"final":true}],"result_index":0}
With this, I correctly obtained the text corresponding to the recorded audio.
Note that you must specify the sound source's MIME type and sampling rate in the request header. Also note that the lang parameter is subtly changed from 'en-us' to 'ja-JP' to handle Japanese.
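Since the rate in the Content-Type header has to match the file's actual sampling rate, it may be safer to build the header from the file itself rather than hard-coding 16000. A small sketch (the helper name is my own invention):

```python
import wave

def l16_headers(path):
    """Build request headers for a 16-bit PCM WAV, reading the
    sampling rate from the file instead of hard-coding it."""
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
    return {'Content-Type': 'audio/l16; rate={}'.format(rate)}
```

For a 16,000 Hz file this produces the same header as the curl example above.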
From here on is purely a bonus. Since this is plain HTTP, you can also send requests from a script.
Pepper Hands-on, Basic B: Speech Recognition (getting Pepper to learn your name using the Google Speech API) [Tech-Circle #7]
↑ I used this as a reference.
Prepare the audio you want to recognize in WAV format as before (test.wav), and place it in the same directory as the script below.
Below is Python 3 code; save it as, for example, test.py.
test.py
import sys
import json
import urllib.parse
import urllib.request

apikey = 'your api key'
endpoint = 'https://www.google.com/speech-api/v2/recognize'
query_string = {'output': 'json', 'lang': 'ja-JP', 'key': apikey}
url = '{0}?{1}'.format(endpoint, urllib.parse.urlencode(query_string))
headers = {'Content-Type': 'audio/l16; rate=16000'}

with open(sys.argv[1], 'rb') as f:
    voice_data = f.read()

request = urllib.request.Request(url, data=voice_data, headers=headers)
response = urllib.request.urlopen(request).read()

# The response is JSON spread over multiple lines;
# skip blank lines and lines with an empty result
for line in response.decode('utf-8').splitlines():
    if not line:
        continue
    res = json.loads(line)
    if res['result'] == []:
        continue
    print(res)
$ python test.py test.wav
{'result_index': 0, 'result': [{'final': True, 'alternative': [{'confidence': 0.95324326, 'transcript': 'Hello'}, {'transcript': 'Hello'}, {'transcript': 'Hello'}, {'transcript': 'Hello'}]}]}
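If all you need is the top hypothesis rather than the whole response, a small helper can pull it out of the multi-line JSON (a sketch based on the response format shown above; the function name is my own):

```python
import json

def best_transcript(raw):
    """Return (transcript, confidence) of the first non-empty result in
    the API's multi-line JSON response, or (None, None) if none matched.
    The first line is typically an empty {"result":[]} stub, so skip it."""
    for line in raw.splitlines():
        if not line.strip():
            continue
        res = json.loads(line)
        if res.get('result'):
            alt = res['result'][0]['alternative'][0]
            return alt['transcript'], alt.get('confidence')
    return None, None
```

Note that only the first alternative usually carries a confidence value, which is why `alt.get('confidence')` is used rather than direct indexing.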
The accuracy really does seem high.
Even long sentences, or speech that is a little hard to make out, get picked up reasonably well.
The downsides are that it takes a long time to execute and the accuracy varies from file to file, so it seems a little difficult to use in applications that demand real-time performance and consistent accuracy. (I suspect the recognizer running on smartphones is faster and more accurate.)
Even allowing for that, I thought it was genuinely amazing that this can be used for free.
If I may be greedy: the registration interface is completely different from reference articles from only about a year ago, you can't use the API at all without joining a mailing list, and I couldn't find an official specification. It's all rather rough, so I hope it gets improved...
References:
On the use of Google's Speech Recognition API Version 2 | Similar Sounds
How to use Google Speech API ver.2
Using Google Speech API from Ruby - Qiita
How to get API key for Google Speech API | Eyepod touch