[JAVA] Speech synthesis and speech recognition by Microsoft Project Oxford

Since May 1, 2015, Microsoft has offered a set of machine learning APIs under a project called Project Oxford.

[Face, image, and voice recognition APIs available from Microsoft's Project Oxford](http://jp.techcrunch.com/2015/05/01/20150430microsofts-project-oxford-gives-developers-access-to-facial-image-and-speech-recognition-apis/)

This article covers the Speech API, which performs speech synthesis and speech recognition.

The reason is that, while quite a few services offer speech synthesis, very few offer speech recognition as an API. Most come as Android / iOS SDKs, and those that can be used on the web are browser-dependent. Google also has a Speech API, but I could not find any official documentation for it, and its limit of 50 requests per day is quite strict (as of July 2015; paying does not appear to raise it).

Project Oxford is in public beta as of July 2015; for now it is free and can be used without restrictions (Japanese is also supported). It includes APIs other than speech, such as face recognition, so please give them a try.

Environment setup

First, set up the environment for using the Speech API. A Microsoft Azure account is required, so sign up for one.

Microsoft Azure

The sign-up page describes a one-month trial, but since the Speech API used here is free, it should probably keep working after the first month.

Once you have created an account, open the portal. The Speech API is obtained through the Marketplace, so press the "New" button at the bottom left and select Marketplace.

From here, select the Speech API. The Face API and others are listed as well, so the other Project Oxford APIs can presumably be purchased the same way (currently free).

After the purchase, press the "Manage" button at the bottom to view the keys required to access the API.

At this point, the environment setup is complete.

Using the API

Like other speech recognition services, the Speech API provides SDKs, but it can also be called as a Web API. You can download the SDK that suits your environment and use case from the following.

Software Development Kit (SDK)

The official documentation is below.

This article explains how to use the Web API, with sample code in Python 3 (any language that can send HTTP requests, such as JavaScript / Ruby / PHP / Java, will work). Making HTTP requests with Python's standard library is rather cumbersome, so the samples use requests. For those who find even that tedious and just want to get going quickly, I made a simple library, linked below.

icoxfog417/pyoxford

Authentication

First, authenticate with the keys obtained during the environment preparation. There are two keys: the primary key is used as client_id and the secondary key as client_secret (the secret token). Below is sample code for authentication (excerpted from the repository above).

    # Module-level imports needed by the method excerpted below.
    import urllib.parse

    import requests


    def authorize(self, client_id, client_secret):
        # Token endpoint for the Speech API.
        url = "https://oxford-speech.cloudapp.net//token/issueToken"

        headers = {
            "Content-type": "application/x-www-form-urlencoded"
        }

        # Credentials are sent as a form-encoded body.
        params = urllib.parse.urlencode(
            {"grant_type": "client_credentials",
             "client_id": client_id,
             "client_secret": client_secret,
             "scope": "https://speech.platform.bing.com"}
        )

        response = requests.post(url, data=params, headers=headers)
        if response.ok:
            _body = response.json()
            return _body["access_token"]
        else:
            # Raises requests.HTTPError for 4xx / 5xx responses.
            response.raise_for_status()

The access token obtained here (_body["access_token"]) is used for the synthesis and recognition requests that follow.
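To make the request body concrete, the following is a minimal sketch of what `urllib.parse.urlencode` produces for the token request, and of the header the later calls build from the token. The helper names `build_token_request_body` and `bearer_header`, as well as the key and token values, are placeholders introduced for illustration; no network call is made.

```python
import urllib.parse


def build_token_request_body(client_id, client_secret):
    """Return the form-encoded body sent to the token endpoint."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://speech.platform.bing.com",
    })


def bearer_header(token):
    """Return the Authorization header used by the synthesis / recognition calls."""
    return {"Authorization": "Bearer " + token}


# Placeholder credentials, for illustration only.
body = build_token_request_body("my-primary-key", "my-secondary-key")
print(body)
print(bearer_header("dummy-token"))
```

Note that the scope URL is percent-encoded in the body, which is why the keys are passed through `urlencode` rather than concatenated by hand.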

Speech synthesis

Now let's try speech synthesis. In the code below, the argument text is the string to synthesize, and token is the access token obtained earlier.

    def text_to_speech(self, text, token, lang="en-US", female=True):
        # SSML template: {0}=language, {1}=gender, {2}=voice name, {3}=text.
        template = """
        <speak version='1.0' xml:lang='{0}'>
            <voice xml:lang='{0}' xml:gender='{1}' name='{2}'>
                {3}
            </voice>
        </speak>
        """

        url = "https://speech.platform.bing.com/synthesize"
        headers = {
            "Content-type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
            "Authorization": "Bearer " + token,
            "X-Search-AppId": "07D3234E49CE426DAA29772419F436CA",
            "X-Search-ClientID": "1ECFAE91408841A480F00935DC390960",
            "User-Agent": "OXFORD_TEST"
        }
        # The voice name is fixed to the en-US female voice in this sample.
        name = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"
        data = template.format(lang, "Female" if female else "Male", name, text)

        # Encode explicitly so that non-ASCII text (e.g. Japanese) can be sent.
        response = requests.post(url, data=data.encode("utf-8"), headers=headers)

        if response.ok:
            # The body is the synthesized audio as raw bytes.
            return response.content
        else:
            # raise_for_status() raises by itself; no extra `raise` is needed.
            response.raise_for_status()

As the template above shows, the request is sent in SSML, an XML format for speech. The docomo site explains SSML in detail. Synthesized audio is limited to 15 seconds. The result is returned as binary data, so saving it as an audio file (.wav, etc.) lets you listen to the synthesized speech.
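As a rough sketch of the two practical points here, the helpers below escape the text before it goes into the SSML template, save the returned bytes to a file, and measure the audio length against the 15-second limit. The function names (`build_ssml`, `save_wav`, `wav_duration_seconds`, `make_silence`) are hypothetical, and the duration check assumes the response carries a standard RIFF/WAV header with correct size fields, which I have not verified against the service. A locally generated silent clip stands in for the bytes returned by `text_to_speech`.

```python
import io
import wave
from xml.sax.saxutils import escape

VOICE_NAME = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"


def build_ssml(text, lang="en-US", gender="Female", name=VOICE_NAME):
    """Fill the SSML template, escaping &, < and > in the text."""
    template = (
        "<speak version='1.0' xml:lang='{0}'>"
        "<voice xml:lang='{0}' xml:gender='{1}' name='{2}'>{3}</voice>"
        "</speak>"
    )
    return template.format(lang, gender, name, escape(text))


def save_wav(content, path):
    """Write audio bytes (e.g. the return value of text_to_speech) to a file."""
    with open(path, "wb") as f:
        f.write(content)


def wav_duration_seconds(content):
    """Return the playback length of RIFF/WAV bytes in seconds."""
    with wave.open(io.BytesIO(content), "rb") as w:
        return w.getnframes() / w.getframerate()


def make_silence(seconds, rate=16000):
    """Build a silent 16-bit mono WAV in memory as stand-in audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


ssml = build_ssml("Tom & Jerry")
audio = make_silence(2)  # stand-in for the bytes returned by the API
duration = wav_duration_seconds(audio)
print(ssml)
print("within 15 s limit:", duration <= 15)
```

With a real response, `save_wav(content, "hello.wav")` would produce a file that an ordinary audio player can open.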

Other detailed parameters are as follows.

Speech recognition

Next, let's try speech recognition, feeding in the synthesized audio (binary) as is. The API seems to be designed for converting speech to text continuously while it is being recognized; the limits appear to be 10 seconds per request and 14 seconds in total (per requestid?).

    def speech_to_text(self, binary, token, lang="en-US", samplerate=8000, scenarios="ulm"):
        data = binary
        # Metadata about the request goes in the query string.
        params = {
            "version": "3.0",
            "appID": "D4D52672-91D7-4C74-8AD8-42B1D98141A5",
            "instanceid": "1ECFAE91408841A480F00935DC390960",
            "requestid": "b2c95ede-97eb-4c88-81e4-80f32d6aee54",
            "format": "json",
            "locale": lang,
            "device.os": "Windows7",
            "scenarios": scenarios,
        }

        url = "https://speech.platform.bing.com/recognize/query?" + urllib.parse.urlencode(params)
        headers = {"Content-type": "audio/wav; samplerate={0}".format(samplerate),
                   "Authorization": "Bearer " + token,
                   "X-Search-AppId": "07D3234E49CE426DAA29772419F436CA",
                   "X-Search-ClientID": "1ECFAE91408841A480F00935DC390960",
                   "User-Agent": "OXFORD_TEST"}

        # The audio itself is sent as the request body.
        response = requests.post(url, data=data, headers=headers)

        if response.ok:
            # Take the first (most probable) candidate.
            result = response.json()["results"][0]
            return result["lexical"]
        else:
            # raise_for_status() raises by itself; no extra `raise` is needed.
            response.raise_for_status()

The request is slightly unusual in that it mixes GET and POST styles: the information about the audio goes in the query parameters, while the audio itself goes in the request body.
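The query-string half of that request can be built and inspected without touching the network. The sketch below reuses the parameter values from the sample above; generating a fresh `requestid` with `uuid.uuid4()` for every call is my own assumption (the sample hard-codes one), and `build_recognize_url` is a hypothetical helper name.

```python
import urllib.parse
import uuid

RECOGNIZE_ENDPOINT = "https://speech.platform.bing.com/recognize/query"


def build_recognize_url(lang="en-US", scenarios="ulm", request_id=None):
    """Return the recognition URL with its metadata in the query string."""
    params = {
        "version": "3.0",
        "appID": "D4D52672-91D7-4C74-8AD8-42B1D98141A5",
        "instanceid": "1ECFAE91408841A480F00935DC390960",
        # Assumption: a new ID per request; the original sample uses a fixed one.
        "requestid": request_id or str(uuid.uuid4()),
        "format": "json",
        "locale": lang,
        "device.os": "Windows7",
        "scenarios": scenarios,
    }
    return RECOGNIZE_ENDPOINT + "?" + urllib.parse.urlencode(params)


url = build_recognize_url(lang="ja-JP")
query = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(url)
print(query["locale"], query["requestid"])
```

The audio bytes would then be passed separately as `data=` in `requests.post(url, data=binary, headers=headers)`, exactly as in the sample above.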

You can also optionally specify the following:

The response contains several candidate strings in descending order of probability. They are stored as an array in results, where lexical is the recognized string and confidence is its confidence score.

That concludes the explanation. Speech synthesis and recognition are easy to use, so please give them a try.
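For illustration, here is a small sketch that picks the most confident candidate from such a response. The sample dictionary is made up: it only mirrors the `results` / `lexical` / `confidence` fields described above, and whether the real service returns `confidence` as a string or a number is an assumption handled here with `float()`. `best_result` is a hypothetical helper name.

```python
def best_result(response_json):
    """Return the candidate with the highest confidence, or None if empty."""
    results = response_json.get("results", [])
    if not results:
        return None
    # float() accepts both numeric and string confidence values.
    return max(results, key=lambda r: float(r["confidence"]))


# Made-up response data shaped like the fields described above.
sample = {
    "results": [
        {"lexical": "hello world", "confidence": "0.93"},
        {"lexical": "hello word", "confidence": "0.41"},
    ]
}

best = best_result(sample)
print(best["lexical"], float(best["confidence"]))

# List every candidate with its score.
for candidate in sample["results"]:
    print(candidate["lexical"], candidate["confidence"])
```

Because the service already sorts candidates by probability, `results[0]` gives the same answer in the normal case; the explicit `max` merely avoids relying on that order.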
