Background

I recently got into American football (NFL), but I can't understand spoken English... Even if I can't follow the audio, maybe I could decipher it if it were written down. With that in mind, I tried transcribing player interviews with Raspberry Pi 3 × Julius × Watson (Speech to Text).

What I want to do

The overall flow looks like this. Reference: Getting robots to listen: Using Watson’s Speech to Text service

Environment

Raspberry Pi 3
-- USB microphone (SANWA SUPPLY MM-MCUSB16 USB microphone)
-- julius 4.3.1 (open source speech recognition library)

watson (Speech to Text)
-- watson-developer-cloud-0.23.0 (Python library for watson)
-- ws4py (WebSocket library)

Prerequisites

The following are assumed to be ready. For reference, the articles I consulted are listed under each item.

-- Enable the microphone on Raspberry Pi 3
   - Easy to do! Conversation with Raspberry pi using speech recognition and speech synthesis
   - Try voice recognition and voice synthesis with Raspberry Pi 2
-- Julius installation on Raspberry Pi 3
   - Voice recognition by Julius-Utilization of domestic open source library
-- User registration to watson (it seems that all services can be used free of charge for one month after registration)

Procedure

1. Talk to Raspberry Pi 3 using Julius (images ①②)
2. Voice recording (image ③)
3. Connect from Raspberry Pi 3 to watson (Speech to Text) (image ④)
4. Transcribe a YouTube player interview with Raspberry Pi 3 x watson (image ⑤)

■ Talk to Raspberry Pi 3 with Julius

Julius can apparently use either a reading file or a grammar file to speed up recognition. After trying both, I decided to use a grammar file this time.

For the comparison results, see "Raspberry Pi 3 x Julius (reading file and grammar file)".

1.1 Overview of voice analysis processing

If you start Julius in module mode (*), the recognition result is returned as XML. If you say "Start Watson", you get the following XML.

<RECOGOUT>
  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
  </SHYPO>
</RECOGOUT>
<RECOGOUT>
  <SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">
    <WHYPO WORD="Watson started" CLASSID="Watson started" PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>

So the script parses the XML and, for each recognized word, runs the matching action. (It isn't elegant; the words are simply hard-coded ...)

#Judge and process voice
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson
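
If more words are added later, the chain of if statements could be replaced with a lookup table. The following is only a sketch, written for Python 3 (the listings in this article are Python 2). HANDLERS, on_raspberry_pi, on_watson, and decide are hypothetical names that are not part of the original script.

```python
# Sketch (Python 3): table-driven variant of decision_word.
# All names below are hypothetical, chosen for illustration only.

def on_raspberry_pi():
    print("Yes. What is it?")
    return False  # keep listening with Julius

def on_watson():
    print("Roger that. prepare.")
    return True   # hand over to Watson

# Map each recognized word to the function that handles it.
HANDLERS = {
    "Raspberry pi": on_raspberry_pi,
    "Watson": on_watson,
}

def decide(xml_list):
    """Return True if any recognized word asks to start Watson."""
    start_watson = False
    for word in xml_list:
        handler = HANDLERS.get(word)
        if handler is not None and handler():
            start_watson = True
    return start_watson
```

Calling decide({"Watson": "0.791"}) prints the reply and returns True, which is the same contract as decision_word above.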

1.2 Start Julius server and connect to Julius server from client side

The Julius server is started as a subprocess.

#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress log output with the -nolog option
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    time.sleep(3.0)
    return p

#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()

    return sock
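
create_socket() builds sock_file with makefile() but returns only the raw socket. One reason to read through the file object instead is that a line split across two recv() calls is reassembled automatically. The following is a minimal sketch under some assumptions: Python 3, UTF-8 output from Julius, and a socketpair() standing in for the real Julius connection. iter_lines and demo are hypothetical names.

```python
import socket

def iter_lines(sock):
    """Yield complete text lines from a connected socket.

    makefile() buffers internally, so a line that arrives in several
    pieces is still returned as one string.
    """
    # Assumption: the peer sends UTF-8 text terminated by "\n".
    with sock.makefile("r", encoding="utf-8", newline="\n") as sock_file:
        for line in sock_file:
            yield line

def demo():
    # socketpair() stands in for the Julius connection in this sketch.
    server, client = socket.socketpair()
    # Send one line in two pieces to imitate a split across recv() calls.
    server.sendall(b"<RECOG")
    server.sendall(b"OUT>\n</RECOGOUT>\n")
    server.close()  # EOF ends the iteration
    lines = list(iter_lines(client))
    client.close()
    return lines
```

With a real connection, iter_lines(sock) would be given the socket returned by create_socket().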

1.3 Voice analysis (XML analysis)

As mentioned above, Julius returns XML, so extract the span from <RECOGOUT> to </RECOGOUT> and parse it. (*. If an <s> tag is present, an error occurs during XML parsing, so processing that removes <s> and </s> is included.)

#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True
    return xml,xml_buff,final

#Remove <s> tags (handled because an error occurred during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str

#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the word of the recognition result
    #Save results in dictionary
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list
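
To try the parsing step without a microphone, the sample XML from 1.1 can be fed directly to ElementTree. This is a sketch for Python 3, assuming the silence markers appear as literal <s> and </s> inside attribute values, which is what makes the raw output ill-formed. strip_sentence_tags and words_with_confidence are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Sample taken from the module-mode output shown in 1.1.
SAMPLE = """<RECOGOUT>
  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
  </SHYPO>
</RECOGOUT>
"""

def strip_sentence_tags(text):
    # A literal "<" inside an attribute value is not well-formed XML,
    # so the markers are removed before parsing.
    return text.replace("<s>", "").replace("</s>", "")

def words_with_confidence(xml_text):
    """Map each recognized word to its confidence measure (CM)."""
    root = ET.fromstring(strip_sentence_tags(xml_text))
    result = {}
    for whypo in root.iter("WHYPO"):
        word = whypo.get("WORD")
        if word:  # skip entries left empty after stripping the markers
            result[word] = float(whypo.get("CM"))
    return result
```

Unlike parse_recogout, this version converts CM to a float so it can be compared against a threshold if needed.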

1.4 Overall

It's a little long, but the whole thing from 1.1 to 1.3 looks like this.

import cStringIO
import logging
import socket
import subprocess
import time
import traceback
from itertools import izip
from xml.etree.ElementTree import fromstring

#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True
    return xml,xml_buff,final

#Remove <s> tags (handled because an error occurred during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str

#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the word of the recognition result
    #Save results in dictionary
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list

#Judge and process voice
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson

#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress logging with the -nolog option
    # (later, write the log to a file with the -logfile option etc.)
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    time.sleep(3.0)
    return p

#Disconnect julius server
def kill_process(julius):
    logging.debug("kill_process")
    julius.kill()
    time.sleep(3.0)

#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()

    return sock

#Close connection with julius
def close_socket(sock):
    logging.debug("close_socket")
    sock.close()

#Main processing
def main():
    #Start julius server
    julius = invoke_julius()
    #Connect to julius
    sock = create_socket()

    julius_listening = True
    bufsize = 4096

    xml_buff = ""
    xml_in = False
    xml_final = False
    watson = False

    while julius_listening:
        #Get analysis result from julius
        data = cStringIO.StringIO(sock.recv(bufsize))

        #Get one line from the analysis result
        line = data.readline()

        while line:
            #Only the lines showing the speech analysis result are extracted and processed.
            #Extract and process only the RECOGOUT tag.
            xml_in, xml_buff, xml_final = extract_xml('RECOGOUT', xml_in, xml_buff, line)
            if xml_final:
                #Analyze xml
                logging.debug(xml_buff)
                xml_data = fromstring(xml_buff)
                watson = decision_word( parse_recogout(xml_data))
                xml_final = False
                #If the result is "Watson", go to voice recognition
                if watson:
                    julius_listening = False #Julius finished
                    break

            #Get one line from the analysis result
            line = data.readline()

    #Close socket
    close_socket(sock)
    #Disconnect julius
    #Watson's "Speech to Text" records with arecord, so Julius is stopped here
    #(otherwise the microphone device collides ...)
    kill_process(julius)
    if watson:
        #If "Watson" was spoken, execute the processes ③ and ④
        speechToText()

def initial_setting():
    #Log settings
    logging.basicConfig(filename='websocket_julius2.log', filemode='w', level=logging.DEBUG)
    logging.debug("initial_setting")

if __name__ == "__main__":
    try:
        #Initialization process
        initial_setting()
        #Main processing
        main()

    except Exception as e:
        print "error occurred", e, traceback.format_exc()
    finally:
        print "websocket_julius2...end"

■ Voice recording

Start the voice recording process (the arecord command) in a separate thread. Each chunk of recorded binary data is sent to watson as soon as it is read, so that the voice can be converted to text in real time. (*. The data exchange with watson is described later.)

def opened(self):
    self.stream_audio_thread = threading.Thread(target=self.stream_audio)
    self.stream_audio_thread.start()

#Start recording process
def stream_audio(self):
    # Hide messages with the -q option
    reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
    p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
    print 'Ready. Please voice'
    while self.listening:
        data = p.stdout.read(1024)
        try:
            #Pass binary data to watson
            self.send(bytearray(data), binary=True)
        except ssl.SSLError:
            pass
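
The read loop can be tried without a microphone by substituting another process for arecord. The sketch below is written for Python 3 and makes two assumptions: the recording is mono (arecord's default when no channel option is given), and FAKE_RECORDER, a child Python process that writes 2500 zero bytes, stands in for the recorder. iter_chunks and chunk_duration_ms are hypothetical helpers. Unlike the loop above, this one stops when the recorder's output ends.

```python
import subprocess
import sys

SAMPLE_RATE = 16000    # matches "-r 16000"
BYTES_PER_SAMPLE = 2   # "-f S16_LE" is 16-bit audio
CHANNELS = 1           # assumption: mono recording

def chunk_duration_ms(chunk_size):
    """Length of audio, in milliseconds, carried by chunk_size bytes."""
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    return 1000.0 * chunk_size / bytes_per_second

def iter_chunks(cmd, chunk_size=1024):
    """Run cmd and yield its stdout in chunk_size pieces until EOF."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while True:
            data = proc.stdout.read(chunk_size)
            if not data:  # an empty read means the recorder has exited
                break
            yield data
    finally:
        proc.stdout.close()
        proc.terminate()
        proc.wait()

# Stand-in for arecord: a child process that writes 2500 bytes and exits.
FAKE_RECORDER = [sys.executable, "-c",
                 "import sys; sys.stdout.buffer.write(b'\\0' * 2500)"]
```

Under these assumptions a 1024-byte chunk carries 32 ms of audio, so the client sends roughly 31 frames per second.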

■ Connect from Raspberry Pi 3 to watson (Speech to Text)

Use the WebSocket version of Speech to Text to convert voice to text in real time. For more about Speech to Text, see also "I tried Watson Speech to Text".

Implemented with reference to this sample source: Getting robots to listen: Using Watson’s Speech to Text service

3.1 Connect to watson (Speech to Text)

Connect to watson using the library for watson (watson-developer-cloud-0.23.0)

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        self.listening = False

        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except:
            print "Failed to open WebSocket."
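
base64.encodestring belongs to the Python 2 era and was removed in Python 3.9. If the client is ported to Python 3, the header could be built as follows. This is only a sketch, and basic_auth_header is a hypothetical helper name.

```python
import base64

def basic_auth_header(username, password):
    """Build the (name, value) pair for HTTP Basic authentication."""
    # b64encode works on bytes and, unlike encodestring, adds no newlines.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return ("Authorization", "Basic " + token.decode("ascii"))
```

The returned tuple has the same shape as the entry passed in the headers list above.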

3.2 Connect to watson with webSocket.

# websocket(Connect)
def opened(self):
    self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
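
The start message is plain JSON, so it can also be generated with json.dumps rather than typed by hand, which avoids quoting mistakes when a value changes. A sketch, where build_start_message is a hypothetical helper and the parameter values are copied from the string above:

```python
import json

def build_start_message(rate=16000, inactivity_timeout=10):
    """Serialize the "start" action sent right after the socket opens."""
    return json.dumps({
        "action": "start",
        # The rate must match the "-r" value given to arecord.
        "content-type": "audio/l16;rate=%d" % rate,
        "continuous": True,
        "inactivity_timeout": inactivity_timeout,
        "interim_results": True,
    })
```

In opened(), self.send(build_start_message()) would then replace the literal string.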

3.3 watson voice recognition

The output (audio data) of the arecord command running in the thread described above is sent to watson. It's a little long, but putting together "2. Voice recording" and "3. Connect from Raspberry Pi 3 to watson (Speech to Text)" looks like this.

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        self.listening = False

        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except:
            print "Failed to open WebSocket."

    # websocket(Connect)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
        self.stream_audio_thread = threading.Thread(target=self.stream_audio)
        self.stream_audio_thread.start()

    #Start recording process
    def stream_audio(self):
        while not self.listening:
            time.sleep(0.1)

        # Hide messages with the -q option
        reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
        p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
        print 'Ready. Please voice'
        while self.listening:
            data = p.stdout.read(1024)
            try:
                self.send(bytearray(data), binary=True)
            except ssl.SSLError:
                pass

■ Transcribe YouTube player interviews with Raspberry Pi 3 x watson

4.1 Implementation of received_message

When connected over WebSocket, it seems that the analysis result from watson can be received in the received_message event.

# websocket(Receive message)
def received_message(self, message):
    print message

4.2 watson analysis results

The analysis result seems to be returned as a JSON object.
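
Instead of printing the raw frame, the text can be pulled out of the JSON. The article does not show the payload, so the field names used below ("state", "error", "results", "alternatives", "transcript", "final") are assumptions based on the response format documented for the service, and SAMPLE_FRAME is an invented illustration, not real output. handle_message is a hypothetical helper, written for Python 3.

```python
import json

def handle_message(raw):
    """Return a (kind, value) pair for one text frame from the service."""
    msg = json.loads(raw)
    if "state" in msg:   # e.g. the service reporting it is ready for audio
        return ("state", msg["state"])
    if "error" in msg:
        return ("error", msg["error"])
    for result in msg.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Interim results are revised until a final one arrives.
            kind = "final" if result.get("final") else "interim"
            return (kind, alternatives[0]["transcript"].strip())
    return ("other", None)

# Invented example frame, shaped according to the assumptions above.
SAMPLE_FRAME = ('{"results": [{"alternatives": [{"transcript": '
                '"we played hard "}], "final": false}], "result_index": 0}')
```

A "state" frame is also a natural place to set self.listening to True, which the recording thread in 3.3 waits for.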

In this way, I was able to convert the voice into text in real time.

Postscript (2017/4/16): I made a video like this. https://youtu.be/IvWaHISF6nY

Finally

My impression is that speech is not recognized well when several people are talking or when there is music in the background. Still, I thought it was simply amazing that the voice became text in real time. I want to keep playing with speech recognition.
