I recently got into American football (NFL). However, I can't understand English. Even if I can't follow the audio, maybe I could decipher it once it is written down? With that thought, I tried transcribing player interviews with Raspberry Pi 3 × Julius × Watson (Speech to Text).

The overall setup looks like the one in "Getting robots to listen: Using Watson’s Speech to Text service".
The following is assumed to be ready. For reference, I list the sites I referred to.
- Enable the microphone on Raspberry Pi 3
  - Easy to do! Conversation with Raspberry pi using speech recognition and speech synthesis
  - Try voice recognition and voice synthesis with Raspberry Pi 2
- Install Julius on Raspberry Pi 3
  - Voice recognition by Julius-Utilization of domestic open source library
- Register a Watson user account (it seems that all services can be used free of charge for one month after registration)
Julius seems to support both a reading file and a grammar file to speed up recognition. After trying both, I decided to use a grammar file this time.
See Raspberry Pi 3 x Julius (reading file and grammar file) for the verification results.
If you start Julius in module mode (*), the recognition results are returned as XML. If you say "Start Watson", you get the following XML.
<RECOGOUT>
  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
  </SHYPO>
</RECOGOUT>
<RECOGOUT>
  <SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">
    <WHYPO WORD="Watson started" CLASSID="Watson started" PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>
So, parse the XML and write the processing to run for each spoken word. (It's not elegant, just hard-coded ...)
#Judge and process voice
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson
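As an alternative to the hard-coded branches, the word-to-action mapping could be kept in a table. This is only a sketch: `RESPONSES` and `decide` are hypothetical names that do not appear in the code above, and the function returns the replies instead of printing them so that it runs on both Python 2 and Python 3.

```python
# Hypothetical table: recognized word -> (reply text, whether to hand over to Watson).
# The two entries mirror the hard-coded branches in decision_word.
RESPONSES = {
    u"Raspberry pi": (u"Yes. What is it?", False),
    u"Watson": (u"Roger that. prepare.", True),
}


def decide(xml_list, responses=RESPONSES):
    """Return (replies, watson) for a {word: confidence} dictionary."""
    replies = []
    watson = False
    for word in xml_list:
        if word in responses:
            reply, start_watson = responses[word]
            replies.append(reply)
            watson = watson or start_watson
    return replies, watson
```

With this shape, adding a new voice command means adding one line to the table rather than another `if`.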
The Julius server is now started as a subprocess
#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress log output with the -nolog option
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    time.sleep(3.0)
    return p
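The fixed `time.sleep(3.0)` assumes Julius is ready to accept connections within three seconds. A possible alternative is to retry the connection until it succeeds. The following is a minimal sketch, not part of the original code: `connect_with_retry`, `timeout`, and `interval` are names I made up for illustration.

```python
import socket
import time


def connect_with_retry(host, port, timeout=10.0, interval=0.2):
    """Keep trying to open a TCP connection until it succeeds or `timeout` elapses.

    Returns the connected socket; re-raises the last error after the deadline.
    """
    deadline = time.time() + timeout
    while True:
        try:
            return socket.create_connection((host, port))
        except socket.error:
            if time.time() >= deadline:
                raise
            time.sleep(interval)
```

Because it returns the connected socket, a call such as `connect_with_retry(JULIUS_HOST, JULIUS_PORT)` could stand in for both the sleep and the plain `sock.connect(...)`.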
#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500
#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()
    return sock
As mentioned above, Julius returns XML, so extract the <RECOGOUT> tag. Because <s> and </s> cause an error during XML parsing, processing to remove them is also included.
#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line 
        final = True
    else:
        if xml_in:
            xml_buff += escape(line) 
            xml = True
                
    return xml,xml_buff,final
# Remove <s> and </s> tags (they caused an error during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str
    
#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the word of the recognition result
    #Save results in dictionary
    word_list = []
    score_list = []
    xml_list = {} 
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD") 
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list
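To see the parsing step in isolation, here is a small self-contained sketch that runs on the second `<RECOGOUT>` block from the example output earlier in this post. `words_with_confidence` is a hypothetical helper, not the function above: it converts `CM` to a float and leaves out the `[s]` filter.

```python
import xml.etree.ElementTree as ET

# Second <RECOGOUT> block from the example output shown earlier in this post.
SAMPLE = (
    '<RECOGOUT>\n'
    '  <SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">\n'
    '    <WHYPO WORD="Watson started" CLASSID="Watson started" '
    'PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>\n'
    '  </SHYPO>\n'
    '</RECOGOUT>\n'
)


def words_with_confidence(xml_text):
    """Return {WORD: CM as float} for every WHYPO element in one RECOGOUT block."""
    # Same workaround as escape(): a literal "<" inside an attribute value is not valid XML.
    cleaned = xml_text.replace("<s>", "").replace("</s>", "")
    root = ET.fromstring(cleaned)
    return {w.get("WORD"): float(w.get("CM")) for w in root.iter("WHYPO")}
```

Calling `words_with_confidence(SAMPLE)` gives a dictionary keyed by the recognized word, which is the same shape `decision_word` expects.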
It's a little long, but the whole thing from 1.1 to 1.3 looks like this.
#Imports used by the code below (not shown in the original listing; assumed from the names it uses)
import cStringIO
import logging
import socket
import subprocess
import time
import traceback
from itertools import izip
from xml.etree.ElementTree import fromstring
#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line 
        final = True
    else:
        if xml_in:
            xml_buff += escape(line) 
            xml = True
                
    return xml,xml_buff,final
# Remove <s> and </s> tags (they caused an error during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str
    
#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the word of the recognition result
    #Save results in dictionary
    word_list = []
    score_list = []
    xml_list = {} 
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD") 
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list
#Judge and process voice
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson
#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500
#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress log output with the -nolog option
    # Eventually, write the log to a file with the -logfile option or similar
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    time.sleep(3.0)
    return p
#disconnect julius server
def kill_process(julius):
    logging.debug("kill_process")
    julius.kill()
    time.sleep(3.0)
#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()
    return sock
#Close connection with julius
def close_socket(sock):
    logging.debug("close_socket")
    sock.close()
#Main processing
def main():
    #Start julius server
    julius = invoke_julius()
    #Connect to julius
    sock = create_socket()
    julius_listening = True
    bufsize = 4096
    xml_buff = ""
    xml_in = False
    xml_final = False
    watson = False
    while julius_listening:            
        #Get analysis result from julius
        data = cStringIO.StringIO(sock.recv(bufsize))
        #Get one line from the analysis result
        line = data.readline()
        while line:
            #Only the line showing the speech analysis result is extracted and processed.
            #Extract and process only the RECOGOUT tag.
            xml_in, xml_buff, xml_final = extract_xml('RECOGOUT', xml_in, xml_buff, line)
            if xml_final:
                #Parse the XML
                logging.debug(xml_buff)
                xml_data = fromstring(xml_buff)
                watson = decision_word( parse_recogout(xml_data))
                xml_final = False
                #If the result is "Watson", move on to speech recognition with Watson
                if watson:
                    julius_listening = False #Julius finished
                    break
            #Get one line from the analysis result
            line = data.readline()
    #Close socket
    close_socket(sock)
    #Disconnect julius
    #Watson's Speech to Text records with arecord, so Julius is stopped first
    #(otherwise the two would compete for the microphone device)
    kill_process(julius)
    if watson:
        #If "Watson" was recognized, run processes ③ and ④
        speechToText()
def initial_setting():
    #Log settings
    logging.basicConfig(filename='websocket_julius2.log', filemode='w', level=logging.DEBUG)
    logging.debug("initial_setting")
if __name__ == "__main__":
    try:
        #Initialization process
        initial_setting()
        #Main processing
        main()
    except Exception as e:
        print "error occurred", e, traceback.format_exc()
    finally:
        print "websocket_julius2...end"
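One thing to watch in `main()`: each `recv` result is split into lines on its own, but TCP is a byte stream, so a chunk boundary can fall in the middle of a line and `startswith("<RECOGOUT>")` would then miss it. A possible way around this is to carry the incomplete tail over to the next chunk. This is a sketch under that assumption; `iter_lines` is a name I made up, and it expects text chunks (on Python 3 the bytes from `recv` would need decoding first).

```python
def iter_lines(chunks):
    """Yield complete newline-terminated lines from an iterable of text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Emit every complete line; keep the unfinished tail for the next chunk.
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            yield line + "\n"
    if pending:
        yield pending  # trailing data without a final newline
```

On Python 2, something like `iter_lines(iter(lambda: sock.recv(bufsize), ""))` could feed the same `extract_xml` call line by line.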
Start the voice recording process (running the arecord command) in a separate thread. Each chunk of binary data is sent to Watson as soon as it is recorded, so that the voice can be converted to text in real time. (* The data exchange with Watson is described later.)
def opened(self):
    self.stream_audio_thread = threading.Thread(target=self.stream_audio)
    self.stream_audio_thread.start() 
#Start recording process
def stream_audio(self):
    # Suppress messages with the -q option
    reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
    p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
    print 'Ready. Please voice'
    while self.listening:
        data = p.stdout.read(1024)
        try: 
            self.send(bytearray(data), binary=True)  # Pass binary data to watson
        except ssl.SSLError: pass
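The read loop above keeps going as long as `self.listening` is true, even if arecord has exited and `read` starts returning empty data. A small sketch of a chunk reader that stops at end of stream follows. `iter_chunks` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for `p.stdout` so the example runs without a microphone.

```python
import io


def iter_chunks(stream, size=1024):
    """Yield blocks of up to `size` bytes from a binary stream until EOF."""
    while True:
        data = stream.read(size)
        if not data:  # empty read: the producer (e.g. arecord) has exited
            break
        yield data


# Stand-in for p.stdout: 2500 bytes of silence.
fake_stdout = io.BytesIO(b"\x00" * 2500)
sizes = [len(chunk) for chunk in iter_chunks(fake_stdout)]
```

In `stream_audio`, each yielded chunk could be passed to `self.send(..., binary=True)` in the same way as before.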
Use the WebSocket version of Speech to Text to convert voice to text in real time. For Speech to Text, please also refer to "I tried Watson Speech to Text".
I implemented it with reference to this sample source: Getting robots to listen: Using Watson’s Speech to Text service
Connect to Watson using the library for Watson (watson-developer-cloud-0.23.0).
class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")
        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."
    # websocket(Connect)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
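The Authorization header and the start message can also be built with helper functions instead of string literals. This is just a sketch: `basic_auth_header` and `start_message` are hypothetical names, and the parameter values are copied from the code above. It uses `base64.b64encode`, which does not insert newlines (so the `.replace("\n", "")` is not needed) and is still available on current Python 3, where `base64.encodestring` has been removed.

```python
import base64
import json


def basic_auth_header(username, password):
    """Build an ("Authorization", "Basic ...") tuple for HTTP basic auth."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return ("Authorization", "Basic " + token.decode("ascii"))


def start_message(rate=16000, inactivity_timeout=10):
    """Serialize the same start parameters that opened() sends as a literal string."""
    return json.dumps({
        "action": "start",
        "content-type": "audio/l16;rate=%d" % rate,
        "continuous": True,
        "inactivity_timeout": inactivity_timeout,
        "interim_results": True,
    })
```

Using `json.dumps` avoids quoting mistakes when a parameter such as the sample rate needs to change.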
The output (voice data) of the arecord command running in the thread described above is sent to Watson. It's a little long, but putting together 2. Voice recording and 3. the connection from Raspberry Pi 3 to Watson (Speech to Text) looks like this.
class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")
        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."
    # websocket(Connect)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
        self.stream_audio_thread = threading.Thread(target=self.stream_audio)
        self.stream_audio_thread.start() 
        
    #Start recording process
    def stream_audio(self):
        while not self.listening:
            time.sleep(0.1)
        # Suppress messages with the -q option
        reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
        p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
        print 'Ready. Please voice'
        while self.listening:
            data = p.stdout.read(1024)
            try: 
                self.send(bytearray(data), binary=True)
            except ssl.SSLError: pass
When connected over WebSocket, it seems that the analysis results from Watson can be received in the received_message event.
    # websocket(Receive message)
    def received_message(self, message):
        print message 
The analysis result seems to be returned as a JSON object.
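As a rough sketch of how such a message could be unpacked: `handle_message` is a hypothetical helper, and the message shapes are my assumption about the Speech to Text WebSocket responses, namely a `{"state": "listening"}` message and a `results` list whose entries hold `alternatives` with a `transcript` plus a `final` flag.

```python
import json


def handle_message(text):
    """Return (state, transcript, final) parsed from one JSON message."""
    message = json.loads(text)
    state = message.get("state")
    transcript, final = None, False
    # Assumed shape: {"results": [{"alternatives": [{"transcript": ...}], "final": bool}]}
    results = message.get("results") or []
    if results:
        result = results[0]
        transcript = result["alternatives"][0]["transcript"]
        final = bool(result.get("final"))
    return state, transcript, final
```

Inside `received_message`, the returned state could be used to set `self.listening = True` once Watson reports "listening" (the flag `stream_audio` waits on), and only the transcripts marked final could be printed.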
In this way, I was able to convert speech into text in real time.
 
Postscript (2017/4/16): I made a video like this. https://youtu.be/IvWaHISF6nY
My impression is that speech is not recognized well when several people are talking or when there is music. Still, I thought it was simply amazing to see speech turn into text in real time. I want to keep playing with speech recognition.