I recently got into American football (NFL). However, I can't understand English... Even if I can't follow the spoken audio, could I make sense of it once it is written down? With that thought, I tried transcribing player interviews to text with Raspberry Pi 3 × Julius × Watson (Speech to Text).
The overall picture looks like this: Getting robots to listen: Using Watson’s Speech to Text service
The following are assumed to be ready. For reference, the articles I consulted are listed under each item.
-- Enable the microphone on Raspberry Pi 3
   - Easy to do! Conversation with Raspberry pi using speech recognition and speech synthesis
   - Try voice recognition and voice synthesis with Raspberry Pi 2
-- Install Julius on Raspberry Pi 3
   - Voice recognition by Julius-Utilization of domestic open source library
-- Register as a Watson user (it seems that all services can be used free of charge for one month after registration)
Julius appears to support both a reading file and a grammar file for speeding up recognition. After trying both, I decided to use a grammar file this time.
For the results of that comparison, please refer to "Raspberry Pi 3 x Julius (reading file and grammar file)".
If you start Julius in module mode (*), the recognition result is returned as XML. If you say "Start Watson", you get the following XML.

<RECOGOUT>
  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
  </SHYPO>
</RECOGOUT>
<RECOGOUT>
  <SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">
    <WHYPO WORD="Watson started" CLASSID="Watson started" PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>
So the script parses the XML and, for each recognized word, runs the corresponding action. (It's not elegant, just hard-coded ...)

#Judge the recognized word and act on it
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson
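
The hard-coded if chain above could also be written as a lookup table. The following is only a minimal sketch, not the code used in this article: it is written for Python 3, and `ACTIONS` and `decide` are hypothetical names. It takes the word-to-confidence dictionary built from the XML and reports whether Watson should be started.

```python
# Hypothetical table: recognized word -> (reply to print, start Watson?)
ACTIONS = {
    "Raspberry pi": ("Yes. What is it?", False),
    "Watson": ("Roger that. prepare.", True),
}


def decide(words):
    """Return True if any recognized word asks for Watson to start.

    `words` maps each recognized word to its confidence (the CM attribute).
    """
    start_watson = False
    for word in words:
        reply, starts = ACTIONS.get(word, (None, False))
        if reply is not None:
            print(reply)
        start_watson = start_watson or starts
    return start_watson


if __name__ == "__main__":
    print(decide({"Watson": "0.791"}))
```

Adding a new voice command then only means adding one entry to the table.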
The Julius server is started as a subprocess.

#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # The -nolog option suppresses log output
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    # Give julius time to start up
    time.sleep(3.0)
    return p

#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()
    return sock
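
invoke_julius waits a fixed three seconds before the client connects. One alternative, shown here only as a sketch, is to retry the connection until the port accepts it. The code is Python 3, and `connect_with_retry` and its parameters are hypothetical names that are not part of this article's script. The demonstration connects to a throwaway local listener instead of Julius.

```python
import socket
import time


def connect_with_retry(host, port, attempts=10, delay=0.5):
    """Try to open a TCP connection, retrying while the server starts up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError as error:  # for example, connection refused
            last_error = error
            time.sleep(delay)
    raise last_error


if __name__ == "__main__":
    # Stand-in for the Julius server: a local listener on a free port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    client = connect_with_retry("127.0.0.1", port)
    print("connected to port", port)
    client.close()
    server.close()
```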
As mentioned above, Julius returns XML, so the script extracts the <RECOGOUT> element from the stream. The <s> and </s> tokens in the output cause an error during XML parsing, so the script also strips them.

#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True
    return xml,xml_buff,final

# Remove <s> tags (they caused an error during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str

#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the words of the recognition result
    #Save the results in a dictionary (word -> confidence)
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list
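
To see the three steps above working together (buffering one <RECOGOUT> element, stripping <s> tokens, and reading WORD and CM), here is a condensed Python 3 sketch. `collect_words` and `SAMPLE` are hypothetical names. The sample lines follow the XML shown earlier, and the `<s>` entry is only an assumed example of a token that breaks the parser.

```python
from xml.etree.ElementTree import fromstring

# Sample input in the shape shown earlier. The <s> entry is an assumed
# example of the kind of token that makes XML parsing fail.
SAMPLE = [
    "<RECOGOUT>\n",
    '  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">\n',
    '    <WHYPO WORD="<s>" CLASSID="<s>" PHONE="silB" CM="1.000"/>\n',
    '    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>\n',
    "  </SHYPO>\n",
    "</RECOGOUT>\n",
    ".\n",
]


def collect_words(lines):
    """Buffer the first <RECOGOUT> element and return {word: confidence}."""
    buffer = []
    inside = False
    complete = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<RECOGOUT>"):
            inside = True
            buffer = [stripped]
        elif stripped.startswith("</RECOGOUT>") and inside:
            buffer.append(stripped)
            complete = True
            break
        elif inside:
            # Drop the <s> and </s> tokens before the line reaches the parser.
            buffer.append(stripped.replace("</s>", "").replace("<s>", ""))
    if not complete:
        return {}
    root = fromstring("".join(buffer))
    # Entries whose WORD became empty after stripping are skipped.
    return {w.get("WORD"): w.get("CM") for w in root.iter("WHYPO") if w.get("WORD")}


if __name__ == "__main__":
    print(collect_words(SAMPLE))
```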
It's a little long, but the whole thing from 1.1 to 1.3 looks like this.
#Imports needed by this script (Python 2)
import cStringIO
import logging
import socket
import subprocess
import time
import traceback
from itertools import izip
from xml.etree.ElementTree import fromstring

#Extract the specified tag from the data obtained from julius
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True
    return xml,xml_buff,final

# Remove <s> tags (they caused an error during XML parsing)
def escape(line):
    str = line.replace("<s>",'')
    str = str.replace('</s>','')
    return str

#Parse the XML of julius analysis results
def parse_recogout(xml_data):
    #Get the words of the recognition result
    #Save the results in a dictionary (word -> confidence)
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if ("[s]" in word) == False:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list

#Judge the recognized word and act on it
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson

#Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

#Start julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # The -nolog option suppresses log output
    # (Eventually, write the log to a file with the -logfile option or similar)
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    # Give julius time to start up
    time.sleep(3.0)
    return p

#Stop the julius server
def kill_process(julius):
    logging.debug("kill_process")
    julius.kill()
    time.sleep(3.0)

#Connect with julius
def create_socket():
    logging.debug("create_socket")
    # Connect to julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))
    sock_file = sock.makefile()
    return sock

#Close connection with julius
def close_socket(sock):
    logging.debug("close_socket")
    sock.close()

#Main processing
def main():
    #Start julius server
    julius = invoke_julius()
    #Connect to julius
    sock = create_socket()
    julius_listening = True
    bufsize = 4096
    xml_buff = ""
    xml_in = False
    xml_final = False
    watson = False
    while julius_listening:
        #Get analysis result from julius
        data = cStringIO.StringIO(sock.recv(bufsize))
        #Get one line from the analysis result
        line = data.readline()
        while line:
            #Extract and process only the lines that hold the recognition
            #result, that is, only the RECOGOUT tag.
            xml_in, xml_buff, xml_final = extract_xml('RECOGOUT', xml_in, xml_buff, line)
            if xml_final:
                #Parse the XML
                logging.debug(xml_buff)
                xml_data = fromstring(xml_buff)
                watson = decision_word( parse_recogout(xml_data))
                xml_final = False
                #If the result is "Watson", move on to speech recognition
                if watson:
                    julius_listening = False #Julius finished
                    break
            #Get the next line from the analysis result
            line = data.readline()
    #Close socket
    close_socket(sock)
    #Stop julius. Watson "Speech to Text" records with arecord, so julius is
    #stopped first (otherwise the two contend for the microphone device)
    kill_process(julius)
    #If "Watson" was recognized, execute steps ③ and ④
    if watson:
        speechToText()

def initial_setting():
    #Log settings
    logging.basicConfig(filename='websocket_julius2.log', filemode='w', level=logging.DEBUG)
    logging.debug("initial_setting")

if __name__ == "__main__":
    try:
        #Initialization process
        initial_setting()
        #Main processing
        main()
    except Exception as e:
        print "error occurred", e, traceback.format_exc()
    finally:
        print "websocket_julius2...end"
The recording process (the arecord command) is started in a separate thread. Each chunk of recorded binary data is sent to Watson as soon as it is read, so that the speech can be converted to text in real time. (* The data exchange with Watson is described later.)

def opened(self):
    self.stream_audio_thread = threading.Thread(target=self.stream_audio)
    self.stream_audio_thread.start()

#Start recording process
def stream_audio(self):
    # The -q option suppresses messages
    reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
    p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
    print 'Ready. Please speak.'
    while self.listening:
        data = p.stdout.read(1024)
        try:
            # Pass the binary data to watson
            self.send(bytearray(data), binary=True)
        except ssl.SSLError: pass
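
The core of the recording loop is "read a fixed-size chunk from the child process, hand it to a sender, repeat". The Python 3 sketch below isolates that pattern so it can be tried without a microphone. `pump_chunks` is a hypothetical name, and a small Python child process stands in for arecord. Unlike the loop above, this sketch stops when the child closes its output and then cleans up the child process. That is an addition of mine, not something the article's code does.

```python
import subprocess
import sys


def pump_chunks(command, send, chunk_size=1024):
    """Run `command` and pass its stdout to `send` in chunks until it ends."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    total = 0
    try:
        while True:
            data = process.stdout.read(chunk_size)
            if not data:  # empty read: the recorder closed its output
                break
            send(data)
            total += len(data)
    finally:
        process.stdout.close()
        process.terminate()
        process.wait()
    return total


if __name__ == "__main__":
    # Stand-in for arecord: a child process that writes 2500 zero bytes.
    fake_recorder = [sys.executable, "-c",
                     "import sys; sys.stdout.buffer.write(bytes(2500))"]
    chunks = []
    sent = pump_chunks(fake_recorder, chunks.append)
    print(sent, "bytes sent in", len(chunks), "chunks")
```

In the real client, `send` would be something like `lambda data: self.send(bytearray(data), binary=True)`.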
The WebSocket version of Speech to Text is used to convert speech to text in real time. For more on Speech to Text, please also refer to "I tried Watson Speech to Text".
I implemented this with reference to the sample source in "Getting robots to listen: Using Watson’s Speech to Text service".
Connect to Watson using the library for Watson (watson-developer-cloud-0.23.0).

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")
        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."

    # websocket (connected)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
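
The two strings built by hand above (the Basic authentication header and the JSON "start" message) can also be generated with the standard library, which avoids quoting mistakes. This is a Python 3 sketch: `basic_auth_header` and `start_message` are hypothetical helper names, and the field values are simply the ones used in the code above.

```python
import base64
import json


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header as a (name, value) pair."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return ("Authorization", "Basic " + token)


def start_message(rate=16000):
    """Serialize the same 'start' action that the article sends by hand."""
    return json.dumps({
        "action": "start",
        "content-type": "audio/l16;rate=%d" % rate,
        "continuous": True,
        "inactivity_timeout": 10,
        "interim_results": True,
    })


if __name__ == "__main__":
    print(basic_auth_header("XXXXXXX", "XXXXXXX"))
    print(start_message())
```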
The output (audio data) of the arecord command running in the thread described above is sent to Watson. It's a little long, but putting together "2. Voice recording" and "3. Connection from Raspberry Pi 3 to Watson (Speech to Text)" gives the following.

#Imports needed by this class (Python 2)
import base64
import ssl
import subprocess
import threading
import time
#Assumption: WebSocketClient comes from ws4py, as in the referenced sample
from ws4py.client.threadedclient import WebSocketClient

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")
        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."

    # websocket (connected)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
        self.stream_audio_thread = threading.Thread(target=self.stream_audio)
        self.stream_audio_thread.start()

    #Start recording process
    def stream_audio(self):
        # Wait until self.listening becomes True
        while not self.listening:
            time.sleep(0.1)
        # The -q option suppresses messages
        reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
        p = subprocess.Popen(reccmd,stdout=subprocess.PIPE)
        print 'Ready. Please speak.'
        while self.listening:
            data = p.stdout.read(1024)
            try:
                self.send(bytearray(data), binary=True)
            except ssl.SSLError: pass
With a WebSocket connection, it seems that the recognition results from Watson can be received in the received_message event.

    # websocket (receive message)
    def received_message(self, message):
        print message

The recognition result seems to be returned as a JSON object.
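
Instead of printing the raw message, the transcript can be pulled out of the JSON. The Python 3 sketch below assumes the message has a `results` list whose items carry `alternatives` (each with a `transcript`) and a `final` flag. The sample string is made up for illustration and is not output captured for this article. `extract_transcripts` is a hypothetical name.

```python
import json


def extract_transcripts(message):
    """Return (transcript, is_final) pairs from one JSON message."""
    payload = json.loads(message)
    pairs = []
    for result in payload.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            text = alternatives[0].get("transcript", "").strip()
            pairs.append((text, bool(result.get("final", False))))
    return pairs


if __name__ == "__main__":
    # Made-up sample in the assumed shape of a recognition result.
    sample = ('{"results": [{"alternatives": [{"transcript": "hello world "}],'
              ' "final": true}], "result_index": 0}')
    for text, is_final in extract_transcripts(sample):
        print("final" if is_final else "interim", text)
```

Messages without a `results` key, such as state notifications, simply yield an empty list.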
In this way, I was able to convert speech to text in real time.
Postscript (2017/4/16): I made a video like this. https://youtu.be/IvWaHISF6nY
My impression is that speech is not recognized well when several people are talking or when music is playing. Still, I thought it was simply amazing that speech became text in real time. I want to keep playing with speech recognition.