This is a memo of the results of various investigations into whether audio from a microphone connected to a PC can easily be sent to a server.
It's a small tool for internal use, so the spec is simply "just use the latest version of Google Chrome for now".
This is mostly a memorandum for myself; I have little knowledge of JavaScript. I usually use Python, on the server side and for data analysis.
There may well be a good JS library for audio communication that would accomplish everything described below in one shot. If you know of one, please let me know in the comments!
Google's article for web developers, "Get audio data from users", was helpful.
<script>
var handleSuccess = function(stream) {
    var context = new AudioContext();
    var input = context.createMediaStreamSource(stream);
    // Buffer size 1024, 1 input channel, 1 output channel
    var processor = context.createScriptProcessor(1024, 1, 1);

    // WebSocket connection
    var connection = new WebSocket('wss://hogehoge.com:8000/websocket');

    input.connect(processor);
    processor.connect(context.destination);

    processor.onaudioprocess = function(e) {
        // Float32Array of samples in the range -1 to +1
        var voice = e.inputBuffer.getChannelData(0);
        connection.send(voice.buffer);  // Send over the WebSocket
    };
};

navigator.mediaDevices.getUserMedia({ audio: true, video: false })
    .then(handleSuccess);
</script>
There seem to be three main ways to send audio to the server: plain HTTP requests, WebSocket, and WebRTC.
Here, the Qiita articles "Technology for sending from server to client, focusing on WebSocket" and "An introduction to WebSocket / WebRTC technology" were helpful.
For the time being I decided to keep sending the audio buffers over WebSocket, on the grounds that frequent HTTP requests would be wasteful and that WebRTC seemed heavyweight when bidirectional audio communication isn't needed. WebSocket can send and receive either binary data or strings.
For more information, please read this document.
The buffer size in units of sample frames. If specified, it must be one of the following values: 256, 512, 1024, 2048, 4096, 8192, 16384. If it is not passed in, or if the value is 0, the implementation will choose the best buffer size for the given environment, which will be a constant power of 2 throughout the lifetime of the node.
This value controls how frequently the audioprocess event is dispatched and how many sample frames are passed in each call. Lower values result in lower latency; higher values help avoid audio breakup and glitches. It is recommended not to specify this value yourself but to let the implementation pick a good buffer size, to balance latency against audio quality.
I'm not entirely sure what "let the implementation pick a good buffer size" means in practice, but presumably it means passing 0.
This is also discussed on Stack Overflow (2013).
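As a rough sanity check on the latency tradeoff (my own arithmetic, not from the article, assuming the 48000 Hz sample rate used later in this memo), the interval between audioprocess events is just the buffer size divided by the sample rate:

# How often onaudioprocess fires for each allowed buffer size,
# assuming a 48000 Hz sample rate (check AudioContext.sampleRate)
SAMPLE_RATE = 48000
for buffer_size in [256, 512, 1024, 2048, 4096, 8192, 16384]:
    interval_ms = buffer_size / SAMPLE_RATE * 1000
    print(f"{buffer_size:>6} frames -> one event every {interval_ms:5.1f} ms")

So 1024 frames means an event roughly every 21 ms, while 16384 frames means one only every 341 ms.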
Also read the documentation. I'm dealing with monaural sound here; if you need stereo, it seems you'll have to do some extra work.
getChannelData returns the audio data as a Float32Array, that is, real numbers in the range -1 to +1. Note that WAV files (though there seem to be several formats) represent samples as 16-bit signed integers, that is, values between -32768 and 32767.
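Here is a minimal sketch of that conversion (my own illustration using numpy; the clipping is a safety measure I added, not something from the original article):

import numpy as np

# Float32 samples in the range -1.0 to +1.0, as returned by getChannelData()
samples = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

# Scale to the 16-bit signed integer range used by WAV,
# clipping first in case any value strays slightly outside [-1, 1]
pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
print(pcm16)  # [-32767 -16383      0  16383  32767]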
I used Python 3.6, which I'm used to.
I used Tornado, a lightweight Python web framework that is good at asynchronous processing.
import tornado.ioloop
import tornado.web
import tornado.websocket
import wave

import numpy as np

SAMPLE_SIZE = 2          # 2 bytes per sample = 16-bit
SAMPLE_RATE = 48000      # must match AudioContext.sampleRate on the client
PATH = '/path/to/output.wav'

class WebSocketHandler(tornado.websocket.WebSocketHandler):

    def open(self):
        self.voice = []
        print("opened")

    def on_message(self, message):
        # Each message is the raw buffer of a Float32Array
        self.voice.append(np.frombuffer(message, dtype='float32'))

    def on_close(self):
        # Join the received chunks into one flat array
        v = np.concatenate(self.voice)
        # Convert to 16-bit integers and save as a WAV file
        arr = (v * 32767).astype(np.int16)
        with wave.open(PATH, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(SAMPLE_SIZE)
            wf.setframerate(SAMPLE_RATE)
            wf.writeframes(arr.tobytes('C'))
        self.voice.clear()
        print("closed")

app = tornado.web.Application([
    (r"/websocket", WebSocketHandler)
])

if __name__ == "__main__":
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
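To exercise the server without going through a browser, here is a minimal test client (my own sketch, not part of the original setup; it assumes the third-party websockets package and a server running locally on port 8000, and it sends one second of a synthetic 440 Hz tone):

import asyncio

import numpy as np
import websockets  # pip install websockets

async def main():
    # One second of a 440 Hz sine wave as float32 in [-1, 1],
    # mimicking what getChannelData() delivers
    t = np.arange(48000) / 48000
    tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

    async with websockets.connect('ws://localhost:8000/websocket') as ws:
        # Send in 1024-sample chunks, like the ScriptProcessorNode does
        for i in range(0, len(tone), 1024):
            await ws.send(tone[i:i + 1024].tobytes())
    # Closing the connection triggers on_close, which writes the WAV file

asyncio.run(main())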
The documentation for a WebSocket server using Tornado is here.
Other frameworks such as Flask, Bottle, and Django are better known, but I chose Tornado because, like Node.js, it is good at asynchronous processing. There also seemed to be plenty of sample code to use as a reference compared to the other frameworks.
With numpy, Python's numerical computing library, you can easily interpret the received binary data as a numpy array.
Read the numpy.frombuffer reference (https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html) for more information.
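A minimal round-trip example (my own illustration): the raw bytes of a Float32Array on the JavaScript side come back as the same float32 values on the Python side.

import numpy as np

# Pretend this is a binary payload received over the WebSocket:
# the raw bytes of four float32 samples
payload = np.array([0.0, 0.25, -0.5, 1.0], dtype=np.float32).tobytes()

# frombuffer reinterprets the bytes directly, with no parsing step
samples = np.frombuffer(payload, dtype='float32')
print(samples)  # [ 0.    0.25 -0.5   1.  ]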
The sample rate value (48000) depends on the implementation on the JavaScript side, so check it with AudioContext.sampleRate.
I was stuck because I didn't understand the structure of WAV data, so I studied by reading the Chapter 1 explanation in "Sound Programming Starting in C: Signal Processing for Sound Effects".
In the end, though, I got by with Python's standard wave library.
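As a quick sanity check (my own sketch, standard library only), the saved file's parameters can be read back like this:

import wave

# The same placeholder path as the PATH constant in the server above
with wave.open('/path/to/output.wav', 'rb') as wf:
    print(wf.getnchannels())  # 1 (mono)
    print(wf.getsampwidth())  # 2 bytes, i.e. 16-bit samples
    print(wf.getframerate())  # 48000
    print(wf.getnframes())    # total number of sample frames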
That's all.
From here, you could forward the audio to the speech APIs of various cloud services, do audio processing in Python, and so on. Also, I haven't implemented any authentication, so that still needs to be done properly.
The reason for choosing WebSocket is that it would be interesting to push from the server to the client (browser), returning the results of a speech-analysis API or other audio processing in real time.