This is a memo of the results of various investigations into whether audio from a microphone connected to a PC can easily be sent to a server.
It's a small tool for internal use, so the spec is simply "just use the latest version of Google Chrome for now".
This is mostly a memorandum for myself; I have little knowledge of JavaScript. I usually use Python, on the server side and for data analysis.
There may well be a good JS library for audio communication that would accomplish everything described below in one shot. If you know of one, please let me know in the comments!
Google's article for web developers, "Get audio data from users", was helpful.
<script>
var handleSuccess = function(stream) {
    var context = new AudioContext();
    var input = context.createMediaStreamSource(stream);
    // Buffer size 1024, 1 input channel, 1 output channel
    var processor = context.createScriptProcessor(1024, 1, 1);

    // WebSocket connection
    var connection = new WebSocket('wss://hogehoge.com:8000/websocket');

    input.connect(processor);
    processor.connect(context.destination);

    processor.onaudioprocess = function(e) {
        // Float32Array of samples in the range -1 to +1
        var voice = e.inputBuffer.getChannelData(0);
        connection.send(voice.buffer);  // Send over the WebSocket
    };
};

navigator.mediaDevices.getUserMedia({ audio: true, video: false })
    .then(handleSuccess);
</script>
There seem to be three main ways to send audio to the server: plain HTTP requests, WebSocket, and WebRTC.
Here, the Qiita articles "Technology for sending from server to client, focusing on WebSocket" and "An introduction to WebSocket / WebRTC technology" were helpful.
For the time being I decided to keep sending the audio buffers over WebSocket, on the grounds that frequent HTTP requests would be wasteful and that WebRTC seemed heavyweight when bidirectional audio communication isn't needed. WebSocket can send and receive either binary data or strings.
For more information, please read this document.
The buffer size in units of sample frames. If specified, it must be one of the following values: 256, 512, 1024, 2048, 4096, 8192, 16384. If it is not passed in, or if the value is 0, the implementation will choose the best buffer size for the given environment, which will be a constant power of 2 throughout the lifetime of the node.
This value controls how frequently the audioprocess event is dispatched and how many sample frames are passed in each call. Lower values result in lower latency; higher values help avoid audio breakup and glitches. It is recommended not to specify this value yourself but to let the implementation pick a good buffer size, to balance latency against audio quality.
I'm not entirely sure what "let the implementation pick a good buffer size" means in practice, but presumably it means passing 0.
This is also discussed on Stack Overflow (2013).
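As a rough sanity check on the latency tradeoff (my own arithmetic, not from the article, assuming the 48000 Hz sample rate used later in this memo), the interval between audioprocess events is just the buffer size divided by the sample rate:

# How often onaudioprocess fires for each allowed buffer size,
# assuming a 48000 Hz sample rate (check AudioContext.sampleRate)
SAMPLE_RATE = 48000
for buffer_size in [256, 512, 1024, 2048, 4096, 8192, 16384]:
    interval_ms = buffer_size / SAMPLE_RATE * 1000
    print(f"{buffer_size:>6} frames -> one event every {interval_ms:5.1f} ms")

So 1024 frames means an event roughly every 21 ms, while 16384 frames means one only every 341 ms.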
Also read the documentation. I'm dealing with monaural sound here; if you need stereo, it seems you'll have to do some extra work.
getChannelData returns the audio data as a Float32Array, that is, real numbers in the range -1 to +1. Note that WAV files (though there seem to be several formats) represent samples as 16-bit signed integers, that is, values between -32768 and 32767.
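Here is a minimal sketch of that conversion (my own illustration using numpy; the clipping is a safety measure I added, not something from the original article):

import numpy as np

# Float32 samples in the range -1.0 to +1.0, as returned by getChannelData()
samples = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

# Scale to the 16-bit signed integer range used by WAV,
# clipping first in case any value strays slightly outside [-1, 1]
pcm16 = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
print(pcm16)  # [-32767 -16383      0  16383  32767]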
I used Python 3.6, which I'm used to.
I used Tornado, a lightweight Python web framework that is good at asynchronous processing.
import tornado.ioloop
import tornado.web
import tornado.websocket
import wave

import numpy as np

SAMPLE_SIZE = 2          # 2 bytes per sample = 16-bit
SAMPLE_RATE = 48000      # must match AudioContext.sampleRate on the client
PATH = '/path/to/output.wav'

class WebSocketHandler(tornado.websocket.WebSocketHandler):

    def open(self):
        self.voice = []
        print("opened")

    def on_message(self, message):
        # Each message is the raw buffer of a Float32Array
        self.voice.append(np.frombuffer(message, dtype='float32'))

    def on_close(self):
        # Join the received chunks into one flat array
        v = np.concatenate(self.voice)
        # Convert to 16-bit integers and save as a WAV file
        arr = (v * 32767).astype(np.int16)
        with wave.open(PATH, 'wb') as wf:
            wf.setnchannels(1)
            wf.setsampwidth(SAMPLE_SIZE)
            wf.setframerate(SAMPLE_RATE)
            wf.writeframes(arr.tobytes('C'))
        self.voice.clear()
        print("closed")

app = tornado.web.Application([
    (r"/websocket", WebSocketHandler)
])

if __name__ == "__main__":
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
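To exercise the server without going through a browser, here is a minimal test client (my own sketch, not part of the original setup; it assumes the third-party websockets package and a server running locally on port 8000, and it sends one second of a synthetic 440 Hz tone):

import asyncio

import numpy as np
import websockets  # pip install websockets

async def main():
    # One second of a 440 Hz sine wave as float32 in [-1, 1],
    # mimicking what getChannelData() delivers
    t = np.arange(48000) / 48000
    tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

    async with websockets.connect('ws://localhost:8000/websocket') as ws:
        # Send in 1024-sample chunks, like the ScriptProcessorNode does
        for i in range(0, len(tone), 1024):
            await ws.send(tone[i:i + 1024].tobytes())
    # Closing the connection triggers on_close, which writes the WAV file

asyncio.run(main())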
The documentation for a WebSocket server using Tornado is here.
Other frameworks such as Flask, Bottle, and Django are better known, but I chose Tornado because, like Node.js, it is good at asynchronous processing. There also seemed to be plenty of sample code to use as a reference compared to the other frameworks.
With numpy, Python's numerical computing library, you can easily interpret the received binary data as a numpy array.
Read the numpy.frombuffer reference (https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html) for more information.
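A minimal round-trip example (my own illustration): the raw bytes of a Float32Array on the JavaScript side come back as the same float32 values on the Python side.

import numpy as np

# Pretend this is a binary payload received over the WebSocket:
# the raw bytes of four float32 samples
payload = np.array([0.0, 0.25, -0.5, 1.0], dtype=np.float32).tobytes()

# frombuffer reinterprets the bytes directly, with no parsing step
samples = np.frombuffer(payload, dtype='float32')
print(samples)  # [ 0.    0.25 -0.5   1.  ]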
The sample rate value (48000) depends on the implementation on the JavaScript side, so check it with AudioContext.sampleRate.
I was stuck because I didn't understand the structure of WAV data, so I studied by reading the Chapter 1 explanation in "Sound Programming Starting in C: Signal Processing for Sound Effects".
In the end, though, I got by with Python's standard wave library.
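As a quick sanity check (my own sketch, standard library only), the saved file's parameters can be read back like this:

import wave

# The same placeholder path as the PATH constant in the server above
with wave.open('/path/to/output.wav', 'rb') as wf:
    print(wf.getnchannels())  # 1 (mono)
    print(wf.getsampwidth())  # 2 bytes, i.e. 16-bit samples
    print(wf.getframerate())  # 48000
    print(wf.getnframes())    # total number of sample frames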
That's all.
From here, you could forward the audio to the speech APIs of various cloud services, do audio processing in Python, and so on. Also, I haven't implemented any authentication, so that still needs to be done properly.
The reason for choosing WebSocket is that it would be interesting to push from the server to the client (browser), returning the results of a speech-analysis API or other audio processing in real time.