In Exercise 3 last time, we practiced the components of the "video analysis" system. This time we will explain the components of the "voice analysis" system, using a level meter as the subject. The components of the voice analysis system are almost the same as those of the video analysis system; the difference is that in video analysis the data to be analyzed is delivered by the function new_video_frame(), whereas in voice analysis it is delivered by the function new_audio_frame() (https://www.remotte.jp/ja/user_guide/program/functions).
Create a new app as you did in the previous exercises.
Select "Media" from the menu at the bottom left of the "Configuration" screen, and add an element to be used as the voice input.
Once you have added the audio source, it is time to add the voice analysis component. Three "input / output types" are available:

- Voice analysis (value output only)
- Voice analysis (voice output only)
- Voice analysis (outputs both value and voice)

A component that outputs voice could, for example, shift the input sound up or down an octave, or output the audio with noise removed. In this level meter exercise, the only output we need from the analysis is the level of the input stereo sound (the maximum amplitude in each frame), and no audio is output, so select "Voice analysis (value output only)". Also, in order to output the left and right maximum levels as the analysis result, select "general binary sense" as the "compatible type".

Next, write the Python code on the "Code" screen. In voice analysis, the data to be analyzed is delivered to the Python side through the function `new_audio_frame(self, audio_frame, time_sec)`; the analysis is performed inside this function and the result is reported back to the platform side. The audio data is passed in the argument `audio_frame` as a numpy.ndarray, and the elapsed time (in seconds) since the application started is passed in `time_sec` as a float. The audio data is 16-bit, 2-channel, sampled at 48 kHz, and new_audio_frame() is called every 20 milliseconds. In other words, this function is called 50 times per second, and `audio_frame` stores 960 amplitude values per channel, interleaved. For reference, if you put `print(type(audio_frame), len(audio_frame), type(audio_frame[0]))` at the beginning of the function and run the application, the following is output to the console: `<class 'numpy.ndarray'> 1920 <class 'numpy.int16'>`. Similarly, if you put `print(time_sec)` at the beginning of the function and run it, 50 time values are output every second.
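As a sanity check outside the platform, you can reproduce the frame layout described above with a synthetic interleaved stereo frame. The signal contents (two sine tones) are illustrative; only the sizes and dtype follow the specification quoted above.

```python
import numpy as np

SAMPLE_RATE = 48000          # 48 kHz, per the platform spec above
FRAME_MS = 20                # new_audio_frame() is called every 20 ms
SAMPLES_PER_CHANNEL = SAMPLE_RATE * FRAME_MS // 1000  # 960
CHANNELS = 2

# Build a synthetic interleaved stereo frame like the one the platform
# passes to new_audio_frame(): [L0, R0, L1, R1, ...]
t = np.arange(SAMPLES_PER_CHANNEL) / SAMPLE_RATE
left = (np.sin(2 * np.pi * 440 * t) * 32767 * 0.5).astype(np.int16)
right = (np.sin(2 * np.pi * 880 * t) * 32767 * 0.25).astype(np.int16)
audio_frame = np.empty(SAMPLES_PER_CHANNEL * CHANNELS, dtype=np.int16)
audio_frame[0::2] = left     # even indices: left channel
audio_frame[1::2] = right    # odd indices: right channel

print(type(audio_frame), len(audio_frame), type(audio_frame[0]))
# Matches the console output quoted above: an ndarray of 1920 int16 values.
```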
Attention! As a platform specification, for audio and video analysis components, the output of print() executed inside the function `__init__()` is not displayed on the console.
Updating the browser display with the maximum voice level 50 times per second would be too heavy a load, so we throttle this to one fifth and update the data only 10 times per second. Enter the following as the source code:
```python
def __init__(self, sys, opt, log):
    self._sys = sys
    self._opt = opt
    self._log = log
    self._count = 0

def new_audio_frame(self, audio_frame, time_sec):
    if self._count == 0:
        left = audio_frame[0::2]     # even indices: left channel
        right = audio_frame[1::2]    # odd indices: right channel
        max_left = int(left.max() / 32767 * 100)
        max_right = int(right.max() / 32767 * 100)
        self._sys.set_value({'value': [max_left, max_right]})
    self._count += 1
    if self._count == 5:             # every 5th frame = 10 updates/sec
        self._count = 0
```
Every 0.1 seconds, the maximum level of each of the left and right channels is scaled to the range 0 to 100 and output.
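The scaling can be checked with a tiny hypothetical frame (the sample values below are made up for illustration). Note that `.max()` looks only at the positive peak of the frame; `np.abs(frame).max()` would also catch negative peaks, but here we keep the exact formula from the code above.

```python
import numpy as np

# Hypothetical 6-sample interleaved frame, just to check the scaling:
frame = np.array([0, 100, 16383, 200, 32767, 300], dtype=np.int16)
left = frame[0::2]    # [0, 16383, 32767]
right = frame[1::2]   # [100, 200, 300]

max_left = int(left.max() / 32767 * 100)    # full-scale peak -> 100
max_right = int(right.max() / 32767 * 100)  # 300/32767 is under 1% -> 0
print(max_left, max_right)  # 100 0
```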
On the "Display item" screen, set as follows. Note that the "number of displays" for the latest value of the "level meter" is set to 4. Next, select the "Layout" screen. By default, the display format "two numerical display" is set for all four display items. As the Python code above shows, the analysis result is output as [maximum value on the left, maximum value on the right], but there is a way to extract the left and right values separately. In the option settings on the right side of the screen there is a group called "Advanced Settings". By setting an integer value for the option "Extract from array" in this group, the element at an arbitrary position can be extracted from the array data. In this app, for example, entering "0" extracts the element at index 0 of the array stored under the key "value", that is, the maximum value on the left. Similarly, setting "1" gives the maximum value on the right. Let's use this option to create a layout like the one below: two display items are used for each of the left and right levels, one showing the value from 0 to 100 with the "display of one numerical value" format, and the other showing it graphically with a "step meter (horizontal, colored area)".
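Conceptually, "Extract from array" does nothing more than an index lookup on the array sent with set_value(). The sketch below re-creates that behavior in plain Python to make the index-to-channel mapping concrete; the helper name and the sample values are ours, not a platform API.

```python
# The dict mirrors the set_value() call in the code above; the level
# values 72 and 64 are made-up sample readings.
result = {'value': [72, 64]}

def extract_from_array(data, index):
    # Mimics the "Extract from array" display option (assumption: the
    # platform performs a plain index lookup like this).
    return data['value'][index]

left_level = extract_from_array(result, 0)   # "Extract from array" = 0 -> left
right_level = extract_from_array(result, 1)  # "Extract from array" = 1 -> right
print(left_level, right_level)  # 72 64
```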
When you run the app and speak loudly into the microphone, the level meter swings widely.
In this exercise, we experienced the components of the "voice analysis" system. The key point to learn is that the audio data supplied by the platform is analyzed inside the function new_audio_frame(). Although it is not used in this exercise, when the analysis outputs audio different from the input audio, the set_audio_frame() function is used to send the audio data to the platform side.
So far, across four exercises, we have explained application development with Remotte using a "get used to it rather than study it" approach. Next time, we will organize the knowledge acquired so far as programming technical reference information.