Given the state of the world, Web drinking parties are all the rage.
However, after enough of these gatherings, there inevitably comes a moment when nobody speaks. Everyone has been stuck at home the whole time, so there isn't much news to share in the first place. Some people probably think they should just skip such parties, but it's hard, because there's no good reason to decline.
No matter how close the participants are, the party falls silent once the topics run out.
The party itself drags on lazily, so nobody worries about it out loud, but that kind of dead air is the worst, whatever the reason.
So this time, I'd like to write a Python program that detects the silence ** (and the awkwardness that comes with it) ** at a Web drinking party and plays audio to break it. (What a relief.)
I usually use ** Zoom ** for Web conferences and other meetings, so that's the target here.
In practice, though, the program monitors the audio of the entire system, so it should work with just about any conferencing software.
The goal: monitor Zoom's audio for 10 seconds, and if the program decides there is no input, i.e. silence, play a music file chosen at random from a specified folder.
I went with ** Python ** simply because it was the first thing that came to mind; there's no deeper reason. The operating environment is as follows.
■Windows10 ■Python3.7
The libraries used are as follows.
```python
import pyaudio
import numpy as np
import wave
import math
from mutagen.mp3 import MP3 as mp3
import pygame
import time
import glob
import random
import sys
```
I personally wanted the best possible sound quality on Zoom, so I use a separate audio interface and microphone (Zoom probably compresses the audio heavily anyway, so it may not matter much; it's pure self-satisfaction).
■marantz / AUDIO SCOPE SG-5BC ■CREATIVE / SB X-Fi Surround 5.1
Now let's write the program.
```python
audio = pyaudio.PyAudio()

def system(FORMAT, CHANNELS, RATE, CHUNK):
    stream = audio.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        output=True,
                        input_device_index=1,   # ← change to a suitable index
                        output_device_index=7,  # ← change to a suitable index
                        frames_per_buffer=CHUNK)
    return stream
```
First, we create a `pyaudio.PyAudio()` instance and use it to monitor the audio input from the microphone.
The input source to monitor is selected by the `input_device_index` value.
If you don't know the device index values, you can look them up with the following code.

```python
for index in range(audio.get_device_count()):
    print(audio.get_device_info_by_index(index))
```
Taking my environment as an example, the output is as follows.
{'index': 0, 'structVersion': 2, 'name': 'Microsoft Sound Mapper- Input', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 1, 'structVersion': 2, 'name': 'Playback redirect(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 2, 'structVersion': 2, 'name': 'line(USB2.0 High-Speed True HD ', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 3, 'structVersion': 2, 'name': 'line/Microphone input(SB X-Fi Surround 5.1', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 4, 'structVersion': 2, 'name': 'SPDIF In (USB2.0 High-Speed Tru', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 5, 'structVersion': 2, 'name': 'Microphone(USB2.0 High-Speed True HD ', 'hostApi': 0, 'maxInputChannels': 2, 'maxOutputChannels': 0, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 6, 'structVersion': 2, 'name': 'Microsoft Sound Mapper- Output', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 2, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 7, 'structVersion': 2, 'name': 'speaker(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 6, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 8, 'structVersion': 2, 'name': 'SPDIF output(SB X-Fi Surround 5.1)', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 6, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 9, 'structVersion': 2, 'name': 'SPDIF Out (USB2.0 High-Speed Tr', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 2, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
{'index': 10, 'structVersion': 2, 'name': 'speaker(USB2.0 High-Speed True H', 'hostApi': 0, 'maxInputChannels': 0, 'maxOutputChannels': 8, 'defaultLowInputLatency': 0.09, 'defaultLowOutputLatency': 0.09, 'defaultHighInputLatency': 0.18, 'defaultHighOutputLatency': 0.18, 'defaultSampleRate': 44100.0}
Since several audio interfaces are connected in my environment, there are quite a few indexes, as you can see.
This time, we need to pick up not only our own microphone input but also the other participants' voices and any shared-screen audio coming through Zoom.
So the input index used here is 1, the playback redirect, which monitors the entire system's output. Check this value in your own environment and substitute the appropriate one.
There is also an `output_device_index` item; it is specified because I also want to play an audio file in `wav` format later. If you have no plans to play a `wav` file, you don't need it. Likewise, wrapping this in a function is just a convenience; if you don't need that, you can pass `FORMAT, CHANNELS, RATE, CHUNK` directly without defining a function.
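Rather than hardcoding the numbers 1 and 7, you can also pick the index by matching a substring of the device name in the dicts printed above. This helper is my own addition, not part of the original program, and it operates on the info dicts so it can be tried without hardware attached:

```python
# Hypothetical helper (not in the original article): pick a device index by a
# substring of its name, requiring it to have input or output channels.
def find_device_index(device_infos, name_substring, want_input=True):
    key = "maxInputChannels" if want_input else "maxOutputChannels"
    for info in device_infos:
        if name_substring in info["name"] and info[key] > 0:
            return info["index"]
    return None

# With real PyAudio you would build the list like:
#   infos = [audio.get_device_info_by_index(i) for i in range(audio.get_device_count())]
devices = [
    {"index": 1, "name": "Playback redirect(SB X-Fi Surround 5.1)",
     "maxInputChannels": 2, "maxOutputChannels": 0},
    {"index": 7, "name": "speaker(SB X-Fi Surround 5.1)",
     "maxInputChannels": 0, "maxOutputChannels": 6},
]
print(find_device_index(devices, "Playback redirect"))          # → 1
print(find_device_index(devices, "speaker", want_input=False))  # → 7
```

This way the program keeps working even if plugging in another interface shuffles the index numbers.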
```python
frames = []

def surveillance():
    print("Under surveillance...")
    FORMAT = pyaudio.paInt16
    CHANNELS = 1         # monaural
    RATE = 44100         # sample rate
    CHUNK = 2 ** 11      # number of frames per buffer
    RECORD_SECONDS = 10  # length of time to monitor
    stream = system(FORMAT, CHANNELS, RATE, CHUNK)
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        buf = stream.read(CHUNK)
        data = np.frombuffer(buf, dtype="int16")
        frames.append(max(data))
    stream.stop_stream()
    calculation()
```
The function name means "monitor".
Here, audio is read into Python for 10 seconds, and the ** maximum positive value ** of the waveform in each chunk is extracted and appended to `frames`.
To explain a little: since `RATE = 44100`, the sampling frequency is 44.1 kHz, meaning we obtain 44,100 amplitude values per second.
Sound is a wave, and as a wave it naturally includes negative values.
If you wanted the exact level every 1/44,100 of a second you would need the absolute values, but here we only need to judge whether there was any sound within the 10 seconds, so saving only the maximum is enough.
`np.frombuffer` interprets the captured bytes as signed 16-bit samples, i.e. a dynamic range of 2 ** 16 = 65,536 steps.
However, as mentioned, the values are split between positive and negative, so the maximum possible value is 32767.
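To get a feel for the scale, here is a quick sketch (my own, not from the original program) of the `20 * log10` conversion used below, applied to that 32767 full-scale peak:

```python
import math

# Convert an int16 peak value to decibels relative to 1 (the same formula
# the calculation() function uses).
def peak_to_db(peak):
    return 20 * math.log10(peak) if peak > 0 else -math.inf

print(round(peak_to_db(32767), 1))  # full-scale int16 peak → 90.3
print(round(peak_to_db(1), 1))      # smallest nonzero peak → 0.0
```

So every measurable level in this program falls between 0 and roughly 90.3 dB, which is why a threshold like 65 sits comfortably in the middle of the range.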
```python
def calculation():
    print("Calculation")
    peak = max(frames)  # loudest sample seen during the 10 seconds (a peak, not a true RMS)
    db = 20 * math.log10(peak) if peak > 0 else -math.inf
    print(f"Level:{format(db, '3.1f')}[dB]")
    if db <= 65:  # ← adjust the threshold to suit your environment
        random_music()
        # disc_jockey()
    else:
        pass
    frames.clear()
```
Next is the function that decides whether we are in a silence state.
The captured values are converted to a logarithmic (dB) scale, which makes level changes easier to reason about, and the `if` branches on a threshold. In my environment, about 65 dB seems to be a good value; change it to match your own.
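The decision itself can be pulled out into a small function, which makes it easy to try threshold values without a microphone attached. This refactoring is my own, not part of the original program, and 65 dB is just the value that suited my environment:

```python
import math

SILENCE_THRESHOLD_DB = 65  # tune per environment

# Given the list of per-chunk peaks, decide whether the last window was silent.
def is_silent(peaks, threshold_db=SILENCE_THRESHOLD_DB):
    peak = max(peaks)
    db = 20 * math.log10(peak) if peak > 0 else -math.inf
    return db <= threshold_db

print(is_silent([50, 80, 100]))   # peak 100 → 40 dB → True (silence)
print(is_silent([20000, 30000]))  # peak 30000 → about 89.5 dB → False
```

A practical way to calibrate is to sit in a "normal conversation" call, print the dB values for a while, and set the threshold a little below the quietest level you see during actual talking.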
```python
def random_music():
    print("Random music")
    files = glob.glob('./data/*.mp3')          # keep the full path so load() can find the file
    filename = random.choice(files)            # the mp3 file to play
    print(filename)
    pygame.mixer.init()
    pygame.mixer.music.load(filename)          # load the sound source
    mp3_length = mp3(filename).info.length     # get the length of the sound source
    pygame.mixer.music.play(1)                 # start playback; change the 1 to play n times (then also multiply the sleep below by n)
    time.sleep(mp3_length + 0.25)              # after starting playback, wait for the length of the source (+0.25 s margin)
    pygame.mixer.music.stop()                  # stop once the source has finished
```
Finally, this is the function that picks an `mp3` file at random from an arbitrary folder and plays it.
In my case, the files live in a `data` folder directly under the source code directory.
```python
try:
    while True:
        surveillance()
except KeyboardInterrupt:
    print("Emergency stop")
    audio.terminate()
    sys.exit(0)
```
After that, the program simply loops and keeps monitoring.
If something urgent comes up, or the silence is already broken, stop it with Ctrl+C.
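One caveat with this shape of loop: any cleanup placed after `sys.exit(0)` never runs, and the `stream` variable is local to `surveillance()`, so it can't be closed from here. A `try`/`finally` variant guarantees the resources are released on any exit path. This restructuring is my own sketch, not the article's code; `monitor` stands in for the `surveillance` function, and `audio`/`stream` are passed in:

```python
# Sketch: run the monitoring loop, always releasing PyAudio resources,
# even when Ctrl+C interrupts it.
def run(audio, stream, monitor):
    try:
        while True:
            monitor()  # e.g. the surveillance() function above
    except KeyboardInterrupt:
        print("Emergency stop")
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()
```

With this shape, the stream is closed exactly once no matter how the loop ends, instead of relying on the interpreter to clean up on exit.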
```python
def disc_jockey():
    print("Play...")
    filename = "./disc_jockey.wav"
    wf = wave.open(filename, "rb")
    # read the playback parameters from the file itself
    FORMAT = audio.get_format_from_width(wf.getsampwidth())
    CHANNELS = wf.getnchannels()
    RATE = wf.getframerate()
    CHUNK = wf.getnframes()
    stream = system(FORMAT, CHANNELS, RATE, CHUNK)
    data = wf.readframes(CHUNK)
    stream.write(data)  # the stream is already started, so just write the frames
    stream.stop_stream()
    stream.close()
    random_music()
```
By the way, I mentioned at the beginning that I also wanted to play `wav` files, so here is how.
It is basically the same as the recording setup, but in this case the parameters must match the file, so each value is obtained from the file via the `wave` module and passed in.
The function is named disc_jockey because ** music suddenly starting makes no sense to the other participants **, so I created it intending to insert a spoken song introduction first.
And the data I had on hand was a `wav` file of voices imitating ** Chris 〇 Puller **'s "now, here's the song" and ** Ioin Hikaru **. Like this.
... Don't ask me why.
I still can't go outside because of a certain virus, and the silent Web drinking parties go on, but let's all hang in there.