I took on the second "Neural Network Console Challenge", an AI development contest organized by Sony and Ledge. By analyzing Audiostock's audio (BGM) data and its metadata (one-line song descriptions), I tackled the free task **"Create a player that automatically selects BGM according to the content of everyday conversation"**. The idea is a system in which a smart speaker such as Google Home automatically plays BGM matching the conversation of the people in the room (although the privacy implications make the hurdle for practical use seem high...).
・Google Colaboratory (Python 3)
・Neural Network Console (Windows version)
Work No. | Data name | One line explanation | tag
---|---|---|---
42554 | audiostock_42554.wav | The perfect song for an opening | opening
42555 | audiostock_42555.wav | A bossa nova tune | bossa nova
42556 | audiostock_42556.wav | Heartwarming, comical easy listening | comical, cute, warm, heartwarming, easy listening
42557 | audiostock_42557.wav | A song with an odd feel | odd time signature
From the BGM audio data (WAV), I built an automatic classification model with NNC. Due to time constraints, the model classifies songs into three classes.
To decide which classes would be suitable, I first examined the words appearing in the "one line explanation" field using "KH Coder", a tool for statistical text analysis. The top-ranked words are shown below. Among them, "rock", "pop", and "ballad" looked distinguishable even when actually listening to the BGM (they differ in tempo, tone, and so on), so songs whose descriptions contain any of these three words were used as training data. This yielded 1,468 training samples and 105 evaluation samples. Short sound sources such as sound effects (jingles) were excluded. A sketch of this dataset-construction step is shown below.
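As an illustration (the file name, column names, and label assignment below are my assumptions based on the data list above, not the exact script used), the keyword filtering and labeling can be done with pandas:

```python
import pandas as pd

# Hypothetical file/column names, following the BGM data list shown above
df = pd.read_csv('BGM data list.csv')
df = df.dropna(subset=['One line explanation'])

# Assign a class index to each song by keyword in its description
labels = {'rock': 0, 'pop': 1, 'ballad': 2}
parts = []
for keyword, label in labels.items():
    hits = df[df['One line explanation'].str.contains(keyword)].copy()
    hits['y:label'] = label
    parts.append(hits)
dataset = pd.concat(parts).drop_duplicates(subset=['Data name'])

# NNC reads a CSV of data paths and labels; point each row at the
# MFCC CSV produced in the next section (x/y header names per NNC convention)
dataset['x:mfcc'] = dataset['Data name'].str.replace('.wav', '.csv', regex=False)
dataset[['x:mfcc', 'y:label']].to_csv('train.csv', index=False)
```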
I convert each BGM WAV file into mel-frequency cepstral coefficients (MFCCs) and reduce it to a 40-dimensional vector (details omitted; see this page /34161f2facb80edd999f). Taking the mean over the time axis for each coefficient gives a (1, 40) array, which is used as the training data.
Wav_to_Mel.py

```python
import numpy as np
import librosa

file_name = 'audiostock_42554.wav'        # input WAV file (example)
output_filename = 'audiostock_42554.csv'  # output CSV file (example)

y, sr = librosa.load(file_name)
# Extract 40-dimensional MFCC features (shape: 40 x number of frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
# Average over the time axis and save as a (1, 40) row vector
S_A = np.mean(mfcc, axis=1)
np.savetxt(output_filename, S_A.reshape(1, -1), delimiter=',', fmt="%s")
```
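To preprocess the whole dataset, the same extraction can be wrapped in a loop over every WAV file (a sketch; the ./bgm/ directory layout is my assumption):

```python
import glob
import os

import numpy as np
import librosa

# Convert every WAV in the BGM folder into a (1, 40) MFCC-mean CSV
for file_name in glob.glob('./bgm/*.wav'):
    y, sr = librosa.load(file_name)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    S_A = np.mean(mfcc, axis=1)
    output_filename = os.path.splitext(file_name)[0] + '.csv'
    np.savetxt(output_filename, S_A.reshape(1, -1), delimiter=',', fmt="%s")
```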
I then trained an NNC model that classifies these vectors. CNN-based approaches seem to be the common solution, but after trying various networks and activation functions, the settings below gave the best accuracy. Given more time, I would like to experiment further. Incidentally, a big advantage of NNC is that its GUI makes this kind of trial and error (swapping layers and activation functions) very easy. You can grasp the network structure intuitively, which I think is one of its attractions compared with Google Colab.

Since the input vectors are low-dimensional, the CPU (Windows version) was sufficient for training this time, but I am also publishing the results from the cloud version trained with almost the same settings. After training for 30 epochs, the learning curve was as follows (best validation at epoch 9).

Next, I evaluated the created model on test data to measure its accuracy. For this three-class problem, accuracy was 0.8, so the features do seem to capture something. With an average precision of roughly 80% or more, the model looks useful for the task of selecting suitable BGM.
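For reference, the same accuracy check can be reproduced outside NNC from the evaluation output. A minimal sketch, assuming output_result.csv holds the true label in a column like y:label and the class probabilities in columns like y'_0, y'_1, y'_2 (the actual header names depend on your network, so check the file):

```python
import pandas as pd

# Hypothetical column names; adjust to the header NNC actually writes
df = pd.read_csv('output_result.csv')
probs = df[["y'_0", "y'_1", "y'_2"]].values
pred = probs.argmax(axis=1)
true = df['y:label'].values

print(f'Accuracy: {(pred == true).mean():.3f}')

# Per-class precision: of the songs predicted as class c, the share truly c
for c in range(3):
    mask = pred == c
    precision = (true[mask] == c).mean() if mask.any() else float('nan')
    print(f'Class {c} precision: {precision:.3f}')
```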
Using a pretrained BERT model, I select the BGM whose one-line description is closest to the conversation content: the conversation (text) is converted into a vector, and the BGM with the most similar description in terms of cosine similarity is picked. The flow is therefore to first find a fitting BGM from the text, and then select songs of the same class using the classification model built in step 3. I couldn't see a way to implement BERT in NNC, so I handled this part with Google Colab and the transformers library, which I already know (personally, NNC's strengths seem concentrated in the image domain, so I would be happy to see its natural-language support strengthened next).
Conversation_to_BGM.py

```python
import pandas as pd
import numpy as np
import torch
import transformers
from transformers import BertJapaneseTokenizer
from sklearn.neighbors import NearestNeighbors
from tqdm import tqdm

tqdm.pandas()


class BertSequenceVectorizer:
    def __init__(self):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(self.model_name)
        self.bert_model = transformers.BertModel.from_pretrained(self.model_name)
        self.bert_model = self.bert_model.to(self.device)
        self.max_len = 128

    def vectorize(self, sentence: str) -> np.array:
        inp = self.tokenizer.encode(sentence)
        len_inp = len(inp)

        # Pad or truncate to max_len and build the attention mask
        if len_inp >= self.max_len:
            inputs = inp[:self.max_len]
            masks = [1] * self.max_len
        else:
            inputs = inp + [0] * (self.max_len - len_inp)
            masks = [1] * len_inp + [0] * (self.max_len - len_inp)

        inputs_tensor = torch.tensor([inputs], dtype=torch.long).to(self.device)
        masks_tensor = torch.tensor([masks], dtype=torch.long).to(self.device)

        # return_dict=False keeps the tuple output on transformers v4+
        seq_out, pooled_out = self.bert_model(inputs_tensor, masks_tensor,
                                              return_dict=False)
        # Use the embedding of the [CLS] token as the sentence vector
        if torch.cuda.is_available():
            return seq_out[0][0].cpu().detach().numpy()
        else:
            return seq_out[0][0].detach().numpy()


if __name__ == '__main__':
    # Read the original data
    df_org = pd.read_csv('./drive/NNC/BGM data list.csv')

    # Narrow down to the songs used as training data
    df_org = df_org.dropna(subset=["One line explanation"])
    df_org = df_org[~df_org['One line explanation'].str.contains("Jingle")]
    df_org = df_org[~df_org['tag'].str.contains("Jingle")]
    df_org = df_org.head(5000)

    word = ["rock", "pop", "ballad"]
    df = df_org.iloc[0:0]
    for w in word:
        df_detect = df_org[df_org["One line explanation"].str.contains(w)]
        df = pd.concat([df, df_detect])
    df = df.reset_index(drop=True)

    BSV = BertSequenceVectorizer()

    # Compute a feature vector for each one-line description
    df['text_feature'] = df['One line explanation'].progress_apply(lambda x: BSV.vectorize(x))

    # Index the description vectors for nearest-neighbor search
    nn = NearestNeighbors(metric='cosine')
    nn.fit(df["text_feature"].values.tolist())

    # Vectorize the input conversation and find the closest description
    vec = BSV.vectorize("Good morning. It's nice weather today. Yeah. It looks like it's sunny all day.")
    dists, result = nn.kneighbors([vec], n_neighbors=1)
    r = result[0]
    print(df["Data name"][r], df["One line explanation"][r])
```
###Output result
188    audiostock_45838.wav
Name: Data name, dtype: object
188    Busy but fun pop/rock
Name: One line explanation, dtype: object
Now let's see what kind of songs are selected when the expected conversational sentences are given as input. The final pool consists of 300 songs that were not used for training or evaluation and whose one-line descriptions contain none of the words "rock", "pop", and "ballad". The overall flow is as shown in the figure. For the final selection, songs are played in descending order of predicted probability, read from the "output_result.csv" file output by NNC (NNC lets you specify different data for validation during training and for the final evaluation). A sketch of this last step follows; after that, let's try selecting songs for several cases.
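As a sketch of the final selection (reusing the hypothetical column names from the accuracy sketch above): take the class found via the BERT description similarity, then rank the candidate songs by that class's predicted probability:

```python
import pandas as pd

# Hypothetical column names, as in the accuracy sketch above
results = pd.read_csv('output_result.csv')
target_label = 0  # e.g. the "rock" class chosen via description similarity

# Sort candidate songs by the predicted probability of the target class
prob_col = f"y'_{target_label}"
playlist = results.sort_values(prob_col, ascending=False)
print(playlist[['x:mfcc', prob_col]].head(5))  # the five songs to play first
```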
**Case 1)**
**◆ Conversation:** Good morning. It's nice weather today. Yeah. It looks like it's sunny all day.
**◆ Most similar one-line description:** Busy but fun pop/rock (audiostock_45838.wav) → Label "rock"
Morning-appropriate or not, it managed to pick songs that sound energetic, described with words like "powerful" and "lively"! Metal-flavored tracks featuring electric guitar seem to be favored.
**Case 2)**
**◆ Conversation:** We're planning to camp in Yamanashi this weekend. We can spend some quiet time by the lake for the first time in a while. It's getting cooler, so take care.
**◆ Most similar one-line description:** A pop song that suddenly recalls parents' love from days long past (audiostock_45997.wav) → Label "pop"
**◆ Song selection result:**
・audiostock_45254 Pure Japanese music with a spine-chilling ghost story
・audiostock_44771 Horror-documentary-style BGM
・audiostock_46760 Travel program, nostalgic, melancholy, lonely twilight
・audiostock_46657 Refreshing drive, light and forward-looking
・audiostock_44331 Heartwarming music from the tropical Caribbean
The first and second songs are obviously bad picks (a ghost story...), but the fourth and fifth are pop BGM that suit a trip perfectly. The third song's description sounds sad, but it carries "pop" in its tags and, actually listening to it, it is not that dark. So the model does tend to automatically select pop songs.
**Case 3)**
**◆ Conversation:** I heard that drama is really moving, did you see it? Such a sad, sorrowful story. The ending made me cry.
**◆ Most similar one-line description:** A heartwarming ballad of teenage feelings (audiostock_43810.wav) → Label "ballad"
**◆ Song selection result:**
・audiostock_46013 A fresh, mysterious, spacious ambient piece
・audiostock_44891 Relaxing ambient of a starry night
・audiostock_44575 A gentle ambient-style sound that opens up a fairy-tale world
・audiostock_45599 A mysterious ambient piece with a cool morning atmosphere
・audiostock_45452 A graceful classical piece with the artistic elegance of a garden
It successfully extracted quiet BGM such as mysterious, laid-back ballads and classical pieces.
In all three classes the selection was made automatically from the BGM features alone, yet the extracted songs were almost exactly as intended! Looking at the BGM that was not selected (i.e. had low predicted probability), we find "Variety program title BGM" (audiostock_43840), "Latin-flavored Euro-house style" (audiostock_42921), "Multinational, African, mysterious travelogue, fashionable" (audiostock_46146), and so on, confirming that the model can also tell unsuitable BGM apart.
For the task of creating a player that automatically selects BGM according to the content of everyday conversation, I was able to build a working pipeline. This time I could only train on a small dataset of about 1,600 songs in total, but with more careful annotation and more data, further accuracy improvements can be expected, and more than three classes should be feasible. How best to compute BGM features also seems to leave plenty of room for study. The service proposal assumed smart speakers, but it need not stop there: the same approach could suggest songs for tags and text posted on SNS, or automatically select BGM from subtitle data during video editing.