This article was posted as the day-22 entry of the Fujitsu Systems Web Technology Advent Calendar 2019. Its content is my personal opinion, and responsibility for it lies with the author; it does not represent the organization I belong to. (The usual disclaimer.)
I play the violin as a hobby, but the violin is a difficult instrument. Thanks to Shizuka-chan, it is widely known that it is hard to play, but there are difficulties beyond the playing itself.
There is something I always have to do before I can play a piece. It is called bowing: the task of going through the score and deciding, note by note, whether each note is played with an up-bow or a down-bow.
- V → up-bow (from bottom to top)
- П → down-bow (from top to bottom)
I write the bowing into the score in consideration of the phrasing of the piece and the ease of playing, but this is quite painstaking work. There is no single correct bowing; it is also where a player's sensibility shows. Moreover, when playing in a quartet or an orchestra, it may have to be coordinated with the other instruments such as viola and cello. It is no exaggeration to say that an automatic bowing-marking machine is the dream of every string player.
So I thought: even if I can't go as far as an automatic bowing-marking machine, couldn't I at least save time by reading the bowing off other people's performance videos and using it as a reference?
The strategy is this.
- Using an open-source library that can estimate poses, record the coordinates of the right wrist (the hand holding the bow) frame by frame from the performance video.
- Overlay the trajectory of those right-wrist coordinates on the score.
With this, I should be able to read the bowing from a performance video perfectly! The library I used is tf-pose-estimation, an open-source implementation of OpenPose on TensorFlow.
OpenPose is a technique announced by CMU (Carnegie Mellon University) at CVPR 2017, an international conference on computer vision, that detects keypoints on the human body and estimates the relationships between them. With OpenPose you can obtain the coordinates of feature points on the human body, such as joint positions, as shown below.
https://github.com/CMU-Perceptual-Computing-Lab/openpose
tf-pose-estimation implements the same neural network as OpenPose on TensorFlow. I decided to use it this time because errno-mmd's extended fork of tf-pose-estimation looked useful: it has an option to output the estimated two-dimensional joint positions to a JSON file, which was exactly what I wanted.
I had no luck finding a freely usable violin performance video, so I used a recording of my own playing for the analysis. The piece is the first 10 bars of the first movement of Mozart's Eine kleine Nachtmusik, played at a constant tempo with a metronome.
By default, tf-pose-estimation marks various feature points such as eyes, shoulders, elbows, wrists, and ankles, and connects them with lines. This time, I modified the code a little to mark only the right wrist.
tf-pose-estimation/tf_pose/estimator.py

```python
# Draw a circle only on the right wrist
if i == CocoPart.RWrist.value:
    cv2.circle(npimg, center, 8, common.CocoColors[i], thickness=3, lineType=8, shift=0)

# ... (a few lines below, where the connecting lines are drawn) ...

# Disable the line drawing that connects feature points
# cv2.line(npimg, centers[pair[0]], centers[pair[1]], common.CocoColors[pair_order], 3)
```
I used Google Colaboratory as the execution environment. After running the setup commands according to the README.md of https://github.com/errno-mmd/tf-pose-estimation, execute the following command.
```
%run -i run_video.py --video "/content/drive/My Drive/violin_playing/EineKleineNachtmusik_20191226.mp4" --model mobilenet_v2_large --write_json "/content/drive/My Drive/violin_playing/json" --no_display --number_people_max 1 --write_video "/content/drive/My Drive/violin_playing/EineKleine_keypoints_20191226.mp4"
```
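Note that the `/content/drive/...` paths only resolve after Google Drive has been mounted in the Colab runtime, using Colab's standard mount API:

```python
# Mount Google Drive so the input video and the output directories
# are visible under /content/drive.
from google.colab import drive
drive.mount('/content/drive')
```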
Using the performance video uploaded to Google Drive beforehand as input, this outputs a video with the right-wrist joint point drawn on it and, for each frame, a JSON file containing the coordinates of the joint points.
Here is the output video, sampled at intervals of a few seconds and made into a GIF.
As far as I can see, it traces the right-wrist joint point with reasonable accuracy.
A JSON file containing the coordinates of the feature points is output for every frame (1/60 of a second). Here is the JSON file for the 10th frame.
000000000010_keypoints.json
```json
{
  "version": 1.2,
  "people": [
    {
      "pose_keypoints_2d": [
        265.5925925925926, 113.04347826086956, 0.7988795638084412,
        244.55555555555557, 147.82608695652175, 0.762155294418335,
        197.22222222222223, 149.56521739130434, 0.6929810643196106,
        165.66666666666669, 189.56521739130434, 0.7044630646705627,
        220.88888888888889, 166.95652173913044, 0.690696656703949,
        289.2592592592593, 146.08695652173913, 0.5453883409500122,
        299.77777777777777, 212.17391304347825, 0.6319900751113892,
        339.22222222222223, 177.3913043478261, 0.6045356392860413,
        213, 253.91304347826087, 0.23064623773097992,
        0, 0, 0,
        0, 0, 0,
        268.22222222222223, 276.52173913043475, 0.2685505151748657,
        0, 0, 0,
        0, 0, 0,
        257.7037037037037, 106.08695652173913, 0.8110038042068481,
        270.85185185185185, 107.82608695652173, 0.7383710741996765,
        231.4074074074074, 107.82608695652173, 0.7740614414215088,
        0, 0, 0
      ],
      "face_keypoints_2d": [],
      "hand_left_keypoints_2d": [],
      "hand_right_keypoints_2d": [],
      "pose_keypoints_3d": [],
      "face_keypoints_3d": [],
      "hand_left_keypoints_3d": [],
      "hand_right_keypoints_3d": []
    }
  ]
}
```
Reading the code, it seems that the 13th element (counting from zero) of pose_keypoints_2d is the y coordinate of the right wrist. In the example of the 10th frame above, that is 166.95652173913044.
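The index 13 follows from the layout of pose_keypoints_2d: it is a flat list of (x, y, confidence) triplets in COCO keypoint order, and the right wrist is keypoint 4 (the same `CocoPart.RWrist.value` used in the estimator.py patch above). A quick check of the arithmetic:

```python
# COCO keypoint order: 0=Nose, 1=Neck, 2=RShoulder, 3=RElbow, 4=RWrist, ...
RWRIST = 4                # CocoPart.RWrist.value
y_index = RWRIST * 3 + 1  # each keypoint occupies 3 slots: x, y, confidence
print(y_index)            # -> 13
```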
Let's graph it with matplotlib. First, collect the target values from the JSON files output for each frame.
```python
import glob
import json

# Collect the right-wrist y coordinate from every per-frame JSON file.
# glob does not guarantee any ordering, so sort to keep the frames in order.
files = sorted(glob.glob('/content/drive/My Drive/violin_playing/json/*'))
x = list(range(len(files)))
y = []
for file in files:
    with open(file) as f:
        data = json.load(f)
    # Element 13 of pose_keypoints_2d = y coordinate of the right wrist
    y.append(data['people'][0]['pose_keypoints_2d'][13])
```
The variable x holds the frame numbers, and the variable y holds the y coordinate of the right-wrist joint point in each frame.
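One caveat worth noting: if pose estimation fails on a frame, that frame's JSON can contain an empty `people` list, and the loop above would raise an `IndexError`. A minimal defensive variant (the carry-forward fallback is my own assumption, not part of the original script):

```python
y = []
for file in files:
    with open(file) as f:
        data = json.load(f)
    if data['people']:
        y.append(data['people'][0]['pose_keypoints_2d'][13])
    else:
        # No person detected in this frame: carry the previous value
        # forward (or 0 for a leading frame) to keep x and y aligned.
        y.append(y[-1] if y else 0)
```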
Then graph it with matplotlib. I set the x-axis ticks 30 frames apart: I played the piece with the metronome at quarter note = 120, so 30 frames correspond exactly to one quarter-note beat, which makes it easier to line the graph up with the score.
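As a quick sanity check of that 30-frame figure (assuming the 60 fps frame rate noted above):

```python
fps = 60                          # video frame rate (1 frame = 1/60 s)
bpm = 120                         # metronome tempo: quarter note = 120
frames_per_beat = fps * 60 / bpm  # 60 frames/s * 0.5 s/beat
print(frames_per_beat)            # -> 30.0
```

Each x-axis grid line therefore falls exactly on a beat. The plotting code itself: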
```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

fig = plt.figure(figsize=(120, 10))
ax = fig.add_subplot(1, 1, 1)
# Put a major tick every 30 frames = one quarter-note beat
ax.xaxis.set_major_locator(ticker.MultipleLocator(30))
ax.plot(x, y, "red", linestyle='solid')
plt.show()
```
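One small readability tweak to consider: the y values are image pixel coordinates, which grow downward, so a wrist moving down on screen shows up as a rising line. If you prefer the graph to match the physical motion, matplotlib can flip the axis (a standard matplotlib call, not part of the original code):

```python
ax.invert_yaxis()  # make "wrist moves up on screen" read as "line goes up"
```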
Here is the output graph.
Here is the result of superimposing it on the score. The overlay was done by hand.
Where there's a down-bow the line should be going down, and where there's an up-bow it should be going up, but... well... I'd say it's mostly right (^_^;)
To be honest, it doesn't quite reach a practical level as it is. Identifying the bowing of fast passages such as 16th notes seems difficult with this method. For a more relaxed piece, it might be helpful to some extent.
However, the following issues remain in matching the graph with the musical score:

- The width of one bar is not the same everywhere on the score
- Similarly, the beats are not evenly spaced
- The tempo changes during the piece (Adagio → Allegro, etc.)
- Even where the tempo marking is the same, the tempo fluctuates slightly with the phrasing
If the matching with the score cannot be automated, marking the bowing will not become any more efficient, so the issues above have to be overcome. Recognizing the pitch and mapping it onto the score may be one solution, though it looks difficult (^_^;)
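As a very rough sketch of what that pitch recognition could look like, here is how the fundamental frequency of the recording might be extracted with librosa's pYIN implementation. This is only an illustration under my own assumptions (the file name is hypothetical, and the audio would first have to be extracted from the video); aligning the resulting pitch track with the score is the genuinely hard part.

```python
import librosa

# Load the audio extracted from the performance video (hypothetical file).
y, sr = librosa.load('EineKleine_audio.wav')

# Estimate the fundamental frequency frame by frame with pYIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz('G3'),  # the violin's lowest open string
    fmax=librosa.note_to_hz('E7'),  # comfortably above the passage played
)
# f0[i] is the estimated pitch in Hz for frame i (NaN where unvoiced).
# The note boundaries in f0 could then be matched to the score's notes.
```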
This time I focused on making the task of writing in bowings more efficient, but another interesting approach would be to visualize the difference in bow handling between experienced players and beginners and draw insights from that. Pose estimation from video struck me as a wonderful technology with many potential applications.