-- The audio used here is a recording of my research meeting (a graduate school seminar) made with AirPods. Three to five people took part. The audio contains personal information, so it cannot be released.
  -- Data volume: 300 utterances (about 27 minutes).
  -- The audio quality is poor compared with speech recognition corpora such as CSJ: it contains a lot of everyday background sound and noise.
-- Published papers report very good recognition accuracy for Google and others (single-digit WER, even for Japanese).
  -- Those figures are good because they are measured on clean audio prepared for research.
  -- *** There are few reports on recognition accuracy for speech recorded in daily life. ***
-- So this time I investigated *** how accurately speech recorded in daily life can be recognized ***. Because the speech is about research, it also contains many technical terms, and I was curious how well those are handled.
-- For a summary of how to use each API, see the article "Using voice recognition services of Amazon, Google, IBM, Microsoft".
-- In addition to Amazon, Google, IBM, and Microsoft, the recognition accuracy of Kaldi (trained on CSJ, JNS, S-JNAS, and CEJC) is also listed. WER is the word error rate and CER is the character error rate; lower is better.
GCP
WER: 0.3344722854973424
CER: 0.2765527007889945
AWS
WER: 0.36209150326797385
CER: 0.2218905472636816
Azure
WER: 0.28109824430332464
CER: 0.21596337579617833
Watson
WER: 0.4107744107744108
CER: 0.29126794258373206
Kaldi
WER: 0.616504854368932
CER: 0.47915630285543725
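For readers unfamiliar with these metrics, WER and CER are both an edit distance (substitutions, deletions, and insertions) divided by the length of the reference. The following is a minimal sketch in Python, not the script used to produce the numbers above. The function names are hypothetical, and it assumes the text has already been split into words; for Japanese that requires a tokenizer, and the choice of tokenizer affects WER. How the figures above were aggregated over the 300 utterances is not stated, so this sketch scores a single utterance.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref_words: Sequence[str], hyp_words: Sequence[str]) -> float:
    """Word error rate; the inputs are lists of already-tokenized words."""
    return error_rate(ref_words, hyp_words)


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate; whitespace is dropped before comparison."""
    ref_chars = [c for c in ref_text if not c.isspace()]
    hyp_chars = [c for c in hyp_text if not c.isspace()]
    return error_rate(ref_chars, hyp_chars)


if __name__ == "__main__":
    ref = "the striking sound is expressed".split()
    hyp = "the striking sound is represented".split()
    print(wer(ref, hyp))  # one substitution over five reference words
```

Because insertions are counted as errors, both rates can exceed 1.0 when the recognizer outputs far more text than the reference contains.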
-- Microsoft turned out to be the most accurate. I had expected Google to come out on top, but it did not. Even so, the best WER (Microsoft's) is about 28%. With clean audio, WER reaches single digits, but accuracy drops this far when the recording is full of everyday background sound and noise. On the other hand, given how poorly Kaldi did (a WER of about 62%), I think the commercial recognizers such as Google and Microsoft do cope with noise to some extent.
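To make the comparison easier to read, the short sketch below ranks the five recognizers by the figures listed above, rounded to four decimal places. The variable names are only for illustration. It assumes the obvious mapping of labels to vendors (GCP is Google, AWS is Amazon, Azure is Microsoft, Watson is IBM).

```python
# (WER, CER) as reported above, rounded to four decimal places.
results = {
    "GCP": (0.3345, 0.2766),
    "AWS": (0.3621, 0.2219),
    "Azure": (0.2811, 0.2160),
    "Watson": (0.4108, 0.2913),
    "Kaldi": (0.6165, 0.4792),
}

# Lower is better for both metrics, so sort in ascending order.
by_wer = sorted(results, key=lambda name: results[name][0])
by_cer = sorted(results, key=lambda name: results[name][1])

print(f"{'service':<8} {'WER %':>6} {'CER %':>6}")
for name in by_wer:
    wer_value, cer_value = results[name]
    print(f"{name:<8} {wer_value * 100:>6.1f} {cer_value * 100:>6.1f}")

print("ranked by WER:", ", ".join(by_wer))
print("ranked by CER:", ", ".join(by_cer))
```

Azure is first on both metrics, while GCP and AWS swap places depending on whether WER or CER is used.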
-- For reference, here is one of the recognition results.
Correct sentence: Since it is possible to calculate the closeness, by using this, the striking sound is expressed in the distance matrix for each material, and the density is expressed in this way, so that this two-dimensional map can be used. I tried to replace it, but it's amazing to do something.
Google: The proximity can be calculated, so even if you use this, you can replace the striking sound with a distance matrix for each material and replace it with this two-dimensional map that looks like this. But it's amazing to do something
Amazon: Since it is possible to calculate the closeness, even if this is used, the striking sound is represented by a node like this in the distance matrix for each material, so this human being's Replacing it with a map is just a matter of course!I tried it, but it's amazing to do something
Microsoft: I used this because I can calculate the closeness, but I used this two-dimensional map because there was a way to express the striking sound for each material in a distance matrix with the same feeling as before. I tried to replace it with one, but it's amazing to do something.
IBM: Since it is possible to calculate the control, even if this is used, it cannot be said that the striking sound is represented by a matrix for each material on the clock, as it was like this. Replace it with the human map here. I tried to do it for the time being, but it's amazing to do something
Kaldi: Since it is possible to calculate the proximity for 5 days, it is not necessary to use this, so the hitting sound is removed for each material. 7 For the forestry rate 7 I've been passive once, especially to replace it, but it's convenient to do something.