Investigation of the relationship between speech preprocessing and transcription accuracy in the Google Cloud Speech API

Transcription accuracy is lower than I expected

As I mentioned at the end of the previous article summarizing how to use the Google Speech API, I ran into the problem that transcription accuracy was lower than I expected.

rebuild.fm reportedly gets about 80% of its speech transcribed, but in my case, [roughly half was not recognized](https://github.com/ysdyt/podcast_app/blob/master/text/google_speech_api/001.txt), at least by feel. I wasn't expecting perfection, but the result was pretty devastating, given that I had hoped the transcript would at least be readable enough to follow the conversation.

On the premise that "the Speech API is not at fault, my preprocessing is", I tried various combinations of parameters and preprocessing steps and compared the accuracy. The goal this time is to find the best preprocessing for the Google Speech API.

Audio data used for verification

I used the first episode of the podcast "Shirokane Mining.FM", which I record and distribute myself. The published version is an edited, clean file, but since I am careful about recording conditions (such as recording in a quiet room), the raw audio is clear enough that there is little difference from the edited version.

The audio is exactly 1 h long; I cut it at the 1 h mark while editing. There is no particular significance to 1 h, but I wanted to test on a long recording, and it made a convenient experiment target.

Verification items

To confirm the hypothesis raised at the end of the previous article, this time I verify three items: **presence/absence of noise reduction**, **presence/absence of volume adjustment**, and **difference in sample rate hertz**.

- **Noise reduction** ... white-noise removal performed with Audacity. The execution method is here.

- **Volume adjustment** ... automatic volume leveling performed with the Levelator. The execution method is here.

- **sample rate hertz** ... the audio sampling rate. The Speech API recommends sampling at 16 kHz, but its documentation also notes that audio originally recorded above 16 kHz should not be resampled down to 16 kHz, and should instead be sent to the Speech API at the sampling rate it was recorded at. The details and execution method are here. Since the default sampling rate of the microphone I record with is 44 kHz, I try two patterns this time: 16 kHz and 44 kHz (a resampling sketch follows this list).
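
Resampling a 44 kHz recording down to 16 kHz can be done in several ways; below is a minimal sketch using ffmpeg via Python's subprocess (the file names are placeholders, and this may differ from the method linked above):

```python
import subprocess

# Resample a 44 kHz FLAC down to 16 kHz (file names are placeholders).
subprocess.run(
    ["ffmpeg", "-i", "001_samp44k.flac", "-ar", "16000", "001_samp16k.flac"],
    check=True,
)
```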

I prepared every combination of the three items above, eight patterns in total, as shown in the table below.

| No. | File name | Noise reduction | Volume adjustment | Sample rate | File size |
|----:|-----------|:---------------:|:-----------------:|:-----------:|----------:|
| 1 | `01_001_NoiRed-true_lev-true_samp16k.flac` | True | True | 16k | 73.1 MB |
| 2 | `02_001_NoiRed-true_lev-true_samp44k.flac` | True | True | 44k | 169.8 MB |
| 3 | `03_001_NoiRed-true_lev-false_samp16k.flac` | True | False | 16k | 64.7 MB |
| 4 | `04_001_NoiRed-true_lev-false_samp44k.flac` | True | False | 44k | 147.4 MB |
| 5 | `05_001_NiRed-false_lev-true_samp16k.flac` | False | True | 16k | 75.8 MB |
| 6 | `06_001_NiRed-false_lev-true_samp44k.flac` | False | True | 44k | 180.9 MB |
| 7 | `07_001_NiRed-false_lev-false_samp16k.flac` | False | False | 16k | 68.1 MB |
| 8 | `08_001_NiRed-false_lev-false_samp44k.flac` | False | False | 44k | 160.2 MB |

Looking at the file sizes, setting the sample rate to 16k sharply reduces the size, which is as expected. How the presence or absence of noise reduction and volume adjustment affects the file size was less clear.

By the way, the audio published for each episode of Shirokane Mining.FM is processed with noise reduction → True and volume adjustment → True, which corresponds to the same processing as No. 1.

Method of verification

Execution method

The Google Speech API was run following the procedure in the previous article.
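
For reference, here is a minimal sketch of the kind of call involved, using the google-cloud-speech Python client; the bucket URI is a placeholder, and the actual procedure followed is the one in the previous article:

```python
from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical GCS location of one of the eight test files.
audio = speech.RecognitionAudio(
    uri="gs://your-bucket/01_001_NoiRed-true_lev-true_samp16k.flac"
)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,  # match the file under test (16k or 44k)
    language_code="ja-JP",
)

# A ~1 h file requires the asynchronous long-running API.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

transcript = "".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```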

Evaluation method

Checking character by character whether the transcription is correct is genuinely tedious, so here I roughly and qualitatively check which parameter set gives the most accurate transcription.

However, qualitative inspection alone makes evaluation difficult, so I also use the following quasi-quantitative metrics:

- Total number of transcribed characters
- Total number of words extracted by MeCab (with duplicates)
- Number of nouns extracted by MeCab (with duplicates)
- Total number of words extracted by MeCab (without duplicates)
- Number of nouns extracted by MeCab (without duplicates)
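
The original extraction script is not shown, but these five metrics could be computed with mecab-python3 along these lines (a minimal sketch; the tokenizer setup is an assumption):

```python
import MeCab

tagger = MeCab.Tagger()

def transcript_metrics(text: str) -> dict:
    """Compute the five evaluation metrics for one transcript."""
    words, nouns = [], []
    node = tagger.parseToNode(text)
    while node:
        if node.surface:  # skip the empty BOS/EOS nodes
            words.append(node.surface)
            # The first feature field is the part of speech; 名詞 = noun.
            if node.feature.split(",")[0] == "名詞":
                nouns.append(node.surface)
        node = node.next
    return {
        "characters": len(text),
        "words_with_dup": len(words),
        "nouns_with_dup": len(nouns),
        "words_no_dup": len(set(words)),
        "nouns_no_dup": len(set(nouns)),
    }
```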

Results

Quantitative results

The values of the quantitative results are as follows.

| No. | File name | Noise reduction | Volume adjustment | Sample rate | Characters | Words (with dup) | Nouns (with dup) | Words (no dup) | Nouns (no dup) |
|----:|-----------|:---------------:|:-----------------:|:-----------:|-----------:|-----------------:|-----------------:|---------------:|---------------:|
| 1 | `01_001_NoiRed-true_lev-true_samp16k.flac` | True | True | 16k | 16849 | 9007 | 2723 | 1664 | 1034 |
| 2 | `02_001_NoiRed-true_lev-true_samp44k.flac` | True | True | 44k | 16818 | 8991 | 2697 | 1666 | 1030 |
| 3 | `03_001_NoiRed-true_lev-false_samp16k.flac` | True | False | 16k | 16537 | 8836 | 2662 | 1635 | 1026 |
| 4 | `04_001_NoiRed-true_lev-false_samp44k.flac` | True | False | 44k | 16561 | 8880 | 2651 | 1659 | 1019 |
| 5 | `05_001_NiRed-false_lev-true_samp16k.flac` | False | True | 16k | 17219 | 9191 | 2758 | 1706 | 1076 |
| 6 | `06_001_NiRed-false_lev-true_samp44k.flac` | False | True | 44k | 17065 | 9118 | 2727 | 1675 | 1055 |
| 7 | `07_001_NiRed-false_lev-false_samp16k.flac` | False | False | 16k | 16979 | 9045 | 2734 | 1679 | 1047 |
| 8 | `08_001_NiRed-false_lev-false_samp44k.flac` | False | False | 44k | 17028 | 9120 | 2727 | 1664 | 1040 |

Since the table is a little hard to read at a glance, I also made graphs.

(Figures: bar graphs of the metrics above for the eight preprocessing patterns)
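
For reference, graphs like these can be reproduced from the table above with matplotlib; a minimal sketch for one of the metrics:

```python
import matplotlib.pyplot as plt

# "Nouns without duplicates" per pattern, taken from the table above.
labels = [f"No.{i}" for i in range(1, 9)]
nouns_no_dup = [1034, 1030, 1026, 1019, 1076, 1055, 1047, 1040]

plt.bar(labels, nouns_no_dup)
plt.ylabel("Nouns without duplicates")
plt.title("Nouns without duplicates per preprocessing pattern")
plt.tight_layout()
plt.show()
```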

- Considering all metrics, **the best result was No. 5** (noise reduction off, volume adjustment on, 16 kHz sampling)
- **The worst were No. 3 and No. 4** (both about equally bad)

What can be said from the quantitative results:

- Noise reduction is better **off (False)**
- Volume adjustment is better **on (True)**
- The presence or absence of noise reduction and volume adjustment affects the results more than the sampling rate does, so 16 kHz vs. 44 kHz makes almost no difference; strictly speaking, though, 16 kHz always scores slightly better on "nouns without duplicates", so **16 kHz seems better**

Qualitative results

Let's qualitatively check the transcription results of the best pattern, No. 5, and the worst, No. 3 (No. 4 would have done just as well; I picked No. 3 for now).

Transcription result

The images below show the same portion of the full transcription side by side for easy comparison: No. 5 on the left, No. 3 on the right.

(Figure: side-by-side excerpt of the No. 5 and No. 3 transcriptions)

Hmm, it's hard to tell them apart.

Frequent nouns

Since eyeballing the text doesn't settle it, let's compare the "nouns without duplicates" and their counts output for No. 5 and No. 3, displaying the words that appear 11 or more times.
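
A minimal sketch of that filtering step, assuming the duplicated noun list from the MeCab sketch above:

```python
from collections import Counter

def frequent_nouns(nouns: list[str], min_count: int = 11) -> list[tuple[str, int]]:
    """Return nouns appearing at least min_count times, most frequent first."""
    counts = Counter(nouns)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]
```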

(Figure: nouns appearing 11+ times in No. 5 and No. 3, with their counts)

Frequently occurring words look almost the same.

Word cloud

This doesn't add much information, but let's also generate word clouds and eyeball them. No. 5 is on the left, No. 3 on the right.

By the way, "Shochu" is a transcription mistake of "resident".

(Figure: word clouds for No. 5 (left) and No. 3 (right))
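
For reference, a word cloud like this can be generated with the wordcloud package. A minimal sketch, using a placeholder noun list; the font path is an assumption and must point to a Japanese font installed on your machine:

```python
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Placeholder noun list; in practice this would be the duplicated
# noun list extracted from the transcript with MeCab.
nouns = ["機械", "学習", "機械", "データ", "学習", "機械"]

wc = WordCloud(
    font_path="/usr/share/fonts/opentype/ipafont-gothic/ipag.ttf",  # assumed path
    background_color="white",
    width=800,
    height=400,
).generate_from_frequencies(Counter(nouns))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```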

Summary

Of the three items verified, the combination that gave the best quantitative results was:

- Noise reduction → off
- Volume adjustment → on

As the official API documentation states, it seems better to skip noise reduction. Volume adjustment, on the other hand, helps, so audio with clear (not too quiet) volume appears to suit the API. Finally, although the documentation says audio recorded at 16 kHz or above should not be resampled, even a 44 kHz recording seems to do slightly better when resampled to 16 kHz before being sent to the API (though this does not change the overall picture).

Qualitatively comparing the transcript from the best combination (No. 5) with the one from the worst (No. 3), there was almost no difference in the frequent words that were transcribed successfully, so the parameter differences do not make a big difference in the transcription content itself. For rarely occurring words there may be cases that only one of the two gets right, but I did not check that, as it is outside the scope of this verification (and I don't have the energy to check in that much detail).

The mystery of "about 80% of the transcription is possible with rebuild.fm" deepens, but I think that the transcription accuracy of the Google Speech API is the limit for my recordable sound source quality. The road to automatic transcription is still steep.

Future Work

Next, I would like to pit the best Google Speech API transcription obtained this time against Amazon Transcribe.

Many of the "compared" articles I've seen say good / bad by transcribing a few lines (or minutes). Or, there are many things that are done for "too clear sound sources" such as news videos.

A widely shared blog post about Transcribe also reports high-accuracy transcription, but in English. It is well known in natural language processing that accuracy is high for English; the question is whether it holds for Japanese.

What I want to know is how far the API alone can go with Japanese audio: noisy recordings made by amateurs (like podcasts), long recordings of about 1 h, and audio where multiple speakers talk over each other. That is what I want to verify.
