Most of the transcription-API comparison articles you can find at a glance judge the APIs as good or bad based on very short clips (a few lines, or a few minutes at most), or they use overly clean sources such as news videos. The blog post about Amazon Transcribe that went viral also talks about high-accuracy transcription of English. It is well known in natural language processing that accuracy for English is high, but I wanted to know how things stand for Japanese.
What I wanted to know was how far the APIs alone could cope with audio that has the following characteristics (that is, whether it could be transcribed at all), so I ran a series of experiments:

- a Japanese sound source
- a somewhat noisy source recorded by amateurs, like a podcast
- a long source of about 1 hour
- a source in which multiple speakers talk over each other
In the first article, I summarized how to use the Google Cloud Speech API and formed the hypothesis that its transcription accuracy is low. - Voice transcription procedure using Python and the Google Cloud Speech API
In the second article, I experimented with preprocessing methods to improve transcription accuracy with the Google Cloud Speech API. - Survey on the relationship between speech preprocessing and transcription accuracy in the Google Cloud Speech API
This time, I would like to pin down the **limits** of transcription APIs by pitting the best Google Speech API result obtained last time against Amazon Transcribe, which has recently become a hot topic.
If you only want the results, just read the "Google Cloud Speech API vs. Amazon Transcribe result summary" section at the bottom.
**\*Note: the conclusions drawn here reflect only the audio data and preprocessing used this time. Please understand that this result does not settle the performance of the APIs in general.**
Amazon's automatic transcription API, Amazon Transcribe, has been around for a while, but at the end of November 2019 it gained Japanese support.
It is very easy to use compared to the Google Speech API, so I will skip the setup here; see the official tutorial, Classmethod's blog (/cloud/aws/amazontranscribe-japanese/), and so on.
Needless to say, the transcription here is done on Japanese audio.
The file formats Amazon Transcribe can handle are mp3, mp4, wav, and flac. This is a nice point, because the Google Speech API would not accept common formats such as mp3 and wav. The audio sampling rate is also detected automatically, so there is no need to specify it manually as with Google. Convenient.
In addition to the required parameters, Amazon Transcribe also accepts several optional ones. To summarize briefly: the sound source used this time has two speakers, so the speaker-identification setting would be 2, but 2 appears to be the default, so I did not set it explicitly and ran everything with the defaults (no optional parameters specified).
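As a reference, here is a minimal sketch of how such a job could be submitted and polled with boto3. The region, S3 URI, and job name are placeholders, and speaker identification is shown only as a commented-out option because this experiment ran with the defaults.

```python
import time

import boto3  # AWS SDK for Python

transcribe = boto3.client("transcribe", region_name="ap-northeast-1")  # placeholder region

transcribe.start_transcription_job(
    TranscriptionJobName="podcast-001",                 # placeholder job name
    LanguageCode="ja-JP",                               # Japanese, supported since Nov 2019
    Media={"MediaFileUri": "s3://my-bucket/001.wav"},   # placeholder S3 URI
    # MediaFormat and the sampling rate are detected automatically, so they are omitted.
    # Speaker identification is optional and was left at the default in this experiment:
    # Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
)

# Poll until the job finishes (a 1-hour file took roughly 10 minutes).
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="podcast-001")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```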
Processing a 1-hour sound source took about 10 minutes. The Google Speech API took a little over 15 minutes, so Amazon Transcribe is faster.
As last time, I use the No. 1 to No. 8 sound sources (flac files) created by combining various preprocessing parameters. The sound source data is available here, so feel free to use it.
The jobs target the same files and run with the defaults, without specifying any optional parameters on the Amazon Transcribe side, so the results should be comparable on an equal footing with those of the Google Speech API.
Amazon Transcribe outputs its transcription result as JSON. The numbers of characters and words were counted in the same way as last time. (The actual processing code is here.)
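The author's actual counting code is the one linked above; purely as an illustration, the following sketch pulls the transcript text out of the Transcribe JSON and counts characters and nouns. It assumes the janome tokenizer (`pip install janome`) and a locally downloaded output file with a placeholder name.

```python
import json

from janome.tokenizer import Tokenizer  # pure-Python Japanese morphological analyzer

# Placeholder file name: the JSON downloaded from the job's TranscriptFileUri.
with open("asrOutput.json", encoding="utf-8") as f:
    result = json.load(f)

# The full transcript lives under results.transcripts[0].transcript.
text = result["results"]["transcripts"][0]["transcript"]
print("characters:", len(text))

# Count noun words, with and without duplicates.
tokenizer = Tokenizer()
nouns = [token.surface for token in tokenizer.tokenize(text)
         if token.part_of_speech.startswith("名詞")]
print("noun words (with duplicates):", len(nouns))
print("noun words (without duplicates):", len(set(nouns)))
```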
For an apples-to-apples comparison, the evaluation method is the same as before.
No. | File name | Noise reduction | Volume adjustment | Sample rate | Transcribed characters | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
---|---|---|---|---|---|---|---|---|---|
1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 19320 | 10469 | 3150 | 1702 | 1057 |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |
3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 19278 | 10429 | 3166 | 1706 | 1059 |
4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 19322 | 10453 | 3170 | 1706 | 1058 |
5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 19660 | 10664 | 3209 | 1713 | 1054 |
6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 19653 | 10676 | 3211 | 1701 | 1052 |
7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 19639 | 10653 | 3209 | 1702 | 1052 |
8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 19620 | 10638 | 3213 | 1702 | 1047 |
The figure is below.
The results were almost identical across the samples. What can be said from the overall result is:
- Unlike with the Google Speech API, Amazon Transcribe **is not affected by audio preprocessing** (* the preprocessing performed here was noise reduction with Audacity and volume adjustment with Levelator; the results may differ under other conditions).
Here, the No. 2 result, which has the highest number of **noun words without duplicates** (though the margin is practically within noise), is taken as the representative best result for Amazon Transcribe.
Since preprocessing may make no difference at all, I also ran Amazon Transcribe on the **raw recorded wav file (No. 0)** and obtained the following result.
The only difference between this completely unprocessed wav file and the No. 2 file is **stereo vs. monaural**: for No. 2, the stereo-to-monaural conversion was done at the same time as the wav-to-flac conversion, which was necessary because the Google Speech API only accepts monaural files.
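For reference, a minimal sketch of that stereo-to-monaural, wav-to-flac conversion using pydub (which requires ffmpeg). The noise reduction and volume adjustment themselves were done separately in Audacity and Levelator, and the file names here are placeholders.

```python
from pydub import AudioSegment  # pip install pydub; requires ffmpeg

audio = AudioSegment.from_wav("001.wav")   # the raw stereo recording
audio = audio.set_channels(1)              # stereo -> monaural (required by the Google Speech API)
audio = audio.set_frame_rate(44100)        # 44k variant; use 16000 for the 16k files
audio.export("001_mono_44k.flac", format="flac")
```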
No. | File name | Noise reduction | Volume adjustment | Sample rate | Transcribed characters | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
---|---|---|---|---|---|---|---|---|---|
0 | 001.wav | False | False | 44k | 19620 | 10637 | 3212 | 1701 | 1046 |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |
Strictly speaking, No. 2 scores slightly higher on "total words without duplicates" and "noun words without duplicates", but the difference is negligible. If you can get essentially the same accuracy without preprocessing or stereo-to-monaural conversion, simply feeding in the raw wav file, which needs no preparation at all, is the best option.
Since the Amazon Transcribe results for No. 1 through No. 8 were nearly identical, I will skip the qualitative comparison among them and instead compare the "best result of the Google Cloud Speech API" against the "best result of Amazon Transcribe".
Google Cloud Speech API vs. Amazon Transcribe
Here I compare the Google Cloud Speech API values (best result, No. 8) confirmed last time with the Amazon Transcribe values (best result, No. 2) confirmed this time. The Google figures are taken from the previous results.
On the other hand, the total number of words and the number of nouns excluding duplicates are almost identical, almost as if it had been arranged in advance...
As last time, I will roughly compare what the transcribed text actually looks like.
The images place the opening portion of each transcription side by side for easy comparison, Google on the left and Amazon on the right. It is hard to judge, but I still feel Google's transcription reads slightly better; frankly, though, it is a contest between two mediocre results. (* Only the Google result is shown with line breaks here, but both Google and Amazon originally transcribe with line breaks, and their placement is questionable, so the author removed the line breaks from both in post-processing.)
Let's compare "noun words without duplication" and "the number of counts" on Google and Amazon respectively. Let's try to display the words that have appeared 11 times or more.
Amazon appears to recognize more words. However, since most of the words overlap between Google and Amazon, this also shows that the transcription performance of the two is not very different. It is also nice that Amazon's result picks up our company name, "BrainPad".
If you want to recognize as many words as possible (at least for this audio data), Amazon seems to be the better choice. (You should still check whether the recognized words are meaningful.)
While we are at it, here are word clouds of the nouns, visualizing the counts above: Google on the left, Amazon on the right.
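As a sketch of how such a word cloud can be drawn with the wordcloud package (the noun frequencies and the Japanese font path are placeholders; without a Japanese-capable font the characters will not render):

```python
from wordcloud import WordCloud  # pip install wordcloud

# Placeholder frequencies: in practice, feed in the noun counts computed above.
noun_counts = {"データ": 30, "分析": 25, "音声": 18, "機械": 12}

wc = WordCloud(
    font_path="/usr/share/fonts/ipaexg.ttf",  # placeholder path to a Japanese font
    background_color="white",
    width=800,
    height=400,
)
wc.generate_from_frequencies(noun_counts)
wc.to_file("wordcloud_amazon.png")
```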
To sum up the Google Cloud Speech API vs. Amazon Transcribe comparison:
- For both Google and Amazon, Japanese transcription (* at least for this audio data, using only the API with simple preprocessing) **does not look practical**.
- **The results are almost identical when compared by the number of words transcribed.**
- Amazon Transcribe reached the same accuracy as Google without any preprocessing, so **Amazon Transcribe wins on convenience**.
- If you operate the console in the browser and transcribe through the GUI rather than installing an SDK and calling the API from the CLI (which is what almost all non-engineers would do), **Amazon Transcribe wins on ease of use, no question**. Frankly, the Google API is too difficult for non-engineers. (Incidentally, transcription via voice input in Google Docs seems to have become popular among non-engineers recently.)
- **Processing time is somewhat faster with Amazon Transcribe**: for a 1-hour file, Google takes a little over 15 minutes, while Amazon takes about 10 minutes.
Personally, as far as Japanese transcription goes, **both are far from a practical level of accuracy**, so my impression is that the transcription APIs can only be used for **word extraction**. (And even if you can only extract words, there is not much use for that...)
And if they are to be used only for word extraction, my personal conclusion is that **Amazon Transcribe is the better choice**, since it works without preprocessing, is easy to use from the GUI, and processes files faster.
I have not given up on the possibility that accuracy would improve with clearer audio captured on better recording equipment (that is, higher-quality input), but my recording setup (an external microphone costing around 16,000 yen) is already about as much as an ordinary user would prepare, so my feeling is that "quick, cheap Japanese transcription via an API" is not possible with today's technology. Japanese transcription will not be solved overnight, it seems.
This ends on a somewhat inconclusive note, so if you know any tricks along the lines of "do this and the speech will be recognized and transcribed properly", please leave a comment!