I wanted to transcribe YouTube videos into text. With GCP's Cloud Speech-to-Text, you can transcribe even long videos.

Of course, this should work with any other audio file, not just audio from YouTube.

What I used

Google Cloud Speech-to-Text
ffmpeg

Caution

This is a paid service. Once you exceed the 60-minute free tier, you are charged for audio processing in 15-second increments. Please refer to this page for the price.
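As a rough illustration of the billing rule above, the sketch below rounds each audio duration up to the next 15-second increment and subtracts the 60-minute free tier. The function names are hypothetical, and I am assuming that rounding is applied per file; no price is hard-coded because the actual rate is on the pricing page.

```python
import math

INCREMENT_SECONDS = 15        # billing granularity mentioned above
FREE_TIER_SECONDS = 60 * 60   # 60-minute free tier mentioned above


def billed_seconds(duration_seconds: float) -> int:
    """Round a duration up to the next 15-second increment."""
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def chargeable_seconds(durations, free_seconds: int = FREE_TIER_SECONDS) -> int:
    """Billed seconds that remain after the free tier is used up.

    Assumption: each file is rounded up on its own before summing.
    """
    total = sum(billed_seconds(d) for d in durations)
    return max(0, total - free_seconds)


if __name__ == "__main__":
    # A 4 min 08 s video and a 1 h 40 min video, as used in this article.
    print(billed_seconds(4 * 60 + 8))
    print(chargeable_seconds([100 * 60]))
```

Multiply the result of chargeable_seconds by the per-increment price from the pricing page to get an estimate.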

Procedure

Download only the audio of a YouTube video

I downloaded only the audio track of the video using this site. It is a somewhat suspicious site full of advertisements, but the download worked properly.

Let's call the downloaded file sample_audio.mp3.

This time I chose the video "[Explanation of the reason] The breakfast of Buffett, the world's number one investor, is a Mac hamburger." by professional blogger Ikeda Hayato.

The reasons are as follows.

- Only one speaker
- Speaking speed is not too fast
- Good recording environment (a handheld microphone is used), with little noise
- No background music or sound effects
- Not too long

Videos in which several people talk at the same time, or which have ambient noise, seem likely to be transcribed inaccurately, so let's start with this video.

Convert from mp3 to flac

As you can see on this page, Cloud Speech-to-Text does not seem to support mp3 as of November 2019. So you need to convert it to flac.

You can easily convert using ffmpeg.

ffmpeg -i sample_audio.mp3 -ar 16000 -ac 1 sample_audio.flac

-ar is the sampling frequency; I set it to 16,000 Hz, following this Quickstart example. -ac is the number of channels; an error occurred unless it was set to mono (= 1).

The command will create a file called sample_audio.flac.
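If you have several files to convert, the same ffmpeg call can be driven from Python. This is a minimal sketch, assuming ffmpeg is on your PATH; build_ffmpeg_command and convert_to_flac are hypothetical helper names, and the flags are exactly the ones used in the command above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: str, sample_rate: int = 16000, channels: int = 1) -> list:
    """Build the ffmpeg argument list for converting src to a .flac file."""
    dst = str(Path(src).with_suffix(".flac"))
    return [
        "ffmpeg",
        "-i", src,                 # input file
        "-ar", str(sample_rate),   # sampling frequency in Hz
        "-ac", str(channels),      # number of channels (1 = mono)
        dst,
    ]


def convert_to_flac(src: str) -> str:
    """Run ffmpeg and return the path of the created .flac file."""
    cmd = build_ffmpeg_command(src)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return cmd[-1]


if __name__ == "__main__":
    print(convert_to_flac("sample_audio.mp3"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.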

Upload to Cloud Storage

It seems local files can also be transcribed, but this time we will target a file on Cloud Storage. Create a bucket and, if the Cloud SDK is installed, upload the file with the following command.

gsutil cp sample_audio.flac gs://[YOUR BUCKET]

If the file is small, you can upload it from your browser.
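The transcription step needs the file's gs:// URI, so it can be handy to upload and build the URI in one place. The following is a sketch under the assumption that gsutil is installed and authenticated; gcs_uri and upload are hypothetical helper names, and the bucket name in the usage line is a placeholder.

```python
import subprocess
from pathlib import Path


def gcs_uri(bucket: str, local_path: str) -> str:
    """Return the gs:// URI the file will have after upload to the bucket root."""
    return f"gs://{bucket}/{Path(local_path).name}"


def upload(bucket: str, local_path: str) -> str:
    """Copy a local file to the bucket with gsutil and return its URI."""
    uri = gcs_uri(bucket, local_path)
    subprocess.run(["gsutil", "cp", local_path, uri], check=True)
    return uri


if __name__ == "__main__":
    # "your-bucket" is a placeholder; replace it with your own bucket name.
    print(upload("your-bucket", "sample_audio.flac"))
```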

Transcription

It's almost the same as the function in the official documentation.

`transcribe.py`


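# Written for google-cloud-speech releases before 2.0, which still
# provide the enums and types modules.
from google.cloud import speech
from google.cloud.speech import enums, types


def transcribe_gcs(gcs_uri):
    """Transcribe a FLAC file stored in Cloud Storage and return the text."""
    print(f'Processing {gcs_uri}')
    client = speech.SpeechClient()

    # Reference the audio by its gs:// URI instead of sending the bytes inline.
    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,  # must match the -ar value passed to ffmpeg
        language_code='ja-JP')

    # Asynchronous recognition, intended for long audio.
    operation = client.long_running_recognize(config, audio)
    print('Waiting for operation to complete...')
    response = operation.result()
    # Keep only the top alternative of each result and concatenate them.
    text = ''.join([result.alternatives[0].transcript for result in response.results])
    return text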

The function returns the full transcription as a single string, so save it to a text file or a CSV file as needed.
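One way to store the returned text is sketched below, using only the standard library. save_transcript is a hypothetical helper, and the file naming (reusing the audio file's base name) and the two-column CSV layout are my own assumptions.

```python
import csv
from pathlib import Path


def save_transcript(text: str, gcs_uri: str, out_dir: str = "."):
    """Write the transcript to <name>.txt and <name>.csv and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Derive the output name from the last component of the gs:// URI.
    stem = Path(gcs_uri.rsplit("/", 1)[-1]).stem
    txt_path = out / f"{stem}.txt"
    csv_path = out / f"{stem}.csv"

    # UTF-8 keeps Japanese text intact.
    txt_path.write_text(text, encoding="utf-8")

    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "transcript"])
        writer.writerow([gcs_uri, text])

    return txt_path, csv_path
```

For example, save_transcript(transcribe_gcs(uri), uri) writes both files into the current directory.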

Result

Transcription finishes quickly for a video of a few minutes. Mr. Ikeda's video above is 4 minutes 08 seconds long, and its transcription completed in about 2 minutes. The full text of the transcription follows.

Transcription result full text

Yes, today is IKEA. Today, the millionaire's pamphlet says that he will go to McDonald's. I'd like to tell you a known fact. I'm afraid that I've supported 1000 mm. 183 Since I exceeded 100 million yen on a date, I'm in a situation where I don't have to work hard. That's right, I went to conveyor belt sushi the other day, so I uploaded it to Twitter. As usual, the shit rice arrived. What kind of fucking rice is that Neko says that going to conveyor belt sushi is absolutely love and lying, even though she has 100 million yen. Today, with glasses kids, I promised something like a cool laugh. I received it from about junior high school students. I feel like I'm going to go to conveyor belt sushi. I'm going to go to conveyor belt sushi. I'm going to mess up. That's right, it's true, I'm very sorry to say that it's not so delicious, and even if my mom is doing her best, I don't feel like I'm so good at tears, but I'm full and I'm going with my children Is a very nice facility, isn't it? Why do you go to conveyor belt sushi like that? I want to go to counter sushi again. One person says that you should go with your wife who is good value for sushi for about 20,000 yen. It's very famous, but it's a situation where it's really a dream if it's difficult to get into a child, but it doesn't matter. It ’s the misfortune of the top. Do you know this person ’s breakfast? This was taken up in Buffett ’s documentary and became a hot topic all over the world, but it ’s McDonald ’s. But for whom, breakfast is Mac. It's amazing. It's really like a McDonald's in the morning. I was surprised. The left side has about 8 to 9 trillion yen, so the couple has 8 to 9 trillion yen. But McDonald's is amazing, isn't it? It's quite impactful and I feel it's interesting, isn't it? Why is this lever set 3 so I'm really rich when I'm heading for it? If not, why don't you eat it? I'm sure it's rational for him to eat this. 
I wonder if there are many people who think that cat Mac is bad for health, I thought, but that bucket was 88 years old when I checked the age It's a good breakfast, isn't it? The house is also healthy. Even if McDonald's is bad for my health, I still feel that it's okay if I can go to breakfast until I'm 88 years old. After all, eating McDonald's is probably this person. It's been a week, isn't it? Everyone, don't you think that the dojo is lost for breakfast? Mauchi also has children, so I always say the same breakfast, so I'm thinking about breakfast. I'm thinking about making bread today, but that's quite a hassle, isn't it? I think it's stupid, isn't it? It's fun, but it's fun, but it's hard. I wonder if it's important to take another one so that you don't get lost in such a place. I don't have breakfast. Well, after all, it's breakfast. Maybe Buffett thinks so and goes from pitch black to the world. You can eat it anywhere in the middle. Don't hesitate. Even if you say it's a McDonald's, I think it's the best thing that you brought when you ordered it. It's the same with Steve Jobs. It was said that Steve Jobs wore the same clothes every day, and there are actually some pictures left. The pictures of Kagayaki are wearing the same turtleneck all the time, so I don't get lost. I don't know what clothes to wear today, even if I don't get lost. I hate Japan and I, so I always wear the same clothes. I'm always wondering which clothes to wear today. It's really a waste of time to wear a breakfast, so I made it a person who hates clothes, I hate it, I feel like I hate it, it's very important to make such a habit I think that's why I can bring 8 to 9 trillion yen for breakfast 3 Why do you think it's really doubtful to eat a McDonald's? It's probably a week for him. 
I'm eating McDonald's like this so that I don't use that kind of extra willpower, and I'm waiting for myself, even if it's white, I'm using it for various decisions, so I'm going to be a millionaire I hope I can understand the feeling that it has become I think the influence is like that. It starts with Max rice. Conveyor belt sushi goes. I started with the story that conveyor belt sushi goes normally. I tried to drop Buffett's story. Keep up the good work every day. I would like to provide a fun story. Well, please give me a 6000 type poti. Thank you for your attention.

The opening **"Yes, today is IKEA"** is a letdown right from the start, but even spoken fillers such as "Ah" are picked up. Overall the accuracy is quite high, although proper nouns are still a weak point. That is unavoidable.

Even so, I'm very grateful for this result alone, because typing out the whole text myself while listening to the audio would be a pain. After all, text of this quality is generated automatically in about 2 minutes!

Afterwards

After that, I tried transcribing a 1 hour and 40 minute video. It took about 20 minutes, but the whole text was properly transcribed. It's quite excellent. It may be possible to transcribe a specific person's video on YouTube and use it as a data source for natural language processing.

That's all.


Transcription of YouTube videos using GCP's Cloud Speech-to-Text