I tried transcribing the news of that business integration with Amazon Transcribe

Introduction

I heard that Amazon Transcribe, AWS's automatic speech recognition service, supports Japanese, so I tried transcribing some Japanese audio with it.

Subject

I used the following news video.

- Yahoo and LINE integration announced - TV Tokyo NEWS

One announcer reads the news for about a minute, and there is no conversation between multiple people.

Transcription method

Transcribe can be used from the management console, but this time I will try using the AWS CLI.

1. Request a conversion

Prepare the following JSON in advance, and then execute `aws transcribe start-transcription-job`.

request.json


{
  "TranscriptionJobName": "yl",
  "LanguageCode": "ja-JP",
  "MediaFormat": "mp3",
  "Media": {
    "MediaFileUri": "https://foobar.s3-ap-northeast-1.amazonaws.com/yl.mp3"
  }
}

$ aws transcribe start-transcription-job --cli-input-json file://request.json

When the request is accepted, the following response will be returned.

{
    "TranscriptionJob": {
        "TranscriptionJobName": "yl",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "LanguageCode": "ja-JP",
        "MediaSampleRateHertz": 48000,
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "https://foobar.s3-ap-northeast-1.amazonaws.com/yl.mp3"
        },
        "CreationTime": 1574510851.993
    }
}

2. Check the status of the conversion job

Check the status of the conversion job with the `list-transcription-jobs` subcommand.

$ aws transcribe list-transcription-jobs --job-name-contains "yl"

If the conversion is complete, the status will be returned as COMPLETED.

{
    "TranscriptionJobSummaries": [
        {
            "TranscriptionJobName": "yl",
            "CreationTime": 1574510995.946,
            "CompletionTime": 1574511071.683,
            "LanguageCode": "ja-JP",
            "TranscriptionJobStatus": "COMPLETED",
            "OutputLocationType": "SERVICE_BUCKET"
        }
    ]
}

It took about 1 minute and 15 seconds to complete this 1-minute audio conversion job.
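The elapsed time can be checked from the two timestamps in the response above. This is only a small sketch: `CreationTime` and `CompletionTime` are treated as seconds since the Unix epoch, and the variable names are my own.

```python
from datetime import datetime, timezone

# Timestamps copied from the list-transcription-jobs response above
# (treated as seconds since the Unix epoch).
creation_time = 1574510995.946
completion_time = 1574511071.683

# Elapsed time between job creation and job completion.
elapsed_seconds = completion_time - creation_time

# Human-readable creation time in UTC, just for reference.
created_at = datetime.fromtimestamp(creation_time, tz=timezone.utc)

print(f"created at (UTC): {created_at:%Y-%m-%d %H:%M:%S}")
print(f"elapsed: {elapsed_seconds:.1f} seconds")
```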

3. Get the conversion result

Get the URI of the conversion result file with the `get-transcription-job` subcommand.

$ aws transcribe get-transcription-job --transcription-job-name "yl"

{
    "TranscriptionJob": {
        "TranscriptionJobName": "yl",
        "TranscriptionJobStatus": "COMPLETED",
        "LanguageCode": "ja-JP",
        "MediaSampleRateHertz": 44100,
        "MediaFormat": "mp3",
        "Media": {
            "MediaFileUri": "https://foobar.s3-ap-northeast-1.amazonaws.com/yl.mp3"
        },
        "Transcript": {
            "TranscriptFileUri": "https://s3.ap-northeast-1.amazonaws.com/aws-transcribe-ap-northeast-1-prod/(omitted)/asrOutput.json?X-Amz-Security-Token=xxxxxxxx&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191123T122051Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=xxxxxxxx&X-Amz-Signature=xxxxxxxx"
        },
        "CreationTime": 1574510995.946,
        "CompletionTime": 1574511071.683,
        "Settings": {
            "ChannelIdentification": false
        }
    }
}

Since TranscriptFileUri is the URI of the conversion result file, get the file from here.
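Note that this URI is pre-signed: the query string in the response above carries `X-Amz-Date` and `X-Amz-Expires=900`, so the link is only valid for 15 minutes after it was issued. As a rough sketch, the expiry time can be read from the URI like this. The function name `presigned_url_expiry` and the shortened example URI (including `example-bucket`) are my own placeholders, not values from the actual job.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4 pre-signed URL expires."""
    query = parse_qs(urlsplit(url).query)
    # X-Amz-Date is the signing time in the form YYYYMMDDTHHMMSSZ (UTC).
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity period in seconds, counted from X-Amz-Date.
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


# Shortened, hypothetical stand-in for the TranscriptFileUri shown above.
uri = (
    "https://s3.ap-northeast-1.amazonaws.com/example-bucket/asrOutput.json"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20191123T122051Z"
    "&X-Amz-SignedHeaders=host&X-Amz-Expires=900"
)
print(presigned_url_expiry(uri))
```

If the link has already expired, running `get-transcription-job` again should return a fresh URI.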

The obtained file was as follows (excerpt).

The results are divided into `transcripts` and `items`.

{
  "jobName": "yl",
  "accountId": "xxxxxxxxxxxx",
  "results": {
    "transcripts": [
      {
        "transcript": "Search service Yahoo Japan (omitted) Announcement"
      }
    ],
    "items": [
      {
        "start_time": "0.04",
        "end_time": "0.54",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Search"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "0.54",
        "end_time": "1.18",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "service"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.25",
        "end_time": "1.73",
        "alternatives": [
          {
            "confidence": "0.5202",
            "content": "Yahoo"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.73",
        "end_time": "2.1",
        "alternatives": [
          {
            "confidence": "0.5202",
            "content": "Japan"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "2.1",
        "end_time": "2.23",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "of"
          }
        ],
        "type": "pronunciation"
      },
      // (omitted)
      {
        "start_time": "59.13",
        "end_time": "59.48",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Presentation"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "59.48",
        "end_time": "59.63",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Shi"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "59.63",
        "end_time": "60.05",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Masu"
          }
        ],
        "type": "pronunciation"
      }
    ]
  },
  "status": "COMPLETED"
}

Transcription result

The `transcript` value in the JSON above is the entire transcription result.

**Search service Yahoo Japan's parent company Set Holdings and telecommunications app giant Line announced today that they have agreed to merge their operations, and a huge partner company with hundreds of millions of users will be born. did. In the final agreement, Softbank, the parent company of Set Holdings, and Never Korea, the parent company of Line, will create a new company in which Iso% will be invested, and Holdings will be placed under the umbrella of Yahoo and Line as subsidiaries. Kentaro Kawabe, the current president of Set Holdings, will be the representative of Yahoo's parent company, Set Holdings, and Kentaro Kawabe, the president of Yahoo, will join the line with the president and prefectural daughter. From 5 pm, we will open a constitutional amendment and announce the aim of integration, etc.**

Setting aside proper nouns and homophones, the following misconversions stand out (transcribed text: what was actually said).

- Hundreds of millions: 100 million (note: the news says 100 million)
- Giant partner company: Giant IT company
- Iso percent: 50 percent
- President and prefectural daughter: President and Co-CEO
- Renewal as CEO: Representative Director and Co-CEO

It seems that Co-CEO was difficult.

I'm not sure where "Iso percent" came from.

Commentary

items

There are some half-width spaces in the transcript. Amazon Transcribe returns

- the overall transcription result in `transcript`
- the transcription result for each part-of-speech unit in `items`

and `transcript` contains half-width spaces that separate these part-of-speech units.

{
  "jobName": "yl",
  "accountId": "xxxxxxxxxxxx",
  "results": {
    "transcripts": [
      {
        "transcript": "Search service Yahoo Japan (omitted) Announcement"
      }
    ],
    "items": [
      {
        "start_time": "0.04",
        "end_time": "0.54",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Search"
          }
        ],
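
To illustrate the relationship described above, joining the `content` of each item with a half-width space should roughly reproduce the `transcript` string. This is a minimal sketch under that assumption: `join_items` is a hypothetical helper of mine, the two sample items are trimmed from the excerpt, and I have not checked how punctuation items are spaced.

```python
# Two items in the shape shown in the excerpt (trimmed to the fields used here).
items = [
    {"alternatives": [{"confidence": "1.0", "content": "Search"}]},
    {"alternatives": [{"confidence": "1.0", "content": "service"}]},
]


def join_items(items: list) -> str:
    """Join the top alternative of every item with a half-width space."""
    return " ".join(item["alternatives"][0]["content"] for item in items)


print(join_items(items))
```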

confidence

In addition, the confidence score of the conversion result for each part-of-speech unit is stored in `confidence`.

I processed the obtained JSON file so that each item is displayed in a color that depends on its confidence:

- If 1.0 (highest value), **black**
- If 0.9 or more and less than 1.0, **dark gray**
- If less than 0.9, **light gray**

import json


def conv_color(confidence: float) -> str:
    """Map a confidence score to an HTML color name."""
    if confidence == 1:
        return "black"   # highest confidence
    elif confidence >= 0.9:
        return "gray"    # dark gray
    else:
        return "silver"  # light gray


with open('./transcript.json', encoding='utf-8') as f:
    d = json.load(f)

# Wrap each item in a <font> tag colored by the confidence of its first alternative.
for item in d['results']['items']:
    alt = item['alternatives'][0]
    color = conv_color(float(alt['confidence']))
    print(f'<font color="{color}">{alt["content"]}</font>', end='')
print()

The result is as follows. The lighter parts are where Amazon Transcribe is less confident in the conversion result.

**Search Services Yahoo Japan parent company set Holdings and Communication Apps Major Line is Today Management Integration Do that with agreement ta and announcement, Better Use Person Number Billion people scale huge Other party Company is Birth Both companies Announcement. Agreement Draft with Is final target to set Holdings parent company Softbank and line parent company Korea Never is Iso Percent one by one investment, new company Making Masu That Affiliated Holdings Place in Yahoo and line to subsidiary, line And Yahoo Parent company and become set Holdings Representative is Now Set Holdings Kawabe Kentaro President is Representative Director President Prefecture Daughter Line Insert Zawa Tsuyoshi President is Representative Director Rights Update and Joint with, Arrival Masu Both companies is Today Afternoon Five Time From Constitutional reform Open integration aim etc. Announcement,**

Cost

Transcription costs 0.0004 USD per second of audio. However, audio shorter than 15 seconds is billed as 15 seconds.

The audio this time is about 1 minute long, so it comes to about 2 to 3 yen.

There is a 60-minute free usage tier every month for the first year of use.
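The estimate above can be reproduced with a few lines of arithmetic. This is a sketch based only on the per-second price and the 15-second minimum quoted above. It ignores the free tier, and the exchange rate `USD_TO_JPY` is an assumed value for illustration, not an official figure.

```python
PRICE_PER_SECOND_USD = 0.0004  # per-second price quoted above
MINIMUM_BILLED_SECONDS = 15    # shorter audio is billed as 15 seconds


def transcription_cost_usd(audio_seconds: float) -> float:
    """Estimate the charge for one job, applying the 15-second minimum."""
    billed_seconds = max(audio_seconds, MINIMUM_BILLED_SECONDS)
    return billed_seconds * PRICE_PER_SECOND_USD


USD_TO_JPY = 109.0  # assumed exchange rate, for illustration only

cost_usd = transcription_cost_usd(60)
print(f"{cost_usd:.4f} USD, roughly {cost_usd * USD_TO_JPY:.1f} yen")
```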

- Amazon Transcribe Pricing

Finally

Even with an announcer's clear voice, some misconversions occurred. Amazon Transcribe provides a confidence score for each part-of-speech unit of the conversion result, so the key may be how to make use of it.
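One possible way to use the confidence values is to list only the items below a threshold, together with their timestamps, so that a person can listen to just those parts again. The following is a minimal sketch of that idea. `low_confidence_items` and the 0.9 threshold are my own choices, and the sample items are trimmed from the excerpt earlier in this article.

```python
def low_confidence_items(items: list, threshold: float = 0.9) -> list:
    """Return (start_time, end_time, content, confidence) for doubtful items."""
    flagged = []
    for item in items:
        alt = item["alternatives"][0]
        confidence = float(alt["confidence"])
        if confidence < threshold:
            # .get() is used because some items may not carry timestamps.
            flagged.append(
                (item.get("start_time"), item.get("end_time"),
                 alt["content"], confidence)
            )
    return flagged


# Items trimmed from the excerpt shown earlier.
sample_items = [
    {"start_time": "0.04", "end_time": "0.54",
     "alternatives": [{"confidence": "1.0", "content": "Search"}]},
    {"start_time": "1.25", "end_time": "1.73",
     "alternatives": [{"confidence": "0.5202", "content": "Yahoo"}]},
    {"start_time": "1.73", "end_time": "2.1",
     "alternatives": [{"confidence": "0.5202", "content": "Japan"}]},
]

for start, end, content, confidence in low_confidence_items(sample_items):
    print(f"{start}-{end}s  {content}  (confidence {confidence})")
```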

Also, although I have not tried it this time, it is possible to specify the number of speakers in the audio from 2 to 10, and it seems that transcription for each speaker is possible. It would be interesting to transcribe the audio of a TV program or a meeting.
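For reference, speaker identification is requested through the `Settings` field of the job request, using `ShowSpeakerLabels` and `MaxSpeakerLabels`. The sketch below only builds such a request as JSON. I have not run it, the job name `yl-speakers` is hypothetical, and whether the feature is available for `ja-JP` should be checked in the documentation.

```python
import json

# Same request as in step 1, plus speaker identification settings (untested).
request = {
    "TranscriptionJobName": "yl-speakers",  # hypothetical job name
    "LanguageCode": "ja-JP",
    "MediaFormat": "mp3",
    "Media": {
        "MediaFileUri": "https://foobar.s3-ap-northeast-1.amazonaws.com/yl.mp3"
    },
    "Settings": {
        "ShowSpeakerLabels": True,  # ask for per-speaker labels
        "MaxSpeakerLabels": 2,      # expected number of speakers
    },
}

# The printed JSON could be saved to a file and passed with --cli-input-json.
print(json.dumps(request, indent=2))
```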

Reference

- Amazon Transcribe - AWS Documentation
