This article is the 12/25 entry of the [Sumzap Advent Calendar 2020 #1](https://qiita.com/advent-calendar/2020/sumzap1).
Hello, I'm Shirahama from Sumzap. In this article, I will introduce the voice recognition processing involved in building a voice recognition application with Unity.
A voice recognition engine is software that listens to audio from a microphone and converts it into text. Several such engines expose APIs to developers, and they can be used from smartphones, PCs, browsers, and so on.
Voice recognition APIs are provided by various companies. Since I wanted to use one from a Unity application, I evaluated the candidates with that assumption in mind.
| Name | Type | Provider | Feature | Evaluation | Notes |
|---|---|---|---|---|---|
| Cloud SpeechToText | Web API | Google | Web API provided by Google | ▲ | High accuracy, but slow as a Web API |
| Speech Recognizer | Android API | Google | API available from Android native apps | ○ | Fast and accurate, though predictive conversion is a little aggressive |
| Speech Recognition | iOS API | Apple | API available from iOS native apps | ◎ | Fast and accurate, with faithful conversion |
| Azure Speech to Text | iOS/Android/Web | Microsoft | Usable on various platforms | ▲ | Did not work well on iOS |
| Watson Speech to Text | iOS/Android/Web | IBM | Usable on various platforms | ▲ | Low accuracy |
| Amazon Transcribe | Web API | Amazon | Web API | × | Slow response as a Web API |
| Web Speech API | Web API | MDN | Web API | × | Slow response as a Web API |
Based on the above evaluation, I decided to use Speech Recognizer (Google) and Speech Recognition (Apple), calling them from Unity through an Android plugin (Android SpeechRecognizer) and an iOS plugin (iOS Speech Recognition).
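Although the article does not show the full plugin interfaces, conceptually the Unity side only needs a thin bridge to each platform. Below is a minimal sketch of such a bridge; the Android plugin class name, the exported Swift function names, and the C# method names are all placeholders of my own, not the names used in the real project.

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

// Hypothetical Unity-side bridge to the two native plugins (all names assumed).
public static class VoiceRecognitionBridge
{
#if UNITY_IOS && !UNITY_EDITOR
    // Assumed entry points exported by the Swift plugin (placeholder names).
    [DllImport("__Internal")] private static extern void _startRecognition();
    [DllImport("__Internal")] private static extern void _stopRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
    // Assumed fully-qualified class name of the Android plugin (placeholder).
    private static readonly AndroidJavaObject plugin =
        new AndroidJavaObject("com.example.speech.SpeechRecognizerPlugin");
#endif

    // Start listening on whichever native recognizer the platform provides.
    public static void StartListening(string callbackObject, string callbackMethod)
    {
#if UNITY_IOS && !UNITY_EDITOR
        _startRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
        plugin.Call("startListening", callbackObject, callbackMethod);
#else
        Debug.Log("Voice recognition runs only on an actual device: " + callbackObject + "/" + callbackMethod);
#endif
    }

    // Stop listening and let the native side return its final result.
    public static void StopListening()
    {
#if UNITY_IOS && !UNITY_EDITOR
        _stopRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
        plugin.Call("stopListening");
#endif
    }
}
```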
**Android Plugin**
I wrote the plugin in Android Studio.

Implementation example:
intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
intent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, context.getPackageName());
//Language setting
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");
recognizer = SpeechRecognizer.createSpeechRecognizer(context);
recognizer.setRecognitionListener(new RecognitionListener()
{
    @Override
    public void onResults(Bundle results)
    {
        // On results.
        ArrayList<String> list = results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
        String str = "";
        for (String s : list)
        {
            if (str.length() > 0)
            {
                str += "\n";
            }
            str += s;
        }
        UnitySendMessage(callbackTarget, callbackMethod, "onResults\n" + str);
        ...
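For reference, the UnitySendMessage call above delivers a single string to the Unity side: the first line is the event name ("onResults") and the remaining lines are the recognition candidates. A minimal receiver might look like the following sketch; the class and method names are placeholders and must match whatever callbackTarget / callbackMethod you pass to the plugin.

```csharp
using UnityEngine;

public class SpeechResultReceiver : MonoBehaviour
{
    // Attach to the GameObject whose name is passed to the plugin as callbackTarget,
    // and pass "OnSpeechMessage" as callbackMethod (both names are placeholders).
    public void OnSpeechMessage(string message)
    {
        // First line = event name, remaining lines = recognition candidates
        // (see the "onResults\n" + str format in the Java code above).
        string[] lines = message.Split('\n');
        if (lines.Length == 0)
        {
            return;
        }

        if (lines[0] == "onResults")
        {
            // The first candidate is typically the most likely transcription.
            string best = lines.Length > 1 ? lines[1] : "";
            Debug.Log("Recognized: " + best);
        }
        else
        {
            Debug.Log("Speech event: " + lines[0]);
        }
    }
}
```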
**Implementation features**
- The language is set with `intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US")` (English in this implementation).
- Offline use can also be configured with `intent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true);`, but it is awkward in practice because the language data has to be downloaded to the device in advance, so on Android I assume an online connection.
- To start listening for speech, call `recognizer.startListening(intent);`.
- To stop listening, call `recognizer.stopListening();`.
**Important points**
- Once in the standby state, you cannot control the timing of recognition yourself: the device automatically detects a pause in the voice input and runs the recognition process on its own. Words spoken while that recognition process is running are not recognized, so the screen should indicate that recognition is in progress so the user does not keep talking.
- There was a problem where words spoken just before manually ending voice standby were not recognized: if you end standby abruptly, the pending recognition result is never returned. It is worth artificially creating a silent period, for example by showing a processing indicator before stopping and only ending after the recognition has run (a rough Unity-side sketch appears at the end of this Android section).
- Microphone permission must be obtained before use.
**Unity-side processing**
if (!Permission.HasUserAuthorizedPermission(Permission.Microphone)) {
    Permission.RequestUserPermission(Permission.Microphone);
}
Chromebooks have a mechanism for running Android applications, much like M1 MacBooks can run iOS apps, but speech recognition did not work there. The reason is that on a Chromebook, even if the permission is declared, the right to use voice input is not handed over by the OS (android.permission.BIND_VOICE_INTERACTION is required). I hope this area improves in the future.
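Returning to the point above about words spoken just before manually ending standby: one way to handle it on the Unity side is to show a processing indicator and wait a short moment before actually asking the plugin to stop, so the pending recognition has time to finish. The following is only a rough sketch under that assumption; the indicator, the wait duration, and the bridge call are placeholders.

```csharp
using System.Collections;
using UnityEngine;

public class RecognitionStopper : MonoBehaviour
{
    // "Recognizing..." indicator shown to the user (placeholder reference).
    [SerializeField] private GameObject processingIndicator;

    // Call this from the UI when the user wants to finish speaking.
    public void RequestStop()
    {
        StartCoroutine(StopAfterShortSilence());
    }

    private IEnumerator StopAfterShortSilence()
    {
        // Show that recognition is in progress so the user stops talking,
        // which creates an artificial silent period.
        processingIndicator.SetActive(true);

        // Give the recognizer a moment to process the last utterance
        // before actually ending standby (the duration is an assumption).
        yield return new WaitForSeconds(1.0f);

        // Placeholder bridge call into the native plugin (see the earlier sketch).
        VoiceRecognitionBridge.StopListening();

        processingIndicator.SetActive(false);
    }
}
```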
**iOS Plugin**
I wrote the plugin in Swift.

Implementation example:
static func startLiveTranscription() throws
{
    // Speech recognition request
    recognitionReq = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionReq = recognitionReq else {
        return
    }
    recognitionReq.shouldReportPartialResults = false

    // Audio session
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // Status callback to Unity
    VoiceRecoSwift.onCallbackStatus("RECORDING")

    recognitionTask = recognizer.recognitionTask(with: recognitionReq, resultHandler: { (result, error) in
        if let error = error {
            audioEngine.stop()
            self.recognitionTask = nil
            self.recognitionReq = nil
            // Status callback to Unity
            VoiceRecoSwift.onCallbackStatus(error.localizedDescription as NSString)
        } else {
            DispatchQueue.main.async {
                let resultString = result?.bestTranscription.formattedString
                print(resultString ?? "")
                if result?.isFinal == true
                {
                    // Processing at the end
                    let resultFinal = result?.bestTranscription.formattedString
                    print("FINAL:" + resultFinal!)
                    // Result callback to Unity
                    VoiceRecoSwift.onCallback(resultFinal! as NSString)
                }
            }
        }
    })

    // Microphone input settings
    let inputNode = audioEngine.inputNode
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 2048, format: recordingFormat) { (buffer, time) in
        recognitionReq.append(buffer)
    }

    // Volume measurement
    SettingVolume()

    audioEngine.prepare()
    try audioEngine.start()
...
...
**Implementation features**
- The language is set with `recognizer = SFSpeechRecognizer(locale: Locale.init(identifier: "en_US"))!` (English in this implementation).
- On iOS 13 and above, recognition can run offline (`recognitionReq.requiresOnDeviceRecognition = true`). Since offline mode works without any downloads, is accurate, and is fast, I adopted offline recognition on iOS.
- To end audio standby, call `audioEngine.stop()`, `audioEngine.inputNode.removeTap(onBus: 0)`, and `recognitionReq?.endAudio()`.
**Important points**
- In online mode, a single session is limited to about one minute; beyond that, recognition is forcibly terminated. When online, you therefore need to stop manually in order to get the spoken text recognized.
- In offline mode, the recognition behavior changes: as on Android, silence is detected automatically and the recognition process runs on its own.
- In offline mode, the recognition status cannot be obtained. On Android you receive an onEndOfSpeech callback while recognizing, but iOS has no such callback, so you have to measure the input volume yourself to detect silence and make your own callback.

Swift implementation example of volume acquisition:
// Volume measurement settings
static func SettingVolume() {
    // Data format settings
    var dataFormat = AudioStreamBasicDescription(
        mSampleRate: 44100.0,
        mFormatID: kAudioFormatLinearPCM,
        mFormatFlags: AudioFormatFlags(kLinearPCMFormatFlagIsBigEndian | kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked),
        mBytesPerPacket: 2,
        mFramesPerPacket: 1,
        mBytesPerFrame: 2,
        mChannelsPerFrame: 1,
        mBitsPerChannel: 16,
        mReserved: 0)

    // Input level setting
    var audioQueue: AudioQueueRef? = nil
    var error = noErr
    error = AudioQueueNewInput(
        &dataFormat,
        AudioQueueInputCallback as AudioQueueInputCallback,
        .none,
        .none,
        .none,
        0,
        &audioQueue)
    if error == noErr {
        self.queue = audioQueue
    }
    AudioQueueStart(self.queue, nil)

    // Get volume settings
    var enabledLevelMeter: UInt32 = 1
    AudioQueueSetProperty(self.queue, kAudioQueueProperty_EnableLevelMetering, &enabledLevelMeter, UInt32(MemoryLayout<UInt32>.size))

    self.timer = Timer.scheduledTimer(timeInterval: 1.0,
                                      target: self,
                                      selector: #selector(DetectVolume(_:)),
                                      userInfo: nil,
                                      repeats: true)
    self.timer.fire()
}

// Volume measurement
@objc static func DetectVolume(_ timer: Timer) {
    // Volume acquisition
    var levelMeter = AudioQueueLevelMeterState()
    var propertySize = UInt32(MemoryLayout<AudioQueueLevelMeterState>.size)
    AudioQueueGetProperty(
        self.queue,
        kAudioQueueProperty_CurrentLevelMeterDB,
        &levelMeter,
        &propertySize)
    self.volume = (Int)((levelMeter.mPeakPower + 144.0) * (100.0 / 144.0))
    VoiceRecoSwift.onCallbackVolume(String(self.volume) as NSString)
}
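On the Unity side, the volume values sent by onCallbackVolume (roughly once per second, on the 0–100 scale computed above) can be used to detect silence and decide when to stop recognition yourself. The following is only a sketch with assumed threshold values; the receiving method name must match whatever GameObject/method the plugin targets, and the bridge call is a placeholder.

```csharp
using UnityEngine;

public class SilenceDetector : MonoBehaviour
{
    // The values below are assumptions, not taken from the article.
    [SerializeField] private int silenceThreshold = 30;  // on the 0-100 scale sent by the plugin
    [SerializeField] private int silentTicksToStop = 2;  // volume callbacks arrive about once per second

    private int consecutiveSilentTicks;

    // Must match the GameObject/method that the plugin's onCallbackVolume targets
    // (the method name here is a placeholder).
    public void OnVolume(string volumeString)
    {
        if (!int.TryParse(volumeString, out int volume))
        {
            return;
        }

        if (volume < silenceThreshold)
        {
            consecutiveSilentTicks++;
            if (consecutiveSilentTicks >= silentTicksToStop)
            {
                // Sustained low volume is treated as the end of speech.
                consecutiveSilentTicks = 0;
                VoiceRecognitionBridge.StopListening();
            }
        }
        else
        {
            consecutiveSilentTicks = 0;
        }
    }
}
```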
- The microphone and speech recognition usage descriptions must be specified in the Xcode Info.plist.
**Unity-side processing (Editor)**
public static string microphoneUsageDescription = "Use a microphone to recognize reading aloud";
public static string speechRecognitionUsageDescription = "Use speech recognition to recognize reading aloud";

#if UNITY_IOS
private static string nameOfPlist = "Info.plist";
private static string keyForMicrophoneUsage = "NSMicrophoneUsageDescription";
private static string keyForSpeechRecognitionUsage = "NSSpeechRecognitionUsageDescription";
#endif

[PostProcessBuild]
public static void ChangeXcodePlist(BuildTarget buildTarget, string pathToBuiltProject) {
#if UNITY_IOS
    if (shouldRun && buildTarget == BuildTarget.iOS) {
        // Get plist
        string plistPath = pathToBuiltProject + "/" + nameOfPlist;
        PlistDocument plist = new PlistDocument();
        plist.ReadFromString(File.ReadAllText(plistPath));

        // Get root
        PlistElementDict rootDict = plist.root;
        rootDict.SetString(keyForMicrophoneUsage, microphoneUsageDescription);
        rootDict.SetString(keyForSpeechRecognitionUsage, speechRecognitionUsageDescription);

        // Write to file
        File.WriteAllText(plistPath, plist.WriteToString());
    }
#endif
}
At first I tried building this with a simple Web API, but when I read out a long sentence it took a long time for the result to come back, which was not acceptable, so I switched to writing native plugins myself. I struggled with the many behavioral differences between Android and iOS, but in the end the quality was satisfactory.