[Swift] Using voice recognition in Unity

This article is the 12/25 entry of the [Sumzap Advent Calendar 2020 #1](https://qiita.com/advent-calendar/2020/sumzap1).


Nice to meet you, I'm Shirahama from Sumzap. This time I will introduce the various voice recognition processes involved in building a voice recognition application with Unity.

About voice recognition engines

A voice recognition engine is software that listens to audio from a microphone and converts it into text. Several engines expose APIs to developers and can be used from smartphones, PCs, browsers, and so on.

Voice recognition APIs are provided by various companies. Since this time I will be calling one from a Unity application, I evaluated the candidates under that assumption.

| Name | Type | Provider | Features | Evaluation |
| --- | --- | --- | --- | --- |
| Cloud SpeechToText | Web API | Google | Web API provided by Google | High accuracy, but slow as a Web API |
| Speech Recognizer | Android API | Google | API available from Android native apps | Fast and accurate; predictive conversion is a little aggressive |
| Speech Recognition | iOS API | Apple | API available from iOS native apps | Fast and accurate; faithful conversion |
| Azure Speech to Text | iOS/Android/Web | Microsoft | Usable on various platforms; made by Microsoft | Did not work well on iOS |
| Watson Speech to Text | iOS/Android/Web | IBM | Usable on various platforms; made by IBM | Low accuracy |
| Amazon Transcribe | Web API | Amazon | Web API; made by Amazon | × Slow response as a Web API |
| Web Speech API | Web API | MDN | Web API documented by MDN | × Slow response as a Web API |

Based on this evaluation, I decided to use Speech Recognizer (Google) and Speech Recognition (Apple), calling them from Unity through an Android plugin (Android Speech Recognizer) and an iOS plugin (iOS Speech Recognition).

About Android Speech Recognizer

I wrote the Android plugin in Android Studio. Implementation example:

        intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, context.getPackageName());

        // Language setting
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");

        recognizer = SpeechRecognizer.createSpeechRecognizer(context);
        recognizer.setRecognitionListener(new RecognitionListener() {
            @Override
            public void onResults(Bundle results) {
                // Join the recognition candidates with newlines
                ArrayList<String> list = results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
                String str = "";
                for (String s : list) {
                    if (str.length() > 0) {
                        str += "\n";
                    }
                    str += s;
                }
                // Notify Unity of the result
                UnitySendMessage(callbackTarget, callbackMethod, "onResults\n" + str);
            }
            // ... other RecognitionListener methods omitted
        });

**Implementation features**
- The language is set with `intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US")` (English in this implementation).
- Offline use can also be configured (`intent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true);`), but it is hard to use in practice because the language data must be downloaded to the device in advance, so this implementation assumes Android is online.
- To start audio standby, call `recognizer.startListening(intent);`.
- To end audio standby, call `recognizer.stopListening();`.
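For reference, the plugin reports results to Unity as a single string: the event name on the first line, then the newline-joined recognition candidates. Below is a minimal sketch of building and parsing that message shape; the `RecognizerMessage` helper is my own illustration, not part of the actual plugin.

```java
import java.util.List;

public class RecognizerMessage {
    // Build a message in the "event\ncandidate1\ncandidate2..." shape the plugin sends
    public static String buildMessage(String event, List<String> results) {
        StringBuilder sb = new StringBuilder(event);
        for (String s : results) {
            sb.append('\n').append(s);
        }
        return sb.toString();
    }

    // Split a received message into { event name, newline-joined payload }
    public static String[] parseMessage(String message) {
        return message.split("\n", 2);
    }
}
```

On the Unity side, the callback method named in `UnitySendMessage` would receive this string and can split off the event name the same way.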

**Important points**
- Once in the standby state, the timing of recognition cannot be controlled from the app: the device automatically detects a pause in the voice input and runs the recognition process. Words spoken while recognition is running are therefore not recognized, so during recognition the screen should indicate that processing is in progress so the user does not keep speaking.
- Words spoken just before manually ending audio standby were sometimes not recognized: if standby is ended abruptly, the pending recognition result is never returned. Consider artificially creating a moment of silence, for example by showing a processing indicator before the end, and ending only after recognition has run.
- Microphone permission must be obtained before use.
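The flow described above can be modeled as a small state machine that the UI reads to decide when to show the processing indicator. This is a sketch under my own naming, not part of the Android API:

```java
public class RecognitionState {
    public enum State { IDLE, LISTENING, RECOGNIZING }

    private State state = State.IDLE;

    public State getState() { return state; }

    // startListening() was called on the recognizer
    public void onStartListening() {
        if (state == State.IDLE) state = State.LISTENING;
    }

    // The device detected a pause and began the recognition process
    public void onEndOfSpeech() {
        if (state == State.LISTENING) state = State.RECOGNIZING;
    }

    // Results arrived; ready for the next utterance
    public void onResults() {
        state = State.IDLE;
    }

    // While recognizing, tell the user not to speak
    public boolean shouldShowProcessingIndicator() {
        return state == State.RECOGNIZING;
    }
}
```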

Unity side processing

    if (!Permission.HasUserAuthorizedPermission(Permission.Microphone)) {
        // Request microphone permission before starting recognition
        Permission.RequestUserPermission(Permission.Microphone);
    }

Speech recognition on ChromeBook

ChromeBook can run Android applications, much like an M1 MacBook can run iOS apps, but when it came to speech recognition it did not work. The reason is that on ChromeBook, even if the permission is declared, the OS does not hand over the right to use voice input (`android.permission.BIND_VOICE_INTERACTION` is required). I hope this area is improved in the future.

About iOS Speech Recognition

I wrote the iOS plugin in Swift. Implementation example:

    static func startLiveTranscription() throws {
        // Speech recognition request
        recognitionReq = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionReq = recognitionReq else {
            return
        }
        recognitionReq.shouldReportPartialResults = false

        // Audio session
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

        recognitionTask = recognizer.recognitionTask(with: recognitionReq, resultHandler: { (result, error) in
            if let error = error {
                self.recognitionTask = nil
                self.recognitionReq = nil
                // Status callback to Unity
                VoiceRecoSwift.onCallbackStatus(error.localizedDescription as NSString)
            } else {
                DispatchQueue.main.async {
                    if result?.isFinal == true {
                        // Processing at the end
                        let resultFinal = result?.bestTranscription.formattedString
                        print("FINAL:" + resultFinal!)
                        // Result callback to Unity
                        VoiceRecoSwift.onCallback(resultFinal! as NSString)
                    }
                }
            }
        })

        // Microphone input settings
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 2048, format: recordingFormat) { (buffer, time) in
            // Feed microphone buffers to the recognition request
            recognitionReq.append(buffer)
        }
        try audioEngine.start()
    }

**Implementation features**
- The language is set with `recognizer = SFSpeechRecognizer(locale: Locale.init(identifier: "en_US"))!` (English in this implementation).
- On iOS 13 and above, recognition can be used offline (`recognitionReq.requiresOnDeviceRecognition = true`). Offline recognition works without any download, with high accuracy and speed, so this implementation uses offline mode on iOS.
- To end audio standby, call `audioEngine.stop()`, `audioEngine.inputNode.removeTap(onBus: 0)`, and `recognitionReq?.endAudio()`.

**Important points**
- In online mode there is a limit of one minute per session; beyond that, recognition is forcibly terminated. When online, you must stop listening manually in order to get the recognition result of what was read aloud.
- In offline mode the recognition behavior changes: as on Android, it becomes an automatic recognition process in which silence is detected and recognition runs automatically.
- In offline mode the recognition status cannot be obtained. On Android you can receive an `onEndOfSpeech` callback during recognition, but iOS has no such callback, so you have to measure the input volume yourself to detect silence and invoke the callback on your own. A Swift implementation example of volume acquisition:

    // Volume measurement settings
    static func SettingVolume() {
        // Data format settings
        var dataFormat = AudioStreamBasicDescription(
            mSampleRate: 44100.0,
            mFormatID: kAudioFormatLinearPCM,
            mFormatFlags: AudioFormatFlags(kLinearPCMFormatFlagIsBigEndian | kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked),
            mBytesPerPacket: 2,
            mFramesPerPacket: 1,
            mBytesPerFrame: 2,
            mChannelsPerFrame: 1,
            mBitsPerChannel: 16,
            mReserved: 0)
        // Input level setting
        var audioQueue: AudioQueueRef? = nil
        var error = noErr
        error = AudioQueueNewInput(
            &dataFormat,
            { _, _, _, _, _, _ in }, // no-op input callback; we only use level metering
            nil, nil, nil, 0,
            &audioQueue)
        if error == noErr {
            self.queue = audioQueue
        }
        AudioQueueStart(self.queue, nil)
        // Enable level metering on the queue
        var enabledLevelMeter: UInt32 = 1
        AudioQueueSetProperty(self.queue, kAudioQueueProperty_EnableLevelMetering, &enabledLevelMeter, UInt32(MemoryLayout<UInt32>.size))
        // Poll the volume once per second
        self.timer = Timer.scheduledTimer(timeInterval: 1.0,
                                          target: self,
                                          selector: #selector(DetectVolume(_:)),
                                          userInfo: nil,
                                          repeats: true)
    }

    // Volume measurement
    @objc static func DetectVolume(_ timer: Timer) {
        // Volume acquisition
        var levelMeter = AudioQueueLevelMeterState()
        var propertySize = UInt32(MemoryLayout<AudioQueueLevelMeterState>.size)
        AudioQueueGetProperty(self.queue, kAudioQueueProperty_CurrentLevelMeterDB, &levelMeter, &propertySize)
        // Map the peak power (dB) onto a 0-100 scale and report it to Unity
        self.volume = Int((levelMeter.mPeakPower + 144.0) * (100.0 / 144.0))
        VoiceRecoSwift.onCallbackVolume(String(self.volume) as NSString)
    }
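The conversion in `DetectVolume` maps `mPeakPower`, a decibel value roughly in the range -144 to 0, onto a 0-100 scale. The same arithmetic as a standalone function, with clamping added by me for out-of-range readings:

```java
public class VolumeScale {
    // Map a peak power in dB (about -144..0) onto a 0-100 integer scale,
    // mirroring (mPeakPower + 144.0) * (100.0 / 144.0) from the Swift code
    public static int toPercent(double peakPowerDb) {
        double v = (peakPowerDb + 144.0) * (100.0 / 144.0);
        if (v < 0.0) v = 0.0;     // clamp readings quieter than -144 dB
        if (v > 100.0) v = 100.0; // clamp readings louder than 0 dB
        return (int) v;
    }
}
```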

- The permissions for the microphone and for speech recognition must be specified in the Info.plist of the Xcode project.
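The manual end-of-speech detection described in the important points above (measure the input volume and treat a run of quiet readings as silence) can be sketched as a small platform-independent helper. This is my own illustration, not part of the plugin; the threshold and tick count are arbitrary example values:

```java
public class SilenceDetector {
    private final int threshold;      // volume (0-100) below which input counts as silence
    private final int requiredTicks;  // consecutive silent readings before firing
    private int silentTicks = 0;
    private boolean fired = false;

    public SilenceDetector(int threshold, int requiredTicks) {
        this.threshold = threshold;
        this.requiredTicks = requiredTicks;
    }

    // Feed one volume reading; returns true exactly once when silence is detected
    public boolean onVolume(int volume) {
        if (volume >= threshold) {
            silentTicks = 0;  // voice detected: reset the counter
            fired = false;
            return false;
        }
        silentTicks++;
        if (silentTicks >= requiredTicks && !fired) {
            fired = true;     // end of speech: stop the engine / end audio here
            return true;
        }
        return false;
    }
}
```

With the once-per-second volume callback from the Swift code, `new SilenceDetector(30, 2)` would fire after about two seconds of quiet.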

Unity side processing (Editor)

    public static string microphoneUsageDescription = "Use a microphone to recognize reading aloud";
    public static string speechRecognitionUsageDescription = "Use speech recognition to recognize reading aloud";

    private static string nameOfPlist = "Info.plist";
    private static string keyForMicrophoneUsage = "NSMicrophoneUsageDescription";
    private static string keyForSpeechRecognitionUsage = "NSSpeechRecognitionUsageDescription";

    public static void ChangeXcodePlist(BuildTarget buildTarget, string pathToBuiltProject) {
        if (shouldRun && buildTarget == BuildTarget.iOS) {
            // Get plist
            string plistPath = pathToBuiltProject + "/" + nameOfPlist;
            PlistDocument plist = new PlistDocument();
            plist.ReadFromString(File.ReadAllText(plistPath));
            // Get root
            PlistElementDict rootDict = plist.root;

            rootDict.SetString(keyForMicrophoneUsage, microphoneUsageDescription);
            rootDict.SetString(keyForSpeechRecognitionUsage, speechRecognitionUsageDescription);

            // Write to file
            File.WriteAllText(plistPath, plist.WriteToString());
        }
    }

Impressions

At first I built it with the simple Web API approach, but when I read out a long sentence it took a long time for the result to come back, which was not acceptable, so I switched to writing the native plugins myself. The behaviors that differ between Android and iOS gave me a hard time, but the resulting quality was satisfactory.
