This article is the 12/25 entry of the [Sumzap Advent Calendar 2020 #1](https://qiita.com/advent-calendar/2020/sumzap1).
Hello, I'm Shirahama from Sumzap. In this article, I will introduce the voice recognition processing involved in building a voice recognition application with Unity.
A voice recognition engine is software that listens to audio from a microphone and converts it into text. Several such engines expose APIs to developers, and they can be used from smartphones, PCs, browsers, and so on.
Voice recognition APIs are provided by various companies. Since I wanted to use one from a Unity application, I evaluated the candidates with that assumption in mind.
| Name | Type | Provider | Feature | Evaluation | Notes |
|---|---|---|---|---|---|
| Cloud SpeechToText | Web API | Google | Web API provided by Google | ▲ | High accuracy, but slow as a Web API |
| Speech Recognizer | Android API | Google | API available from Android native apps | ○ | Fast and accurate, though predictive conversion is a little aggressive |
| Speech Recognition | iOS API | Apple | API available from iOS native apps | ◎ | Fast and accurate, with faithful conversion |
| Azure Speech to Text | iOS/Android/Web | Microsoft | Usable on various platforms | ▲ | Did not work well on iOS |
| Watson Speech to Text | iOS/Android/Web | IBM | Usable on various platforms | ▲ | Low accuracy |
| Amazon Transcribe | Web API | Amazon | Web API | × | Slow response as a Web API |
| Web Speech API | Web API | MDN | Web API | × | Slow response as a Web API |
Based on the above evaluation, I decided to use Speech Recognizer (Google) and Speech Recognition (Apple), calling them from Unity through an Android plugin (Android SpeechRecognizer) and an iOS plugin (iOS Speech Recognition).
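Although the article does not show the full plugin interfaces, conceptually the Unity side only needs a thin bridge to each platform. Below is a minimal sketch of such a bridge; the Android plugin class name, the exported Swift function names, and the C# method names are all placeholders of my own, not the names used in the real project.

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

// Hypothetical Unity-side bridge to the two native plugins (all names assumed).
public static class VoiceRecognitionBridge
{
#if UNITY_IOS && !UNITY_EDITOR
    // Assumed entry points exported by the Swift plugin (placeholder names).
    [DllImport("__Internal")] private static extern void _startRecognition();
    [DllImport("__Internal")] private static extern void _stopRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
    // Assumed fully-qualified class name of the Android plugin (placeholder).
    private static readonly AndroidJavaObject plugin =
        new AndroidJavaObject("com.example.speech.SpeechRecognizerPlugin");
#endif

    // Start listening on whichever native recognizer the platform provides.
    public static void StartListening(string callbackObject, string callbackMethod)
    {
#if UNITY_IOS && !UNITY_EDITOR
        _startRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
        plugin.Call("startListening", callbackObject, callbackMethod);
#else
        Debug.Log("Voice recognition runs only on an actual device: " + callbackObject + "/" + callbackMethod);
#endif
    }

    // Stop listening and let the native side return its final result.
    public static void StopListening()
    {
#if UNITY_IOS && !UNITY_EDITOR
        _stopRecognition();
#elif UNITY_ANDROID && !UNITY_EDITOR
        plugin.Call("stopListening");
#endif
    }
}
```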
**Android Plugin**
I wrote the plugin in Android Studio.

Implementation example:
intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
intent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, context.getPackageName());
//Language setting
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US");
recognizer = SpeechRecognizer.createSpeechRecognizer(context);
recognizer.setRecognitionListener(new RecognitionListener()
{
    @Override
    public void onResults(Bundle results)
    {
        // On results.
        ArrayList<String> list = results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
        String str = "";
        for (String s : list)
        {
            if (str.length() > 0)
            {
                str += "\n";
            }
            str += s;
        }
        UnitySendMessage(callbackTarget, callbackMethod, "onResults\n" + str);
        ...
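For reference, the UnitySendMessage call above delivers a single string to the Unity side: the first line is the event name ("onResults") and the remaining lines are the recognition candidates. A minimal receiver might look like the following sketch; the class and method names are placeholders and must match whatever callbackTarget / callbackMethod you pass to the plugin.

```csharp
using UnityEngine;

public class SpeechResultReceiver : MonoBehaviour
{
    // Attach to the GameObject whose name is passed to the plugin as callbackTarget,
    // and pass "OnSpeechMessage" as callbackMethod (both names are placeholders).
    public void OnSpeechMessage(string message)
    {
        // First line = event name, remaining lines = recognition candidates
        // (see the "onResults\n" + str format in the Java code above).
        string[] lines = message.Split('\n');
        if (lines.Length == 0)
        {
            return;
        }

        if (lines[0] == "onResults")
        {
            // The first candidate is typically the most likely transcription.
            string best = lines.Length > 1 ? lines[1] : "";
            Debug.Log("Recognized: " + best);
        }
        else
        {
            Debug.Log("Speech event: " + lines[0]);
        }
    }
}
```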
**Implementation features**
- The language is set with `intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-US")` (English in this implementation).
- Offline use can also be configured with `intent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true);`, but it is awkward in practice because the language data has to be downloaded to the device in advance, so on Android I assume an online connection.
- To start listening for speech, call `recognizer.startListening(intent);`.
- To stop listening, call `recognizer.stopListening();`.
**Important points**
- Once in the standby state, you cannot control the timing of recognition yourself: the device automatically detects a pause in the voice input and runs the recognition process on its own. Words spoken while that recognition process is running are not recognized, so the screen should indicate that recognition is in progress so the user does not keep talking.
- There was a problem where words spoken just before manually ending voice standby were not recognized: if you end standby abruptly, the pending recognition result is never returned. It is worth artificially creating a silent period, for example by showing a processing indicator before stopping and only ending after the recognition has run (a rough Unity-side sketch appears at the end of this Android section).
- Microphone permission must be obtained before use.
**Unity-side processing**
if (!Permission.HasUserAuthorizedPermission(Permission.Microphone)) {
    Permission.RequestUserPermission(Permission.Microphone);
}
Chromebooks have a mechanism for running Android applications, much like M1 MacBooks can run iOS apps, but speech recognition did not work there. The reason is that on a Chromebook, even if the permission is declared, the right to use voice input is not handed over by the OS (android.permission.BIND_VOICE_INTERACTION is required). I hope this area improves in the future.
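Returning to the point above about words spoken just before manually ending standby: one way to handle it on the Unity side is to show a processing indicator and wait a short moment before actually asking the plugin to stop, so the pending recognition has time to finish. The following is only a rough sketch under that assumption; the indicator, the wait duration, and the bridge call are placeholders.

```csharp
using System.Collections;
using UnityEngine;

public class RecognitionStopper : MonoBehaviour
{
    // "Recognizing..." indicator shown to the user (placeholder reference).
    [SerializeField] private GameObject processingIndicator;

    // Call this from the UI when the user wants to finish speaking.
    public void RequestStop()
    {
        StartCoroutine(StopAfterShortSilence());
    }

    private IEnumerator StopAfterShortSilence()
    {
        // Show that recognition is in progress so the user stops talking,
        // which creates an artificial silent period.
        processingIndicator.SetActive(true);

        // Give the recognizer a moment to process the last utterance
        // before actually ending standby (the duration is an assumption).
        yield return new WaitForSeconds(1.0f);

        // Placeholder bridge call into the native plugin (see the earlier sketch).
        VoiceRecognitionBridge.StopListening();

        processingIndicator.SetActive(false);
    }
}
```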
**iOS Plugin**
I wrote the plugin in Swift.

Implementation example:
static func startLiveTranscription() throws
{
    // Speech recognition request
    recognitionReq = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionReq = recognitionReq else {
        return
    }
    recognitionReq.shouldReportPartialResults = false

    // Audio session
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // Status callback to Unity
    VoiceRecoSwift.onCallbackStatus("RECORDING")

    recognitionTask = recognizer.recognitionTask(with: recognitionReq, resultHandler: { (result, error) in
        if let error = error {
            audioEngine.stop()
            self.recognitionTask = nil
            self.recognitionReq = nil
            // Status callback to Unity
            VoiceRecoSwift.onCallbackStatus(error.localizedDescription as NSString)
        } else {
            DispatchQueue.main.async {
                let resultString = result?.bestTranscription.formattedString
                print(resultString ?? "")
                if result?.isFinal == true
                {
                    // Processing at the end
                    let resultFinal = result?.bestTranscription.formattedString
                    print("FINAL:" + resultFinal!)
                    // Result callback to Unity
                    VoiceRecoSwift.onCallback(resultFinal! as NSString)
                }
            }
        }
    })

    // Microphone input settings
    let inputNode = audioEngine.inputNode
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 2048, format: recordingFormat) { (buffer, time) in
        recognitionReq.append(buffer)
    }

    // Volume measurement
    SettingVolume()

    audioEngine.prepare()
    try audioEngine.start()
...
...
**Implementation features**
- The language is set with `recognizer = SFSpeechRecognizer(locale: Locale.init(identifier: "en_US"))!` (English in this implementation).
- On iOS 13 and above, recognition can run offline (`recognitionReq.requiresOnDeviceRecognition = true`). Since offline mode works without any downloads, is accurate, and is fast, I adopted offline recognition on iOS.
- To end audio standby, call `audioEngine.stop()`, `audioEngine.inputNode.removeTap(onBus: 0)`, and `recognitionReq?.endAudio()`.
**Important points**
- In online mode, a single session is limited to about one minute; beyond that, recognition is forcibly terminated. When online, you therefore need to stop manually in order to get the spoken text recognized.
- In offline mode, the recognition behavior changes: as on Android, silence is detected automatically and the recognition process runs on its own.
- In offline mode, the recognition status cannot be obtained. On Android you receive an onEndOfSpeech callback while recognizing, but iOS has no such callback, so you have to measure the input volume yourself to detect silence and make your own callback.

Swift implementation example of volume acquisition:
// Volume measurement settings
static func SettingVolume() {
    // Data format settings
    var dataFormat = AudioStreamBasicDescription(
        mSampleRate: 44100.0,
        mFormatID: kAudioFormatLinearPCM,
        mFormatFlags: AudioFormatFlags(kLinearPCMFormatFlagIsBigEndian | kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked),
        mBytesPerPacket: 2,
        mFramesPerPacket: 1,
        mBytesPerFrame: 2,
        mChannelsPerFrame: 1,
        mBitsPerChannel: 16,
        mReserved: 0)

    // Input level setting
    var audioQueue: AudioQueueRef? = nil
    var error = noErr
    error = AudioQueueNewInput(
        &dataFormat,
        AudioQueueInputCallback as AudioQueueInputCallback,
        .none,
        .none,
        .none,
        0,
        &audioQueue)
    if error == noErr {
        self.queue = audioQueue
    }
    AudioQueueStart(self.queue, nil)

    // Get volume settings
    var enabledLevelMeter: UInt32 = 1
    AudioQueueSetProperty(self.queue, kAudioQueueProperty_EnableLevelMetering, &enabledLevelMeter, UInt32(MemoryLayout<UInt32>.size))

    self.timer = Timer.scheduledTimer(timeInterval: 1.0,
                                      target: self,
                                      selector: #selector(DetectVolume(_:)),
                                      userInfo: nil,
                                      repeats: true)
    self.timer.fire()
}

// Volume measurement
@objc static func DetectVolume(_ timer: Timer) {
    // Volume acquisition
    var levelMeter = AudioQueueLevelMeterState()
    var propertySize = UInt32(MemoryLayout<AudioQueueLevelMeterState>.size)
    AudioQueueGetProperty(
        self.queue,
        kAudioQueueProperty_CurrentLevelMeterDB,
        &levelMeter,
        &propertySize)
    self.volume = (Int)((levelMeter.mPeakPower + 144.0) * (100.0 / 144.0))
    VoiceRecoSwift.onCallbackVolume(String(self.volume) as NSString)
}
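On the Unity side, the volume values sent by onCallbackVolume (roughly once per second, on the 0–100 scale computed above) can be used to detect silence and decide when to stop recognition yourself. The following is only a sketch with assumed threshold values; the receiving method name must match whatever GameObject/method the plugin targets, and the bridge call is a placeholder.

```csharp
using UnityEngine;

public class SilenceDetector : MonoBehaviour
{
    // The values below are assumptions, not taken from the article.
    [SerializeField] private int silenceThreshold = 30;  // on the 0-100 scale sent by the plugin
    [SerializeField] private int silentTicksToStop = 2;  // volume callbacks arrive about once per second

    private int consecutiveSilentTicks;

    // Must match the GameObject/method that the plugin's onCallbackVolume targets
    // (the method name here is a placeholder).
    public void OnVolume(string volumeString)
    {
        if (!int.TryParse(volumeString, out int volume))
        {
            return;
        }

        if (volume < silenceThreshold)
        {
            consecutiveSilentTicks++;
            if (consecutiveSilentTicks >= silentTicksToStop)
            {
                // Sustained low volume is treated as the end of speech.
                consecutiveSilentTicks = 0;
                VoiceRecognitionBridge.StopListening();
            }
        }
        else
        {
            consecutiveSilentTicks = 0;
        }
    }
}
```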
- The microphone and speech recognition usage descriptions must be specified in the Xcode Info.plist.
**Unity-side processing (Editor)**
public static string microphoneUsageDescription = "Use a microphone to recognize reading aloud";
public static string speechRecognitionUsageDescription = "Use speech recognition to recognize reading aloud";

#if UNITY_IOS
private static string nameOfPlist = "Info.plist";
private static string keyForMicrophoneUsage = "NSMicrophoneUsageDescription";
private static string keyForSpeechRecognitionUsage = "NSSpeechRecognitionUsageDescription";
#endif

[PostProcessBuild]
public static void ChangeXcodePlist(BuildTarget buildTarget, string pathToBuiltProject) {
#if UNITY_IOS
    if (shouldRun && buildTarget == BuildTarget.iOS) {
        // Get plist
        string plistPath = pathToBuiltProject + "/" + nameOfPlist;
        PlistDocument plist = new PlistDocument();
        plist.ReadFromString(File.ReadAllText(plistPath));

        // Get root
        PlistElementDict rootDict = plist.root;
        rootDict.SetString(keyForMicrophoneUsage, microphoneUsageDescription);
        rootDict.SetString(keyForSpeechRecognitionUsage, speechRecognitionUsageDescription);

        // Write to file
        File.WriteAllText(plistPath, plist.WriteToString());
    }
#endif
}
At first I tried building this with a simple Web API, but when I read out a long sentence it took a long time for the result to come back, which was not acceptable, so I switched to writing native plugins myself. I struggled with the many behavioral differences between Android and iOS, but in the end the quality was satisfactory.