Speech recognition using the Speech framework

Photo by Jason Rosewell / Unsplash

Interfaces. Interfaces everywhere. We use door handles when we want to go outside, keys when we want to secure something, a steering wheel when we need to drive somewhere, and tickets when we travel by bus or train.

Oh, and I "almost forgot": we use applications on our smartphones. The screen is an interface itself that hosts sub-interfaces for each application. Each application's look and feel is different. Even if they share similarities thanks to common design guidelines, they are different, and you need to learn how to use each of them.

Cognitive load refers to the amount of effort that is exerted or required while reasoning and thinking. Any mental process, from memory to perception to language, creates a cognitive load because it requires energy and effort. When cognitive load is high, thought processes are potentially interfered with. To the UX designer, a common goal when designing interfaces would be to keep users’ cognitive load to a minimum.

Check here if you want to know more

Have you heard of "Don't Make Me Think" or "The Best Interface is No Interface"? These are great books about design. The best thing about them is that you don't even need to read them to start learning! Read the titles and think about their meaning for a second.

Mindblowing.

Now imagine an application presenting financial data and a user working on a task, e.g. comparing reports and making assumptions based on the data. See the poor man touching the screen here and there, copying, binding, adding data to the comparison, preparing intermediate results, and so on.

Imagine someone who is cooking a meal and has the recipe opened on an iPad. See this person working on steps of the recipe who needs to constantly wash and dry their hands to scroll the recipe on the screen.

Imagine yourself every time you are confused and annoyed by the application's interface.

Wouldn't it be great if you could tell the application what you want it to do?

It's not easy, and the road is long and bumpy, but this doesn't mean we can't start taking the first steps.

Please, allow me to introduce the Speech framework. This will be the cornerstone of our no-interface approach:

import Speech

In this article, I will focus on the code needed for speech recognition. I won't clutter it with the application code. The full code will be linked at the end of the article for you to try out.

I want the code to be easy to understand and use, so the output of this article will be a functioning SpeechAnalyzer class:

final class SpeechAnalyzer: ObservableObject {
}

It's ObservableObject because I'm using SwiftUI in the demo application.

Since we are interested in "talking" to our applications we will analyze the live audio. We need AVAudioEngine to do that:

private let audioEngine = AVAudioEngine()

Our analyzer will have a simple, easy-to-use API:

final class SpeechAnalyzer: ObservableObject {
    @Published var recognizedText: String?
    @Published var isProcessing: Bool = false
    func start() {}
    func stop() {}
}

First, we will tackle the start function. We are working on live audio, so we need to configure a few things in the audio session:

private var inputNode: AVAudioInputNode?

func start() {
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("Couldn't configure the audio session properly")
    }

    inputNode = audioEngine.inputNode
}

I don't want to go into too much detail on this code. In short, .record makes sure other audio is silenced, and .measurement tells the session we want to:

minimize the amount of system-supplied signal processing to input and output signals

We will place taps on AVAudioInputNode in a few seconds.

Now we are finally getting to a place where it starts to be interesting. We make a few vars we will need to handle speech recognition:

private var speechRecognizer: SFSpeechRecognizer?
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var recognitionTask: SFSpeechRecognitionTask?

We start by instantiating SFSpeechRecognizer:

self.speechRecognizer = SFSpeechRecognizer()

Which should use the current locale or we can specify a concrete locale:

self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "pl_PL"))

The initializers for SFSpeechRecognizer will fail and return nil if the locale is not supported for speech recognition:

public convenience init?() // Returns speech recognizer with user's current locale, or nil if is not supported

public init?(locale: Locale) // returns nil if the locale is not supported
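
Because both initializers are failable, it can help to choose a locale before creating the recognizer. Below is a minimal sketch, assuming a hypothetical helper named firstSupportedLocale (it is not part of the Speech framework). It takes your preferred locales in order and returns the first one found in a set of supported locales. Identifiers are normalized, so "pl-PL" and "pl_PL" are treated as the same locale.

```swift
import Foundation

/// Hypothetical helper: returns the first preferred locale that appears in `supported`.
func firstSupportedLocale(preferred: [Locale], supported: Set<Locale>) -> Locale? {
    // Locale identifiers can be spelled "pl-PL" or "pl_PL"; compare one normalized form.
    func key(_ locale: Locale) -> String {
        locale.identifier.replacingOccurrences(of: "-", with: "_")
    }
    let supportedKeys = Set(supported.map(key))
    return preferred.first { supportedKeys.contains(key($0)) }
}
```

In the application you would likely pass SFSpeechRecognizer.supportedLocales() as the supported set and hand the result to SFSpeechRecognizer(locale:), still treating a nil recognizer as a possible outcome.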

SFSpeechRecognizer is a central object that facilitates the recognition but there are a few more needed. Next is:

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

Which allows us to customize the way we want to process the audio. We may choose whether we want to use the full power of Apple servers or process the speech on the device:

recognitionRequest?.requiresOnDeviceRecognition = true

Set this property to true to prevent an SFSpeechRecognitionRequest from sending audio over the network. However, on-device requests won’t be as accurate.

requiresOnDeviceRecognition documentation

Note: This will take effect if:

speechRecognizer.supportsOnDeviceRecognition

Returns true. In other words, this might be possible but is not guaranteed.

On-device speech recognition is available for some languages, but the framework also relies on Apple’s servers for speech recognition. Always assume that performing speech recognition requires a network connection.

Speech framework documentation

The other option is to allow the request to return partially recognized texts. This makes the process smoother because the results are coming right from the start and updating in real-time. If you are not interested you can wait for the final recognition:

recognitionRequest?.shouldReportPartialResults = false

If you want only final results (and you don't care about intermediate results), set this property to false to prevent the system from doing extra work.

shouldReportPartialResults documentation
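
The two request options can be combined in one place. This is a minimal sketch, assuming a hypothetical factory function named makeRecognitionRequest that is not part of the article's SpeechAnalyzer. It asks for on-device recognition only when the recognizer reports support for it, and otherwise leaves the request on its default path.

```swift
import Speech

/// Hypothetical factory: builds a request and opts into on-device recognition only when supported.
func makeRecognitionRequest(for recognizer: SFSpeechRecognizer,
                            preferOnDevice: Bool,
                            partialResults: Bool = true) -> SFSpeechAudioBufferRecognitionRequest {
    let request = SFSpeechAudioBufferRecognitionRequest()
    // Partial results arrive while the user is still talking.
    request.shouldReportPartialResults = partialResults
    // Only require on-device processing if this recognizer (locale + device) supports it.
    if preferOnDevice && recognizer.supportsOnDeviceRecognition {
        request.requiresOnDeviceRecognition = true
    }
    return request
}
```

With this shape the caller states a preference, and the support check stays next to the property it guards.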

The next step is that we need to make sure we have everything we need and whether speech recognition is available:

guard let speechRecognizer = speechRecognizer,
      speechRecognizer.isAvailable,
      let recognitionRequest = recognitionRequest,
      let inputNode = inputNode
else {
    assertionFailure("Unable to start the speech recognition!")
    return
}

Note: always check whether the speech recognizer can currently recognize speech for the locale:

speechRecognizer.isAvailable

While testing this on various devices, I noticed it wasn't available on an iPhone 12 mini but worked perfectly fine on an iPhone XR, iPhone 12 Pro Max, and iPad Air.

The time has come to provide audio to our SFSpeechAudioBufferRecognitionRequest:

let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    recognitionRequest.append(buffer)
}

This code will allow our application to tap into the live audio and pass the audio buffer to the request for speech recognition.

Now we need to create a concrete recognition task:

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in
    self?.recognizedText = result?.bestTranscription.formattedString
    
    guard error != nil || result?.isFinal == true else { return }
    self?.stop()
}

The task is provided by the speechRecognizer and uses the recognitionRequest. It's the final piece of the puzzle that ties everything together. This is the place where we get our results. The result is of type SFSpeechRecognitionResult and we are interested in:

  • bestTranscription - Which returns the SFTranscription that is considered the most accurate. The transcription provides formattedString that returns a string we can use. You can see the other results in transcriptions if you like.
  • isFinal - This indicates whether transcription is final and finished.

We set the result string to our:

@Published var recognizedText: String?

Which in turn provides this value to our application.

We will provide the implementation for self?.stop() in a moment. For now, make a mental note that when there is an error or recognition is final it's a good time to stop the recognition process.

Everything is prepared now and wired together. But there is silence. It's time to bring the sound:

audioEngine.prepare()

do {
    try audioEngine.start()
    isProcessing = true
} catch {
    print("Couldn't start audio engine!")
    stop()
}

First, we tell the AVAudioEngine to prepare, and later we start it and indicate that processing is in progress. If it couldn't start we call stop() to clear the resources.

We used the stop function a few times now. It's a good time to create it:

func stop() {
    recognitionTask?.cancel()
    
    self.audioEngine.stop()
    inputNode?.removeTap(onBus: 0)
    
    isProcessing = false
    
    recognitionRequest = nil
    recognitionTask = nil
    speechRecognizer = nil
    inputNode = nil
}

The purpose of this function is to clear everything that is not needed anymore. It cancels the currently running task, stops the audio engine, removes the tap on inputNode, signals that the analyzer is no longer processing, and releases the objects we created.
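
Note that cancel() abandons the task. If the user stops talking and you would rather receive a final transcription for the audio captured so far, one alternative is to stop feeding audio and call endAudio() on the request. This is a minimal sketch, assuming a hypothetical free function named finishListening that is not part of the article's SpeechAnalyzer.

```swift
import AVFoundation
import Speech

/// Hypothetical helper: stops capturing audio but lets the recognizer finish what it already received.
func finishListening(audioEngine: AVAudioEngine,
                     inputNode: AVAudioInputNode?,
                     request: SFSpeechAudioBufferRecognitionRequest?) {
    audioEngine.stop()              // stop pulling audio from the microphone
    inputNode?.removeTap(onBus: 0)  // no more buffers will be appended to the request
    request?.endAudio()             // mark the end of audio input for this request
}
```

After this, the task's result handler should eventually be called with a final result or an error, and the handler shown earlier then calls stop() to clean up.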

The availability of speech recognition can change and we need to monitor this state and respond accordingly. Luckily there is a delegate for that: SFSpeechRecognizerDelegate. We need to change our SpeechAnalyzer declaration to implement this protocol:

final class SpeechAnalyzer: NSObject, ObservableObject, SFSpeechRecognizerDelegate

We additionally need to inherit from NSObject because SFSpeechRecognizerDelegate inherits from NSObjectProtocol.

Hello Objective-C my old friend.

Now we let our SFSpeechRecognizer know we want to be its delegate:

speechRecognizer.delegate = self

Thanks to this when we add:

public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
    if available {
        print("✅ Available")
    } else {
        print("🔴 Unavailable")
        recognizedText = "Text recognition unavailable. Sorry!"
        stop()
    }
}

We will immediately start receiving availability information.

This was a lot to take in. I know. But it's time to see it in action. I made a simple application with a record button and label for the speech recognition result. This application is using our SpeechAnalyzer.

It's time to test. We run the application and the first tap on the button results in a crash:

This app has crashed because it attempted to access privacy-sensitive data without a usage description.  The app's Info.plist must contain an NSMicrophoneUsageDescription key with a string value explaining to the user how the app uses this data.

We need to provide the NSMicrophoneUsageDescription key in the Info.plist, with a description of why we need microphone access.

Second run, and second tap on the button. This time the alert where the user can allow, or not, microphone access is presented. We tap allow and... the application crashes again:

Error Domain=kAFAssistantErrorDomain Code=1700 "User denied access to speech recognition" UserInfo={NSLocalizedDescription=User denied access to speech recognition}

The user has to deliberately allow the application to not only use the microphone but also perform speech recognition. We must add the NSSpeechRecognitionUsageDescription key with a description to the Info.plist.

Note: If you can't find the Info.plist file in the project navigator, select the project at the top of the navigator and look for the Info tab.

Now when we run the application and tap the record button the alert for speech recognition is shown. Allow the recognition and start talking in English or any other language you created the speech recognizer for. The text will appear above the button.

Note: I did this the easy way to not complicate the example code but SFSpeechRecognizer offers methods to implement the authorization properly:

open class func authorizationStatus() -> SFSpeechRecognizerAuthorizationStatus

open class func requestAuthorization(_ handler: @escaping (SFSpeechRecognizerAuthorizationStatus) -> Void)

I encourage you to make use of these methods in your application.
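
As a starting point, here is a minimal sketch of how these two methods could be combined. The function name requestSpeechAuthorization and its Bool completion are assumptions made for illustration. Only authorizationStatus() and requestAuthorization(_:) come from the framework.

```swift
import Speech

/// Hypothetical helper: reports whether speech recognition is authorized, asking the user if needed.
func requestSpeechAuthorization(completion: @escaping (Bool) -> Void) {
    switch SFSpeechRecognizer.authorizationStatus() {
    case .authorized:
        completion(true)
    case .notDetermined:
        // Shows the system alert; requires NSSpeechRecognitionUsageDescription in Info.plist.
        SFSpeechRecognizer.requestAuthorization { status in
            // The handler may run off the main thread, so hop back before touching UI state.
            DispatchQueue.main.async {
                completion(status == .authorized)
            }
        }
    case .denied, .restricted:
        completion(false)
    @unknown default:
        completion(false)
    }
}
```

You could call this before start() and only begin recognition when the completion receives true. Microphone permission is a separate prompt and is not covered by this sketch.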

This is the full code:

import AVFoundation
import Combine
import Speech

final class SpeechAnalyzer: NSObject, ObservableObject, SFSpeechRecognizerDelegate {
    private let audioEngine = AVAudioEngine()
    private var inputNode: AVAudioInputNode?
    private var speechRecognizer: SFSpeechRecognizer?
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    
    @Published var recognizedText: String?
    @Published var isProcessing: Bool = false

    func start() {
        let audioSession = AVAudioSession.sharedInstance()
        do {
            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        } catch {
            print("Couldn't configure the audio session properly")
        }
        
        inputNode = audioEngine.inputNode
        
        speechRecognizer = SFSpeechRecognizer()
        print("Supports on device recognition: \(speechRecognizer?.supportsOnDeviceRecognition == true ? "✅" : "🔴")")

        // Force specified locale
        // self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "pl_PL"))
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        
        // Disable partial results
        // recognitionRequest?.shouldReportPartialResults = false
        
        // Enable on-device recognition
        // recognitionRequest?.requiresOnDeviceRecognition = true

        guard let speechRecognizer = speechRecognizer,
              speechRecognizer.isAvailable,
              let recognitionRequest = recognitionRequest,
              let inputNode = inputNode
        else {
            assertionFailure("Unable to start the speech recognition!")
            return
        }
        
        speechRecognizer.delegate = self
        
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
            recognitionRequest.append(buffer)
        }

        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in
            self?.recognizedText = result?.bestTranscription.formattedString
            
            guard error != nil || result?.isFinal == true else { return }
            self?.stop()
        }

        audioEngine.prepare()
        
        do {
            try audioEngine.start()
            isProcessing = true
        } catch {
            print("Couldn't start audio engine!")
            stop()
        }
    }
    
    func stop() {
        recognitionTask?.cancel()
        
        self.audioEngine.stop()
        inputNode?.removeTap(onBus: 0)
        
        isProcessing = false
        
        recognitionRequest = nil
        recognitionTask = nil
        speechRecognizer = nil
        inputNode = nil
    }
    
    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
        if available {
            print("✅ Available")
        } else {
            print("🔴 Unavailable")
            recognizedText = "Text recognition unavailable. Sorry!"
            stop()
        }
    }
}

Additionally, this is the code for the application:

import SwiftUI

struct SpeechRecognitionView: View {
    private enum Constants {
        static let recognizeButtonSide: CGFloat = 100
    }
    
    // The view creates and owns the analyzer, so @StateObject keeps it alive across view updates.
    @StateObject private var speechAnalyzer = SpeechAnalyzer()
    var body: some View {
        VStack {
            Spacer()
            Text(speechAnalyzer.recognizedText ?? "Tap to begin")
                .padding()
            
            Button {
                toggleSpeechRecognition()
            } label: {
                Image(systemName: speechAnalyzer.isProcessing ? "waveform.circle.fill" : "waveform.circle")
                    .resizable()
                    .frame(width: Constants.recognizeButtonSide,
                           height: Constants.recognizeButtonSide,
                           alignment: .center)
                    .foregroundColor(speechAnalyzer.isProcessing ? .red : .gray)
                    .aspectRatio(contentMode: .fit)
            }
            .padding()
        }
    }
}

private extension SpeechRecognitionView {
    func toggleSpeechRecognition() {
        if speechAnalyzer.isProcessing {
            speechAnalyzer.stop()
        } else {
            speechAnalyzer.start()
        }
    }
}

This is all you need to start communicating verbally with your application.

The quality of this service is at least good enough. I tested English with both on-device and server-based recognition and it worked fine. When I switched to Polish, my native language, I was surprised by how accurate the recognition was.

⚠️ Important Note: On-device recognition is less accurate, but it isn't rate-limited. Speech recognition over the network is limited:

The current rate limit for the number of SFSpeechRecognitionRequest calls a device can make is 1000 requests per hour. Please note this limit is on the number of requests that a device can make and is not tied to the application making it. This is regardless of the length of audio associated with the request. For a given SFSpeechRecognitionRequest, you are allowed up to one minute of audio per request.

The source

In short, each device can make up to 1k requests per hour. Each request can take up to 1 minute in total. This sounds reasonable but... you should be aware of these limitations.
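
The system does not expose its counter, and the limit applies to the whole device rather than to one application, so an app can only keep a rough local tally. This is a minimal sketch, assuming a hypothetical RecognitionBudget type that counts your own requests in a sliding window. The default of 1000 requests per 3600 seconds mirrors the numbers quoted above.

```swift
import Foundation

/// Hypothetical helper: counts this app's own recognition requests in a sliding time window.
struct RecognitionBudget {
    let maxRequests: Int
    let window: TimeInterval
    private var timestamps: [Date] = []

    init(maxRequests: Int = 1000, window: TimeInterval = 3600) {
        self.maxRequests = maxRequests
        self.window = window
    }

    /// Records a request at `now` and returns true, or returns false if the local tally is exhausted.
    mutating func tryConsume(now: Date = Date()) -> Bool {
        // Forget requests that have fallen out of the window.
        let cutoff = now.addingTimeInterval(-window)
        timestamps.removeAll { $0 <= cutoff }
        guard timestamps.count < maxRequests else { return false }
        timestamps.append(now)
        return true
    }
}
```

An analyzer could call tryConsume() at the beginning of start() and skip the request, or fall back to on-device recognition, when it returns false. Other applications on the same device also count toward the real limit, so treat this tally as optimistic.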

You can download the demo application here.

Enjoy!

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or tweet to @tustanowskik

If you want to be up to date and always be the first to know what I'm working on tap follow @tustanowskik on Twitter

Thank you for reading!

This article was featured in SwiftLee Weekly #90 🎉

Kamil Tustanowski

I'm an iOS developer dinosaur who remembers times when Objective-C was "the only way", we did memory management by hand and whole iPhones were smaller than screens in current models.
Poland