Speech synthesis using AVSpeechSynthesizer


Do you know what a dialogue is?

Dialogue is a written or spoken conversational exchange between two or more people, and a literary and theatrical form that depicts such an exchange.

Wikipedia

To be honest, I have a problem with this definition.

between two or more people

Why people? Why can't we converse with other beings? What is the difference?

This made me think about the Turing test:

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine's ability to render words as speech. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test results do not depend on the machine's ability to give correct answers to questions, only how closely its answers resemble those a human would give.

Wikipedia

It's a simpler, and actually real, predecessor of the Voight-Kampff test.

When the evaluator is conversing with a human, it's a dialogue. When the evaluator is conversing with a machine, it's not a dialogue any more, at least according to Wikipedia. But what happens before the evaluator decides who is a person and who is not? Can we call this part of the test a dialogue? A dialogue with a machine?

I will leave this question open. Sorry, but I'm not a philosopher.

My previous article was about Speech recognition using the Speech framework. I presented a way we can talk to our applications and make them recognize our speech. It's time to allow the applications to speak back. It's time to give them a voice.

Did you notice the AV prefix in AVSpeechSynthesizer? No new & fancy frameworks this time, just good ol' AVFoundation:

import AVFoundation

Let's imagine we are working on a cooking application. We want to allow the user to use the application without touching or even watching the screen. Consider the scenario where we want the application to inform the user that the chicken should be placed in the oven:

Bake the chicken in the oven for fifteen minutes

First, we decide what the application will say by using AVSpeechUtterance:

let englishUtterance = AVSpeechUtterance(string: "Bake the chicken in the oven for fifteen minutes")

I encourage you to immediately add:

englishUtterance.prefersAssistiveTechnologySettings = true

⚠️ There are a few ways we can tweak the way the application speaks our message. But what about users with disabilities who rely on VoiceOver? There is a high chance that the voice won't be identical to the one used by VoiceOver, which is confusing and uncomfortable for the user. This line makes sure that when VoiceOver is on, our application will use the same voice.

Next, we create an AVSpeechSynthesizer, which we will use in a moment to speak our AVSpeechUtterance:

let synthesizer = AVSpeechSynthesizer()

If you prefer a simple approach, you can add:

synthesizer.usesApplicationAudioSession = false

But note that:

If the value of this property is false, the capture session makes use of a private AVAudioSession instance for audio recording, which may cause interruption if your app uses its own audio session for playback.

usesApplicationAudioSession documentation
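If your application already manages its own audio session, an alternative is to leave usesApplicationAudioSession at its default value and configure the shared AVAudioSession yourself. This is just a sketch, not something from the documentation; the category, mode, and options below are assumptions for illustration:

// Assumption: configure the app's shared audio session so spoken messages duck other audio
let session = AVAudioSession.sharedInstance()
try? session.setCategory(.playback, mode: .spokenAudio, options: [.duckOthers])
try? session.setActive(true)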

The last part is passing the utterance to the speech synthesizer:

synthesizer.speak(englishUtterance)

As soon as you do this, you will hear the application talking to you.

The code:

let englishUtterance = AVSpeechUtterance(string: "Bake the chicken in the oven for fifteen minutes")
englishUtterance.prefersAssistiveTechnologySettings = true
let synthesizer = AVSpeechSynthesizer()
synthesizer.usesApplicationAudioSession = false
synthesizer.speak(englishUtterance)

Yes. It's that easy.

But that's not all. You can specify a concrete language, and the speech synthesizer can speak many different languages, including Polish, which I use every day:

let polishUtterance = AVSpeechUtterance(string: "Piecz kurczaka w piekarniku przez piętnaście minut")
polishUtterance.prefersAssistiveTechnologySettings = true
let polishVoice = AVSpeechSynthesisVoice(language: "pl-PL")
polishUtterance.voice = polishVoice
let synthesizer = AVSpeechSynthesizer()
synthesizer.usesApplicationAudioSession = false
synthesizer.speak(polishUtterance)

As you can see, we can create a voice matching the language of the text. When you have a voice, you need to pass it to the utterance:

let polishVoice = AVSpeechSynthesisVoice(language: "pl-PL")
polishUtterance.voice = polishVoice

You can paste the code samples into a playground to hear how they sound.

AVSpeechUtterance has a few configuration options (a short example follows the list):

  • rate - Lower values correspond to slower speech, and higher values correspond to faster speech.
  • pitchMultiplier - The baseline pitch the speech synthesizer uses when speaking the utterance.
  • postUtteranceDelay and preUtteranceDelay - The delays the synthesizer applies after finishing an utterance and before starting one. When multiple utterances are enqueued, these values create the pauses between them.
  • volume - The volume of the speech.
  • voice - The voice used to read the text. You can use a voice that doesn't match the language of the text, but this won't end well.
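Here is a minimal sketch showing these options in action - the values are arbitrary and only illustrate each property:

let configuredUtterance = AVSpeechUtterance(string: "Bake the chicken in the oven for fifteen minutes")
configuredUtterance.prefersAssistiveTechnologySettings = true
// The values below are examples, not recommendations
configuredUtterance.rate = AVSpeechUtteranceDefaultSpeechRate * 0.8 // a bit slower than the default
configuredUtterance.pitchMultiplier = 1.2 // slightly higher pitch
configuredUtterance.preUtteranceDelay = 0.5 // pause before the speech starts
configuredUtterance.postUtteranceDelay = 1.0 // pause after the speech ends
configuredUtterance.volume = 0.8 // 0.0 is silent, 1.0 is the loudest
configuredUtterance.voice = AVSpeechSynthesisVoice(language: "en-US")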

You can use:

print(AVSpeechSynthesisVoice.speechVoices())

to see the available voices (a sketch for picking one of them follows the list):

Language: ar-SA, Name: Maged, Quality: Default [com.apple.ttsbundle.Maged-compact]
Language: cs-CZ, Name: Zuzana, Quality: Default [com.apple.ttsbundle.Zuzana-compact]
Language: da-DK, Name: Sara, Quality: Default [com.apple.ttsbundle.Sara-compact]
Language: de-DE, Name: Anna, Quality: Default [com.apple.ttsbundle.Anna-compact]
Language: el-GR, Name: Melina, Quality: Default [com.apple.ttsbundle.Melina-compact]
Language: en-AU, Name: Karen, Quality: Default [com.apple.ttsbundle.Karen-compact]
Language: en-GB, Name: Daniel, Quality: Default [com.apple.ttsbundle.Daniel-compact]
Language: en-IE, Name: Moira, Quality: Default [com.apple.ttsbundle.Moira-compact]
Language: en-IN, Name: Rishi, Quality: Default [com.apple.ttsbundle.Rishi-compact]
Language: en-US, Name: Samantha, Quality: Default [com.apple.ttsbundle.Samantha-compact]
Language: en-ZA, Name: Tessa, Quality: Default [com.apple.ttsbundle.Tessa-compact]
Language: es-ES, Name: Mónica, Quality: Default [com.apple.ttsbundle.Monica-compact]
Language: es-MX, Name: Paulina, Quality: Default [com.apple.ttsbundle.Paulina-compact]
Language: fi-FI, Name: Satu, Quality: Default [com.apple.ttsbundle.Satu-compact]
Language: fr-CA, Name: Amélie, Quality: Default [com.apple.ttsbundle.Amelie-compact]
Language: fr-FR, Name: Thomas, Quality: Default [com.apple.ttsbundle.Thomas-compact]
Language: he-IL, Name: Carmit, Quality: Default [com.apple.ttsbundle.Carmit-compact]
Language: hi-IN, Name: Lekha, Quality: Default [com.apple.ttsbundle.Lekha-compact]
Language: hu-HU, Name: Mariska, Quality: Default [com.apple.ttsbundle.Mariska-compact]
Language: id-ID, Name: Damayanti, Quality: Default [com.apple.ttsbundle.Damayanti-compact]
Language: it-IT, Name: Alice, Quality: Default [com.apple.ttsbundle.Alice-compact]
Language: ja-JP, Name: Kyoko, Quality: Default [com.apple.ttsbundle.Kyoko-compact]
Language: ko-KR, Name: Yuna, Quality: Default [com.apple.ttsbundle.Yuna-compact]
Language: nl-BE, Name: Ellen, Quality: Default [com.apple.ttsbundle.Ellen-compact]
Language: nl-NL, Name: Xander, Quality: Default [com.apple.ttsbundle.Xander-compact]
Language: no-NO, Name: Nora, Quality: Default [com.apple.ttsbundle.Nora-compact]
Language: pl-PL, Name: Zosia, Quality: Default [com.apple.ttsbundle.Zosia-compact]
Language: pt-BR, Name: Luciana, Quality: Default [com.apple.ttsbundle.Luciana-compact]
Language: pt-PT, Name: Joana, Quality: Default [com.apple.ttsbundle.Joana-compact]
Language: ro-RO, Name: Ioana, Quality: Default [com.apple.ttsbundle.Ioana-compact]
Language: ru-RU, Name: Milena, Quality: Default [com.apple.ttsbundle.Milena-compact]
Language: sk-SK, Name: Laura, Quality: Default [com.apple.ttsbundle.Laura-compact]
Language: sv-SE, Name: Alva, Quality: Default [com.apple.ttsbundle.Alva-compact]
Language: th-TH, Name: Kanya, Quality: Default [com.apple.ttsbundle.Kanya-compact]
Language: tr-TR, Name: Yelda, Quality: Default [com.apple.ttsbundle.Yelda-compact]
Language: zh-CN, Name: Ting-Ting, Quality: Default [com.apple.ttsbundle.Ting-Ting-compact]
Language: zh-HK, Name: Sin-Ji, Quality: Default [com.apple.ttsbundle.Sin-Ji-compact]
Language: zh-TW, Name: Mei-Jia, Quality: Default [com.apple.ttsbundle.Mei-Jia-compact]
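If you want a specific voice from this list, you can filter speechVoices() by language code or create the voice directly from its identifier (the string in square brackets). A quick sketch, reusing polishUtterance from the earlier snippet:

// Pick a voice by language code or by its identifier from the list above
let polishVoices = AVSpeechSynthesisVoice.speechVoices().filter { $0.language == "pl-PL" }
let zosia = AVSpeechSynthesisVoice(identifier: "com.apple.ttsbundle.Zosia-compact")
polishUtterance.voice = zosia ?? polishVoices.first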

⚠️ You need to set these properties before enqueuing the utterance because setting them afterward has no effect.

This will get you going, but it will take a lot more to make your application pass the Turing test.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or tweet to @tustanowskik

If you want to be up to date and always be the first to know what I'm working on, follow @tustanowskik on Twitter.

Thank you for reading!
