Sound classification using the Sound Analysis framework


Sound means as little to our applications as images do. It's data without any inherent meaning. Users can decode it, enjoy it or hate it, understand it. Applications can analyze sound patterns, trim pauses, and apply other transformations and improvements, but they can't understand what the sound file represents.

It's time to change that.

I have written a few articles about using the Vision framework in applications, and I'm far from done, but I love to experiment and look for new ways of user-device communication and interaction. This brings me to today's topic, which is not related to the Vision framework but is, in a way, vision for sounds.

Please meet the Sound Analysis framework.

import SoundAnalysis

Today's code samples will be provided in a playground. I will add a link to this playground at the end of the article.

Sound classification requests are more eager to throw errors than Vision requests, so we will start with a do-catch:

do {
    /* Our classification code will go here */
} catch {
    print("Something went terribly wrong!")
}

And we create an SNClassifySoundRequest:

let soundClassifyRequest = try SNClassifySoundRequest(classifierIdentifier: .version1)

We initialize SNClassifySoundRequest with the only currently available version, which is 1. But this request is much more flexible:

Alternatively, you identify a custom set of sounds by providing the sound request with a custom Core ML model

The Sound Analysis framework documentation

This means we can train our own machine learning models to classify custom sounds. Let me know whether you would like to read more about this because it's material for a whole article.
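For reference, creating such a request is just a different initializer. A minimal sketch, assuming we already trained a model in Create ML (MySoundClassifier is a hypothetical generated model class, and MLModelConfiguration needs import CoreML):

// MySoundClassifier is a hypothetical Create ML-generated class
let customModel = try MySoundClassifier(configuration: MLModelConfiguration()).model
let customRequest = try SNClassifySoundRequest(mlModel: customModel)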

Now let's get back to our built-in, version1, request and make sure it can classify the sounds we are interested in. This code will print all the supported identifiers:

soundClassifyRequest.knownClassifications
    .enumerated()
    .forEach { index, identifier in print("\(index). \(identifier)") }
0. speech
1. shout
2. yell
3. battle_cry
4. children_shouting
5. screaming
6. whispering
7. laughter
8. baby_laughter
9. giggling
10. snicker
11. belly_laugh
12. chuckle_chortle
13. crying_sobbing
14. baby_crying
15. sigh
16. singing
17. choir_singing
18. yodeling
19. rapping
20. humming
21. whistling
22. breathing
23. snoring
24. gasp
25. cough
26. sneeze
27. nose_blowing
28. person_running
29. person_shuffling
30. person_walking
[...]

The full list contains more than 300 identifiers. You can check it here.
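If your feature depends on one specific sound, it's worth verifying up front that the built-in classifier knows about it. knownClassifications is a plain array of identifier strings, so a simple contains check is enough:

// Make sure the sound we care about is on the list of known classifications
if soundClassifyRequest.knownClassifications.contains("thunder") {
    print("The built-in classifier can detect thunder")
}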

Next, let's add the sound we want to analyze to the playground Resources. Please check Working with the Vision framework in the playgrounds for instructions on how to add resources to a playground.

I used Approaching thunderstorm with light rain from zapsplat.

With the file in place, we need to construct the URL which we can then pass to the SNAudioFileAnalyzer:

let fileUrl = Bundle.main.url(forResource: "storm", withExtension: "mp3")

The URL is optional and will be nil if the file can't be located. We need this file for processing and can't do anything without it. We are in a do-catch block, so we should make use of it. First, we create an error:

enum FileError: Error {
    case notFound
}

And then we throw it if the file is not there:

guard let fileUrl = Bundle.main.url(forResource: "storm", withExtension: "mp3") else { throw FileError.notFound }

Let's move this above the line where we create the request. There is no need for a request if we don't have a file.
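After this change, the do block looks like this:

do {
    guard let fileUrl = Bundle.main.url(forResource: "storm", withExtension: "mp3") else { throw FileError.notFound }
    let soundClassifyRequest = try SNClassifySoundRequest(classifierIdentifier: .version1)
    /* The rest of our classification code will go here */
} catch {
    print("Something went terribly wrong!")
}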

The next step is to create the SNAudioFileAnalyzer. This is where we use the file URL:

let audioFileAnalyzer = try SNAudioFileAnalyzer(url: fileUrl)

The analyzer needs a file to analyze. To start the analysis, we add the request we want the analyzer to use to process the file:

open func add(_ request: SNRequest, withObserver observer: SNResultsObserving) throws

The catch is that it requires an observer which must conform to the SNResultsObserving protocol. This protocol consists of three methods:

public protocol SNResultsObserving : NSObjectProtocol {
    func request(_ request: SNRequest, didProduce result: SNResult)
    optional func request(_ request: SNRequest, didFailWithError error: Error)
    optional func requestDidComplete(_ request: SNRequest)
}

You can find more info in the documentation.

This is how the analyzer communicates the status of the analysis. We need to make a small class that conforms to this protocol and subclasses NSObject due to the NSObjectProtocol requirement, which can trigger a wave of Objective-C memories in those who remember those days.

Let's conform to the protocol. The requestDidComplete and didFailWithError methods are self-explanatory:

final class AudioAnalysisObserver: NSObject, SNResultsObserving {
    func requestDidComplete(_ request: SNRequest) {
        print("Processing completed!")
    }
    
    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Failed with \(error)")
    }
}

But there is more to do in the function that delivers the SNResult values:

func request(_ request: SNRequest, didProduce result: SNResult)

The first problem we need to solve is casting the result to the concrete result type. SNResult is an empty protocol. The documentation is here.

The SNClassificationResult is what we need:

guard let result = result as? SNClassificationResult else  { return }

SNClassificationResult contains:

  • classifications - an array of SNClassification values containing the identifiers and the confidence (as usual, within the [0, 1] range).
  • timeRange - a CMTimeRange describing which part of the audio this result covers.
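If you need more than the top match, everything is there in the classifications array. A small sketch, assuming we are inside the didProduce callback:

// Print the three most confident classifications for this slice of audio
result.classifications
    .prefix(3)
    .forEach { print("\($0.identifier): \($0.confidence)") }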

Classifications are sorted by confidence, so if we take the first one we get the best match:

guard let result = result as? SNClassificationResult,
      let bestClassification = result.classifications.first else  { return }

Next, we get time information from timeRange:

let timeStart = result.timeRange.start.seconds

The last piece is putting this together into a meaningful message and providing it to the user:

print("Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s")

This is our observer:

final class AudioAnalysisObserver: NSObject, SNResultsObserving {
    func requestDidComplete(_ request: SNRequest) {
        print("Processing completed!")
    }
    
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let bestClassification = result.classifications.first else  { return }
        let timeStart = result.timeRange.start.seconds
        
        print("Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s")
    }
    
    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Failed with \(error)")
    }
}

Finally, we can add our request to the analyzer:

let resultsObserver = AudioAnalysisObserver()
try audioFileAnalyzer.add(soundClassifyRequest, withObserver: resultsObserver)

The last part is to start the analysis:

audioFileAnalyzer.analyze()

Please remember that analyze() executes synchronously, so you shouldn't call it from the main thread or it will block the application. More info in the documentation.

I'm working in a playground, so synchronous is fine. But in an application, it is a no-go.
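In an application, dispatching the call to a background queue is the simplest way to keep the main thread free. A minimal sketch:

// Run the blocking analyze() call off the main thread
DispatchQueue.global(qos: .userInitiated).async {
    audioFileAnalyzer.analyze()
}

There is also an asynchronous analyze(completionHandler:) variant on SNAudioFileAnalyzer if you prefer a callback over managing the queue yourself.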

Additionally:

You can run the same sound analysis request on multiple file analyzers, and each analyzer can process multiple requests.

More in the documentation.
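A quick sketch of what that means in practice (secondFileUrl and anotherRequest are hypothetical, just to illustrate the point):

// The same request can feed a second analyzer...
let secondAnalyzer = try SNAudioFileAnalyzer(url: secondFileUrl)
try secondAnalyzer.add(soundClassifyRequest, withObserver: resultsObserver)

// ...and one analyzer can process more than one request
try audioFileAnalyzer.add(anotherRequest, withObserver: resultsObserver)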

But here I am talking about threads and other details while we are all waiting for the results. You can find the audio file here, and here are the results we received:

Found water at 61% at 0.0s
Found water at 55% at 1.5s
Found water at 59% at 3.0s
Found water at 50% at 4.5s
Found rain at 55% at 6.0s
Found water at 59% at 7.5s
Found water at 55% at 9.0s
Found water at 50% at 10.5s
Found water at 56% at 12.0s
Found water at 53% at 13.5s
Found water at 53% at 15.0s
Found water at 61% at 16.5s
Found water at 53% at 18.0s
Found water at 48% at 19.5s
Found rain at 71% at 21.0s
Found thunderstorm at 71% at 22.5s
Found water at 59% at 24.0s
Found rain at 61% at 25.5s
Found rain at 63% at 27.0s
Found water at 62% at 28.5s
Found water at 65% at 30.0s
Found water at 57% at 31.5s
Found water at 58% at 33.0s
Found water at 49% at 34.5s
Found water at 51% at 36.0s
Found water at 57% at 37.5s
Found rain at 54% at 39.0s
Found thunder at 83% at 40.5s
Found thunderstorm at 67% at 42.0s
Found thunder at 71% at 43.5s
Found thunder at 80% at 45.0s
Found thunder at 78% at 46.5s
Found thunder at 93% at 48.0s
Found thunderstorm at 81% at 49.5s
Found rain at 65% at 51.0s
Found rain at 61% at 52.5s
Found water at 57% at 54.0s
Found water at 49% at 55.5s
Found water at 58% at 57.0s
Found water at 62% at 58.5s
Found water at 56% at 60.0s
Found water at 59% at 61.5s
Found water at 58% at 63.0s
Found water at 53% at 64.5s
Found water at 59% at 66.0s
Found water at 54% at 67.5s
Processing completed!

Moments ago we didn't know anything about the file, and now we can make an educated guess that it contains a recording of a storm. This is a huge leap forward which opens up a lot of new possibilities. Generating tags for files so users can filter them is just the beginning.
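As a taste of that, here is a sketch of an observer that turns results like the ones above into a set of tags. The 0.6 confidence threshold is an arbitrary choice:

final class TagCollectingObserver: NSObject, SNResultsObserving {
    private(set) var tags = Set<String>()
    
    func request(_ request: SNRequest, didProduce result: SNResult) {
        // Keep only confident matches and collect their identifiers
        guard let result = result as? SNClassificationResult,
              let best = result.classifications.first,
              best.confidence > 0.6 else { return }
        tags.insert(best.identifier)
    }
    
    func requestDidComplete(_ request: SNRequest) {
        // For the storm recording this would print something like:
        // Tags: ["water", "rain", "thunder", "thunderstorm"]
        print("Tags: \(tags)")
    }
}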

You can find the code below. Remember to add a sound file for the analysis:

import UIKit
import SoundAnalysis

/*
 
 Uncomment if you want to see the list of identifiers
 
(try? SNClassifySoundRequest(classifierIdentifier: SNClassifierIdentifier.version1))?.knownClassifications
    .enumerated()
    .forEach { index, identifier in print("\(index). \(identifier)") }
*/

final class AudioAnalysisObserver: NSObject, SNResultsObserving {
    func requestDidComplete(_ request: SNRequest) {
        print("Processing completed!")
    }
    
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let bestClassification = result.classifications.first else  { return }
        let timeStart = result.timeRange.start.seconds
        
        print("Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s")
    }
    
    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Failed with \(error)")
    }
}

enum FileError: Error {
    case notFound
}

do {
    guard let fileUrl = Bundle.main.url(forResource: "storm", withExtension: "mp3") else { throw FileError.notFound }
    let soundClassifyRequest = try SNClassifySoundRequest(classifierIdentifier: SNClassifierIdentifier.version1)

    let audioFileAnalyzer = try SNAudioFileAnalyzer(url: fileUrl)
    let resultsObserver = AudioAnalysisObserver()
    try audioFileAnalyzer.add(soundClassifyRequest, withObserver: resultsObserver)
    
    audioFileAnalyzer.analyze()
} catch {
    print("Something went terribly wrong!")
}

You can find the code here. I added it as a part of my vision demo application since it has the potential to make the analysis of real-time video and sound even better.

⚠️ If you are using an M1 Mac and run Xcode using Rosetta, playgrounds won't work.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or tweet to @tustanowskik

If you want to be up to date and always be the first to know what I'm working on, tap follow @tustanowskik on Twitter

Thank you for reading!

This article was featured in SwiftLee Weekly #88 and Awesome Swift Weekly #283 🎉

If you want to help me stay on my feet during the nights when I'm working on my blog - now you can.
