Image classification using the Vision framework

Photo by Erol Ahmed / Unsplash

For the last couple of weeks, I have been presenting awesome features of the Vision framework and simple ways of visualizing their results on images. They were, I hope, visually appealing and interesting.

In today's article, I won't be drawing anything on the image. No overlays, rectangles, lines, or text. I know this might be disappointing, but please don't give up on reading.

The request I want to talk about today might seem modest compared to the ones I presented previously. It returns an array of strings and nothing more.

But those strings describe the contents of an image.

It's fast, it's simple, and it can change the way you handle images in your application: how you classify them, and what you can let users do with them.

This is the same image I used in the saliency detection article. Last week we were able to detect where users will focus their attention. This week I will show you how to understand what is in the image.

Keep reading if you find getting from here:

To here, interesting:

outdoor - 99%
land - 98%
liquid - 98%
water - 98%
water_body - 98%
waterways - 98%
waterfall - 98%
sky - 82%
cloudy - 82%

VNClassifyImageRequest is the request used to classify images. Unlike the other requests I presented, this one doesn't provide any points or rectangles. It generates an array of identifiers with associated levels of confidence.

Let's create the request:

let request = VNClassifyImageRequest()

And check what can be identified:

// iOS 15 and up
let supportedIdentifiers = try? request.supportedIdentifiers()

Note: The supportedIdentifiers() method is available starting with iOS 15. On earlier versions, use knownClassifications(forRevision:), which returns VNClassificationObservation objects rather than plain strings:

// Below iOS 15
let supportedIdentifiers = try? VNClassifyImageRequest.knownClassifications(forRevision: VNClassifyImageRequestRevision1)

You can find a complete list of 1303 (VNClassifyImageRequestRevision1) supported identifiers here.
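
For example, you can use that list to check up front how many labels the model knows and whether a specific one, like "waterfall", is on it. A minimal sketch, assuming the supportedIdentifiers fetched above on iOS 15:

// supportedIdentifiers is the [String]? returned by request.supportedIdentifiers() above
if let identifiers = supportedIdentifiers {
    print("The model knows \(identifiers.count) identifiers") // 1303 for revision 1
    print("Waterfall is supported: \(identifiers.contains("waterfall"))")
}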

The next step is running the request on a selected image:

let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                           orientation: .init(image.imageOrientation),
                                           options: [:])

DispatchQueue.global(qos: .userInitiated).async {
    do {
        try requestHandler.perform([request])
    } catch {
        print("Can't make the request due to \(error)")
    }
}

Please check my Detecting body pose using Vision framework article if you need more information on running the requests.

And getting the results:

guard let results = request.results as? [VNClassificationObservation] else { return }

Note: In iOS 15 and above we don't need to cast the results anymore. They come back with the correct type instead of an array of Any.
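
On iOS 15 and above the guard can therefore look like this, with no cast needed (a small sketch):

// iOS 15 and up: results is already typed as [VNClassificationObservation]?
guard let results = request.results else { return }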

In VNClassificationObservation we are interested in:

  • The identifier, which holds information about the contents of the image.
  • The confidence, which contains a value from 0.0 to 1.0 describing how certain Vision is about this observation.

Let's see what Vision has to say about our image:

results
    .forEach { print("\($0.identifier) - \((Int($0.confidence * 100)))%") }

The result:

outdoor - 99%
land - 98%
liquid - 98%
water - 98%
water_body - 98%
waterways - 98%
waterfall - 98%
sky - 82%
cloudy - 82%
structure - 28%
rocks - 28%
hill - 16%
rainbow - 13%
mountain - 12%
river - 7%
cliff - 7%
grass - 7%
blue_sky - 4%
canyon - 4%
creek - 4%
sunset_sunrise - 2%
plant - 2%
moss - 2%
shrub - 1%
sun - 1%
foliage - 0%
painting - 0%
bridge - 0%
forest - 0%
/* A lot of other identifiers */
xylophone - 0%
yacht - 0%
yarn - 0%
yoga - 0%
yogurt - 0%
yolk - 0%
zebra - 0%
zoo - 0%
zucchini - 0%

The total count of the results is 1303, the same as the number of identifiers supported by the machine learning model used for this request.
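
If you want to double-check that, you can compare the two counts. A quick sketch, assuming iOS 15 and the request and results from above:

// One observation per identifier the model knows.
if let supported = try? request.supportedIdentifiers() {
    print("observations: \(results.count), supported identifiers: \(supported.count)")
}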

This is where confidence shines. We filter the results to keep only identifiers with confidence higher than 70%:

results
    .filter { $0.confidence > 0.7 }
    .forEach { print("\($0.identifier) - \((Int($0.confidence * 100)))%") }

This gives us the list I presented at the beginning of the article:

outdoor - 99%
land - 98%
liquid - 98%
water - 98%
water_body - 98%
waterways - 98%
waterfall - 98%
sky - 82%
cloudy - 82%

This is the whole code needed to get these results:

func process(_ image: UIImage) {
    guard let cgImage = image.cgImage else { return }
    let request = VNClassifyImageRequest()
    
    let requestHandler = VNImageRequestHandler(cgImage: cgImage,
                                               orientation: .init(image.imageOrientation),
                                               options: [:])
    
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try requestHandler.perform([request])
        } catch {
            print("Can't make the request due to \(error)")
        }
        
        guard let results = request.results as? [VNClassificationObservation] else { return }
        
        results
            .filter { $0.confidence > 0.7 }
            .forEach { print("\($0.identifier) - \((Int($0.confidence * 100)))%") }
    }
}

Imagine what happens when you classify all the images in your application and allow users to filter them by content. If someone wants to find the most beautiful sky, it's done.
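
As a rough sketch of that idea (the matches(_:in:threshold:) helper, the classifications cache, and the 0.7 threshold are my own hypothetical choices, not something taken from the demo app):

import Vision

// Hypothetical helper: true when the observations contain searchTerm
// with confidence above the given threshold.
func matches(_ searchTerm: String,
             in observations: [VNClassificationObservation],
             threshold: VNConfidence = 0.7) -> Bool {
    observations.contains { $0.identifier == searchTerm && $0.confidence > threshold }
}

// Hypothetical cache of observations per photo name, filled by code like process(_:) above.
var classifications: [String: [VNClassificationObservation]] = [:]

// Keep only the photos whose observations mention "sky".
let skyPhotos = classifications.filter { matches("sky", in: $0.value) }.map(\.key)

Once the observations are stored somewhere, the search itself is just a filter.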

And this is just the beginning.

If you want to play with Vision and see it for yourself you can check the latest version of my vision demo application here. The example code is located in this file.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or tweet to @tustanowskik.

If you want to be up to date and always be the first to know what I'm working on, tap follow @tustanowskik on Twitter.

Thank you for reading!

Kamil Tustanowski
