Detecting body pose, hand pose, and face landmarks using Vision framework

Original photo by Matheus Ferrero on Unsplash

In my last post about detecting body pose using Vision, I explained how to make the requests and how to interpret them. My main interest in Vision is movement analysis and motion tracking, so this time I will show you how to make our analysis more complete. I will introduce two new requests:

  • Hand pose detection, which will allow us to understand where the fingers are located in the image.
  • Face landmarks detection, which will help us, among other things, understand the mood of the person in the image.

In order to display the data in a more meaningful way, I will refactor the rendering code from the previous article. Dots are good for a start, but lines are way better for what we need to achieve.

When we are done, our applications will know much more about the contents of the pictures:

Original photo by Jon Ly on Unsplash

Interested?

The Future

When we see the picture above, our brain immediately analyzes and recognizes its contents. It's clear as day to us that there is a girl walking in the park. We polish this skill throughout our lives. It's natural to us, and we don't think about how complicated the "internal implementation" in our brain is.

Our applications, on the other hand, see no more than a bunch of ones and zeros there. The only thing they can do is order these ones and zeros properly and display them in the form of an image to the user.

A picture is worth a thousand words

But not for our applications. At least this is how it used to be.

Detect body pose

Last time I presented a way to make our applications "smarter". We have VNDetectHumanBodyPoseRequest implemented and working in the application, which means we can detect body joints. Thanks to this we not only know where the person in a photo is located, but we also know what this person is doing, or at least we can make assumptions.

Body pose detection is done, but let's change how the detected joints are rendered. We can do better than dots. Joints have their positions, but that is not all. We know, from anatomy, which joints are connected, therefore we can draw lines between them.

The rendering code will change quite a bit, but it's simple and I don't want to focus on it.

Let's refactor the code and change bodyPoseRequest to a requests array. It will be useful later on:

let requests = [VNDetectHumanBodyPoseRequest()]

We pass the array to the request handler:

try requestHandler.perform(requests)
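For context, here is a minimal sketch of how these pieces fit together, assuming the photo comes in as a UIImage and the work is dispatched off the main thread; the exact setup in the demo application may differ slightly:

import UIKit
import Vision

func process(_ image: UIImage) {
    guard let cgImage = image.cgImage else { return }
    let requests = [VNDetectHumanBodyPoseRequest()]
    let requestHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])

    // Vision requests can take a moment, so keep them off the main thread.
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try requestHandler.perform(requests)
            // Each request now holds its detections in its `results` property.
        } catch {
            print("Vision request failed: \(error)")
        }
    }
}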

Now it's time to update the rendering code. First, let's think about what we need:

  • We need to draw a line between connected joints, e.g. the wrist and the elbow.
  • We need to draw the torso, which, unlike the arms and legs, is "closed", meaning the first and last points should be connected by a line.

This could be achieved in a simpler way, but since I know what will be needed in a few moments, I want to be ready:

extension UIImage {
    func draw(openPaths: [[CGPoint]]? = nil,
              closedPaths: [[CGPoint]]? = nil,
              points: [CGPoint]? = nil,
              fillColor: UIColor = .white,
              strokeColor: UIColor = .primary,
              radius: CGFloat = 5,
              lineWidth: CGFloat = 2) -> UIImage? {
        let scale: CGFloat = 0
        UIGraphicsBeginImageContextWithOptions(size, false, scale)
        draw(at: CGPoint.zero)
        
        points?.forEach { point in
            let path = UIBezierPath(arcCenter: point,
                                    radius: radius,
                                    startAngle: CGFloat(0),
                                    endAngle: CGFloat(Double.pi * 2),
                                    clockwise: true)
            
            fillColor.setFill()
            strokeColor.setStroke()
            path.lineWidth = lineWidth
            
            path.fill()
            path.stroke()
        }

        openPaths?.forEach { points in
            draw(points: points, isClosed: false, color: strokeColor, lineWidth: lineWidth)
        }

        closedPaths?.forEach { points in
            draw(points: points, isClosed: true, color: strokeColor, lineWidth: lineWidth)
        }

        let newImage = UIGraphicsGetImageFromCurrentImageContext()
        
        UIGraphicsEndImageContext()
        return newImage
    }
    
    private func draw(points: [CGPoint], isClosed: Bool, color: UIColor, lineWidth: CGFloat) {
        let bezierPath = UIBezierPath()
        bezierPath.drawLinePath(for: points, isClosed: isClosed)
        color.setStroke()
        bezierPath.lineWidth = lineWidth
        bezierPath.stroke()
    }
}

extension UIBezierPath {
    func drawLinePath(for points: [CGPoint], isClosed: Bool) {
        // Start at the first point and connect every following point with a line.
        points.enumerated().forEach { index, point in
            if index == 0 {
                move(to: point)
            } else {
                addLine(to: point)
            }
        }
        
        // For closed paths, connect the last point back to the first one.
        guard isClosed, points.count > 1, let firstPoint = points.first else { return }
        addLine(to: firstPoint)
    }
}

There are a few lines, but it's dead simple. The main change is that our draw function is now able to draw more than just points. It can draw arms and legs using point arrays passed as "open" paths, and it can draw the torso using "closed" paths. The only difference is that closed paths automatically draw a line from the last point back to the first.
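To make the new API concrete, here is a quick usage sketch with made-up coordinates; image is the UIImage we are annotating, and note that the default strokeColor, .primary, is a custom color defined in the demo project:

// Hypothetical points in UIKit coordinates, for illustration only.
let arm = [CGPoint(x: 40, y: 120), CGPoint(x: 80, y: 160), CGPoint(x: 120, y: 150)]
let torso = [CGPoint(x: 100, y: 100), CGPoint(x: 200, y: 100),
             CGPoint(x: 210, y: 220), CGPoint(x: 90, y: 220)]
let nose = CGPoint(x: 150, y: 80)

let annotatedImage = image.draw(openPaths: [arm],      // drawn as a polyline
                                closedPaths: [torso],  // last point connects back to the first
                                points: [nose])        // drawn as a dot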

The next step is to make our request return points in a way we can draw them. First, we need to make a protocol. Again, this might look like overkill at this point, but it's advice from future me:

protocol ResultPointsProviding {
    func pointsProjected(onto image: UIImage) -> [CGPoint]
    func openPointGroups(projectedOnto image: UIImage) -> [[CGPoint]]
    func closedPointGroups(projectedOnto image: UIImage) -> [[CGPoint]]
}

This protocol, when implemented, will make sure we provide all the data needed to draw the results. VNDetectHumanBodyPoseRequest holds the results and therefore it's a good candidate to implement this protocol:

extension VNDetectHumanBodyPoseRequest: ResultPointsProviding {
    func pointsProjected(onto image: UIImage) -> [CGPoint] {
        point(jointGroups: [[.nose, .leftEye, .leftEar, .rightEye, .rightEar]], projectedOnto: image).flatMap { $0 }
    }
    
    func closedPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] {
        point(jointGroups: [[.neck, .leftShoulder, .leftHip, .root, .rightHip, .rightShoulder]], projectedOnto: image)
    }
    
    func openPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] {
        point(jointGroups: [[.leftShoulder, .leftElbow, .leftWrist],
                            [.rightShoulder, .rightElbow, .rightWrist],
                            [.leftHip, .leftKnee, .leftAnkle],
                            [.rightHip, .rightKnee, .rightAnkle]], projectedOnto: image)
    }
    
    func point(jointGroups: [[VNHumanBodyPoseObservation.JointName]], projectedOnto image: UIImage) -> [[CGPoint]] {
        guard let results = results else { return [] }
        let pointGroups = results.map { result in
            jointGroups
                .compactMap { joints in
                    joints.compactMap { joint in
                        try? result.recognizedPoint(joint)
                    }
                    .filter { $0.confidence > 0.1 }
                    .map { $0.location(in: image) }
                    .map { $0.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height) }
                }
        }
        
        return pointGroups.flatMap { $0 }
    }
}

This got more complicated than the last time we saw it, but if you look closely it's nothing fancy. We have one main method that produces the point groups we want to draw and returns an array of arrays of points. Sadly, we couldn't use Vision's built-in joint groups because their points aren't ordered in any way, and since our rendering code relies on the ordering of the points, we had to define it ourselves. The pointsProjected(onto:) function returns the points for the face:

[.nose, .leftEye, .leftEar, .rightEye, .rightEar]

Closed point groups return points for the torso:

[.neck, .leftShoulder, .leftHip, .root, .rightHip, .rightShoulder]

Open point groups return points for the arms and legs:

[.leftShoulder, .leftElbow, .leftWrist],
[.rightShoulder, .rightElbow, .rightWrist],
[.leftHip, .leftKnee, .leftAnkle],
[.rightHip, .rightKnee, .rightAnkle]

You can find more information in the documentation.
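For comparison, this is roughly what Vision's built-in joint groups hand us; the result is a dictionary, so there is no point ordering we could feed straight into the line-drawing code (bodyPoseRequest here stands in for our body pose request):

if let observation = bodyPoseRequest.results?.first,
   let torsoPoints = try? observation.recognizedPoints(.torso) {
    // A [JointName: VNRecognizedPoint] dictionary in no particular order.
    print(torsoPoints.keys)
}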

With this out of the way we can prepare points and point groups for our drawing function:

let resultPointsProviders = requests.compactMap { $0 as? ResultPointsProviding }

let openPointsGroups = resultPointsProviders
    .flatMap { $0.openPointGroups(projectedOnto: image) }

let closedPointsGroups = resultPointsProviders
    .flatMap { $0.closedPointGroups(projectedOnto: image) }

let points = resultPointsProviders
    .flatMap { $0.pointsProjected(onto: image) }

With the protocol in place, it will be super simple to add other requests and render the results.

The last piece is updating the draw function:

self.imageView.image = image.draw(openPaths: openPointsGroups,
                                  closedPaths: closedPointsGroups,
                                  points: points)

I know this was a lot to take in, but with all of this code ready and waiting, it will be much simpler to introduce new requests. Let's run the application and see what happens:

Please excuse this ruse. This code should, and will, generate the image with an overlay, but I don't want you to see the image yet. That's why I altered the code to present a gray background instead.

I would like to demonstrate what a difference Vision makes when it comes to understanding images.

Think of the image as a gray plane of nothing. This is what our applications know about the image in terms of content. Let's pretend we "see" the same thing the application would "see".

We applied the body pose request and received the image above. It's much better than gray nothingness, but is it enough?

The torso is located in the middle of the screen, which means we are facing this person. It's big, which means we are standing close. The hands are raised but not above the head. Thanks to the face points, we know that the person is looking at us and is not turned away.

This is what we know.

But there is a lot we don't. Is this a fighting stance, or is this person surrendering? We don't know, and the matter is serious. Should we flee, advance, or neither? It's hard to reason about without more data. If we knew whether the hands are open or clenched, it would be much easier to decide what to do.

Detect hand pose

VNDetectHumanHandPoseRequest is what we need now. This Vision request detects hands, two by default, but this can be changed by altering the maximumHandCount property (a small example of that follows the array below). To enable this request we need to do two things. First, add the request to the requests array:

let requests = [VNDetectHumanBodyPoseRequest(),
                VNDetectHumanHandPoseRequest()]
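As promised, if you ever expect more than two hands in the frame, tuning the limit is a one-liner (not needed for our photo):

let handPoseRequest = VNDetectHumanHandPoseRequest()
// The default is 2; raising it lets Vision report additional hands.
handPoseRequest.maximumHandCount = 4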

And make sure the results are properly formatted for our rendering code:

extension VNDetectHumanHandPoseRequest: ResultPointsProviding {
    func pointsProjected(onto image: UIImage) -> [CGPoint] { [] }
    func closedPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] { [] }
    func openPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] {
        point(jointGroups: [[.wrist, .indexMCP, .indexPIP, .indexDIP, .indexTip],
                            [.wrist, .littleMCP, .littlePIP, .littleDIP, .littleTip],
                            [.wrist, .middleMCP, .middlePIP, .middleDIP, .middleTip],
                            [.wrist, .ringMCP, .ringPIP, .ringDIP, .ringTip],
                            [.wrist, .thumbCMC, .thumbMP, .thumbIP, .thumbTip]],
                            projectedOnto: image)
    }
    
    func point(jointGroups: [[VNHumanHandPoseObservation.JointName]], projectedOnto image: UIImage) -> [[CGPoint]] {
        guard let results = results else { return [] }
        let pointGroups = results.map { result in
            jointGroups
                .compactMap { joints in
                    joints.compactMap { joint in
                        try? result.recognizedPoint(joint)
                    }
                    .filter { $0.confidence > 0.1 }
                    .map { $0.location(in: image) }
                    .map { $0.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height) }
                }
        }
        
        return pointGroups.flatMap { $0 }
    }
}

It's similar to the code we used to handle results for the body pose request. The difference is that we don't need any standalone points, and all the paths, consisting of finger-joint points, are open:

[.wrist, .indexMCP, .indexPIP, .indexDIP, .indexTip],
[.wrist, .littleMCP, .littlePIP, .littleDIP, .littleTip],
[.wrist, .middleMCP, .middlePIP, .middleDIP, .middleTip],
[.wrist, .ringMCP, .ringPIP, .ringDIP, .ringTip],
[.wrist, .thumbCMC, .thumbMP, .thumbIP, .thumbTip]

Worth noting is the fact that the thumb uses a different joint naming convention (CMC, MP, IP, tip) than the other fingers (MCP, PIP, DIP, tip).

I recommend this WWDC video if you want to know more about body and hand pose requests.

Let's run the application and see what's changed:

Uh oh! This is looking serious! This person is not capitulating. If anything, it looks like the opposite. But what if we are wrong? This person may be exercising or making a joke.

It would be awesome if we could see the face.

Detecting face landmarks

In order to detect a face or faces, we need to use VNDetectFaceLandmarksRequest:

let requests = [VNDetectHumanBodyPoseRequest(),
                VNDetectHumanHandPoseRequest(),
                VNDetectFaceLandmarksRequest()]

And, again, make sure the results are properly provided to the rendering code:

extension VNDetectFaceLandmarksRequest: ResultPointsProviding {
    func pointsProjected(onto image: UIImage) -> [CGPoint] { [] }
    
    func openPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] {
        guard let results = results as? [VNFaceObservation] else { return [] }
        let landmarks = results.compactMap { [$0.landmarks?.leftEyebrow,
                                              $0.landmarks?.rightEyebrow,
                                              $0.landmarks?.faceContour,
                                              $0.landmarks?.noseCrest,
                                              $0.landmarks?.medianLine].compactMap { $0 } }
        
        return points(landmarks: landmarks, projectedOnto: image)
    }
    
    func closedPointGroups(projectedOnto image: UIImage) -> [[CGPoint]] {
        guard let results = results as? [VNFaceObservation] else { return [] }
        let landmarks = results.compactMap { [$0.landmarks?.leftEye,
                                              $0.landmarks?.rightEye,
                                              $0.landmarks?.outerLips,
                                              $0.landmarks?.innerLips,
                                              $0.landmarks?.nose].compactMap { $0 } }
        
        return points(landmarks: landmarks, projectedOnto: image)
    }
    
    func points(landmarks: [[VNFaceLandmarkRegion2D]], projectedOnto image: UIImage) -> [[CGPoint]] {
        let faceLandmarks = landmarks.flatMap { $0 }
            .compactMap { landmark in
                landmark.pointsInImage(imageSize: image.size)
                    .map { $0.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height) }
            }
        
        return faceLandmarks
    }
}

This request is structured differently from the body and hand requests. I assume it's because it has been available since iOS 11.0 and macOS 10.13, which means it's older than the other two.
The first problem is that we need to cast the results from [Any]? to [VNFaceObservation]:

guard let results = results as? [VNFaceObservation] else { return [] }

VNFaceObservation provides roll, yaw, faceCaptureQuality, and a landmarks property of type VNFaceLandmarks2D, which gives us access to the detected points. Each group has its own variable, e.g. leftEye, nose, innerLips, faceContour, and so on.
Each landmark variable returns a VNFaceLandmarkRegion2D, which can give us normalizedPoints (these still need conversion), or we can use pointsInImage(imageSize: CGSize) to get points that are already projected into image coordinates and ready to use.
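To make the difference concrete, here is a tiny sketch; faceObservation is a hypothetical VNFaceObservation and image is the UIImage we are working with:

if let outerLips = faceObservation.landmarks?.outerLips {
    // Normalized to the face bounding box; these still need conversion before drawing.
    let normalized = outerLips.normalizedPoints
    // Already projected into image coordinates; this is what we use below.
    let projected = outerLips.pointsInImage(imageSize: image.size)
    print(normalized.count, projected.count)
}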

We don't need any standalone points, but this time there are more closed paths than before. Displaying these landmarks is the main reason for adding support for both open and closed paths. We use closed paths for:

[$0.landmarks?.leftEye,
 $0.landmarks?.rightEye,
 $0.landmarks?.outerLips,
 $0.landmarks?.innerLips,
 $0.landmarks?.nose]

And open paths for:

[$0.landmarks?.leftEyebrow,
 $0.landmarks?.rightEyebrow,
 $0.landmarks?.faceContour,
 $0.landmarks?.noseCrest,
 $0.landmarks?.medianLine]

Before we run the application for the last time, let's make sure that when face landmarks are detected, the body pose face points won't be rendered:

var points: [CGPoint]?
let isDetectingFaceLandmarks = requests.contains { ($0 as? VNDetectFaceLandmarksRequest)?.results?.isEmpty == false }

points = resultPointsProviders
    .filter { !isDetectingFaceLandmarks || !($0 is VNDetectHumanBodyPoseRequest) }
    .flatMap { $0.pointsProjected(onto: image) }

Thanks to this, the body pose face points won't overlap with the face landmarks in the image.

It's time to get the final piece of the puzzle:

This person isn't looking at us at all. In fact, they aren't looking at anything, because both eyes are closed. The face is tilted upwards and the mouth is open. There is a high chance the person is laughing or yawning. Either way, we should be safe.

If you want to see the image we were analyzing, check the link below:
Original photo by bruce mars on Unsplash

I hope this was interesting and entertaining. Vision has more nice features, and a real-time video processing article is just around the corner (or two).

If you want to play with Vision and see it for yourself, you can check out the latest version of my vision demo application here. If you want to check the application that was used in this article, check out version 0.2.0.

If you have any feedback, or just want to say hi, you are more than welcome to write me an e-mail or drop me a line on Twitter.

Thank you for reading!

This article was featured in Awesome Swift Weekly #274 🎉

Kamil Tustanowski

I'm an iOS developer dinosaur who remembers the times when Objective-C was "the only way", we did memory management by hand, and whole iPhones were smaller than the screens in current models.
Poland