Tracking a Face in 3D Space with Recolude

Eli Davis
Jan 18, 2022

Updated: Jan 19, 2022

Eli Davis - Founder

All source code can be found at github.com/recolude/landmark-recordings

Intro

Here at Recolude, we’re always thinking of ways to push the boundaries of our services. Recently I spent some time exploring a non-traditional use case for us: face tracking.

To start, I took a week to build a better proof of concept for recording the output of a machine learning model. I wasn’t happy with the results from the Pigo library I had tried in the past, and after some searching I came across MediaPipe. MediaPipe is a fast, easy-to-use machine learning library that converts human landmarks in an image into 3D coordinates. It ships with plenty of pre-trained models you can use out of the box, and with only a few lines of Python you can start playing around with it. The model I’m going to focus on in this article is called Face Mesh. It tracks 478 landmarks on a face in 3D space, and it runs fast enough to work in real time. This model absolutely wipes the floor with Pigo, providing lip and eyebrow tracking as well as much better pupil tracking.

Building a Recolude Recording (RAP)

To begin, we use FFmpeg to obtain all the frames from a video. Given a video named “in.mp4”, we extract all the images into the folder “frames” (which must already exist, since FFmpeg does not create it) with the command:

ffmpeg -i in.mp4 frames/frame_%04d.png -hide_banner

Once all the frames have been extracted, we can begin processing them with MediaPipe’s Python API to extract the 3D information. To get started, you need to install 64-bit Python 3.9. For me, the 64-bit part is important, as I and others had issues installing MediaPipe on older 32-bit versions. Once installed, we just need to loop through every face within every frame found in the frames directory. As we examine the faces, we build a JSON file storing the X, Y, and Z coordinates of the different landmarks. When initializing Face Mesh, be sure to set static_image_mode to False and to raise max_num_faces to the number of faces that appear in the video. Overall, this task can be accomplished with 60 lines of Python.
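The loop described above can be sketched roughly as follows. This is a minimal sketch, not the script from the repository: the JSON layout (one flat list of records with frame, face, and landmark indices), the output filename, and the helper names landmark_records and process_frames are all assumptions made for illustration. It also assumes the opencv-python and mediapipe packages are installed and that the frames were written with the FFmpeg command shown earlier.

```python
import json
from pathlib import Path


def landmark_records(frame_index, faces):
    """Flatten one frame's faces into JSON-ready dictionaries.

    `faces` is a list of faces; each face is a sequence of points that
    expose x, y and z attributes (as MediaPipe landmarks do).
    The record layout is hypothetical, chosen only for this sketch.
    """
    records = []
    for face_index, landmarks in enumerate(faces):
        for landmark_index, point in enumerate(landmarks):
            records.append({
                "frame": frame_index,
                "face": face_index,
                "landmark": landmark_index,
                "x": point.x,
                "y": point.y,
                "z": point.z,
            })
    return records


def process_frames(frames_dir="frames", out_path="landmarks.json", max_faces=1):
    """Run Face Mesh over every extracted frame and write one JSON file."""
    # Imported here so the helper above can be used without these packages.
    import cv2
    import mediapipe as mp

    records = []
    with mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False,    # treat the frames as a video stream
        max_num_faces=max_faces,    # raise to the number of faces in the video
        refine_landmarks=True,      # adds the iris landmarks
    ) as face_mesh:
        frame_paths = sorted(Path(frames_dir).glob("frame_*.png"))
        for frame_index, path in enumerate(frame_paths):
            image = cv2.imread(str(path))
            if image is None:
                continue  # unreadable frame, skip it
            # OpenCV loads BGR, MediaPipe expects RGB.
            results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            faces = [face.landmark for face in (results.multi_face_landmarks or [])]
            records.extend(landmark_records(frame_index, faces))

    Path(out_path).write_text(json.dumps(records))
    return records
```

Calling process_frames("frames", "landmarks.json", max_faces=1) would then produce the JSON file that the next step reads in.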

Once we have generated a JSON file, it’s just a matter of reading it in and building a RAP file from it. To build the RAP, we’re going to be using the open-source library provided by Recolude that anyone can use to build recordings programmatically. We start by defining a LandMark struct for unmarshaling our JSON.

Once we have our LandMark struct, it’s really just a matter of loading in the JSON, traversing the file while creating Recolude vector captures, and then writing the captures to a RAP. Because the distance between consecutive positions of a given landmark stays within a very small range, we can use the Oct24 positional encoding algorithm to store our data without any detriment to its quality. Oct24 encoding works by examining the positional deltas a landmark makes as it travels through a recording. The encoder then takes advantage of these deltas so that it ends up using only 24 bits per position capture, as opposed to the traditional 192 bits (for example, three 64-bit floats). We will also use BST16 for encoding times, which uses a similar strategy to Oct24. For each landmark, we can choose to render it as a small sphere to help us visualize the face by setting a few metadata properties. Overall, building the recording takes a little more than 100 lines of Go code.

Mesh Tessellation

Ironically, the hardest part of the whole project was properly tessellating a 3D mesh. The challenge at hand is taking the unordered collection of line segments MediaPipe provides and transforming it into a proper geometry where all triangles face the “correct” direction. To best explain the algorithm used, let’s take a look at an example collection of line segments.

The basic strategy here is to perform a depth-first search with a max search depth of 3 in an attempt to build triangles. If at the end of the search we’re back at the original vertex, then we’ve successfully made a triangle, and we can add it to our list of triangles. After fully exploring a given vertex’s connections, we mark it complete. When a vertex is marked complete, all other vertices have their connection to it dropped. We drop these segments so that we don’t end up producing duplicate triangles with different winding orders.

As we explore a given vertex’s connections, we add every visited vertex to a queue to be processed next. The algorithm runs until the queue is empty.
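To make the steps concrete, here is a minimal sketch of the strategy, assuming the segments arrive as pairs of integer vertex indices. It is not the code from the repository: the function name triangles_from_segments is made up for this example, and the three-step search is written as "pick two neighbors of the current vertex and check that they are connected to each other", which finds the same closed walks of length three.

```python
from collections import deque


def triangles_from_segments(segments):
    """Build triangles from an unordered collection of line segments.

    `segments` is an iterable of (a, b) vertex-index pairs. Winding order
    of the returned triangles is arbitrary and still needs fixing later.
    """
    # Undirected adjacency: vertex -> set of connected vertices.
    neighbors = {}
    for a, b in segments:
        if a == b:
            continue  # ignore degenerate segments
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triangles = []
    queued = set()
    # The outer loop restarts the search for disconnected pieces of the mesh.
    for start in sorted(neighbors):
        if start in queued:
            continue
        queue = deque([start])
        queued.add(start)
        while queue:
            vertex = queue.popleft()
            adjacent = sorted(neighbors[vertex])
            # Walk vertex -> first -> second -> vertex. If the last hop
            # exists, the three vertices form a triangle.
            for i, first in enumerate(adjacent):
                for second in adjacent[i + 1:]:
                    if second in neighbors[first]:
                        triangles.append((vertex, first, second))
            # Every visited vertex is queued to be processed next.
            for other in adjacent:
                if other not in queued:
                    queued.add(other)
                    queue.append(other)
            # Mark the vertex complete: drop every connection to it so the
            # same triangle is never produced again from another corner.
            for other in adjacent:
                neighbors[other].discard(vertex)
            neighbors[vertex].clear()
    return triangles
```

For a square split by one diagonal, triangles_from_segments([(0, 1), (1, 2), (2, 0), (0, 3), (2, 3)]) yields the two triangles that share the diagonal, each exactly once.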

Fixing Winding Order

Now that we have a fully tessellated mesh, we’re done, right? Well, when it comes to game engines, no! Generally, only one side of a triangle is rendered, a technique known as face culling. Because we blindly connected line segments to make triangles, some end up wound clockwise and others counter-clockwise.

We can figure out whether or not a given triangle gets culled based on the dot product between the triangle’s normal direction and the direction the camera is facing. For unit-length vectors, a dot product of positive one means the two vectors point in exactly the same direction, and a dot product of negative one means they point in exactly opposite directions. We can use the rendering engine’s forward direction to compute whether any given triangle is facing towards or away from the camera, and flip its winding order accordingly.
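A minimal sketch of this first pass, under a few assumptions: vertices are (x, y, z) tuples, front faces are counter-clockwise in a right-handed coordinate system, and `toward` is a vector pointing from the mesh towards the viewer. Which sign counts as "facing the camera" depends on the engine's conventions, so the comparison may need to be reversed. The function names are made up for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangle_normal(vertices, triangle):
    """Unnormalized normal; only its sign against `toward` matters here."""
    p0, p1, p2 = (vertices[i] for i in triangle)
    return cross(sub(p1, p0), sub(p2, p0))


def face_towards(vertices, triangles, toward):
    """Flip every triangle whose normal points away from `toward`."""
    fixed = []
    for a, b, c in triangles:
        if dot(triangle_normal(vertices, (a, b, c)), toward) < 0:
            fixed.append((a, c, b))  # swapping two indices reverses the winding
        else:
            fixed.append((a, b, c))
    return fixed
```

Because only the sign of the dot product is used, the normal does not need to be normalized for this check.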

This gets us most of the way to a correctly rendered mesh, and for some, this might be good enough! But this personally bugs the hell out of me, so there’s still some work to be done. The issue here is that some of the triangles truly are facing away from the general forward direction! The solution is to use the facing direction of a triangle’s neighbors to determine how the triangle itself should face.

We start by building a lookup table that maps each triangle to all of its neighbors. We can be lazy here and define two triangles as “neighbors” if they share a common vertex. We then initialize a queue with the triangle we want to use as our basis for the “forward” direction and begin processing. To process an item on our queue, we iterate through each of its neighbors. If a neighbor is a triangle that we’ve already processed, we skip it. Otherwise, we compute its normal direction and its inverted normal direction. If the dot product between the triangle’s normal and the neighbor’s normal is less than the dot product between the triangle’s normal and the neighbor’s inverted normal (in other words, if the two normals point away from each other), then we flip the neighbor’s vertices to change its facing direction. Each neighbor we process is added to the queue to be processed next. We continue to run until the queue is empty. The resulting mesh should have all the triangles facing the correct direction!
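The propagation step can be sketched like this, assuming vertices are (x, y, z) tuples, triangles are index triples, and `seed` is the index of a triangle already known to face the right way. The function name propagate_winding and its parameters are hypothetical and chosen for this example only.

```python
from collections import deque


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normal(vertices, triangle):
    p0, p1, p2 = (vertices[i] for i in triangle)
    return cross(sub(p1, p0), sub(p2, p0))


def propagate_winding(vertices, triangles, seed=0):
    """Make every triangle reachable from `seed` agree with its neighbors."""
    tris = [tuple(t) for t in triangles]

    # Lookup table: vertex -> triangles using it. Two triangles are
    # "neighbors" when they share at least one vertex.
    by_vertex = {}
    for index, tri in enumerate(tris):
        for v in tri:
            by_vertex.setdefault(v, []).append(index)

    processed = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        current_normal = normal(vertices, tris[current])
        for v in tris[current]:
            for other in by_vertex[v]:
                if other in processed:
                    continue  # already oriented, skip it
                other_normal = normal(vertices, tris[other])
                inverted = (-other_normal[0], -other_normal[1], -other_normal[2])
                # Flip when the inverted normal agrees better with ours.
                if dot(current_normal, other_normal) < dot(current_normal, inverted):
                    a, b, c = tris[other]
                    tris[other] = (a, c, b)
                processed.add(other)
                queue.append(other)
    return tris
```

Triangles that are not connected to the seed through shared vertices are left untouched by this sketch, so a mesh with several separate pieces would need one seed per piece.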

Configuring The Webplayer

To configure the Recolude webplayer to render our faces as it does in the screenshots, we just need to set certain metadata properties. The first property we’re going to set is “recolude-geom”, which we set to “none” on each of the face’s landmarks. We do this to prevent any actual geometry from being used for that specific landmark’s playback.

Next, we need to build a definition for an actual face. This definition will be of type “subject-as-vertices”. With this definition type, each vertex of our mesh references a child recording by its ID. The child’s position is then used to set the mesh’s vertex position, so that as the child moves throughout the playback, so too do the mesh’s vertices. The final property we set for this definition is the “tris” property. It is a string array in which every 3 consecutive elements are the IDs of the child recordings that together make one triangle.
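As an illustration of the “tris” layout, the sketch below flattens a list of triangles into such a string array. Only the flat, three-IDs-per-triangle shape comes from the description above. The helper name build_tris_property and the child_ids mapping from landmark index to child recording ID are assumptions, and how the array is attached to the recording’s metadata is not shown.

```python
def build_tris_property(triangles, child_ids):
    """Flatten triangles into a flat list of child recording IDs.

    `triangles` holds landmark-index triples, and `child_ids` maps each
    landmark index to the ID of the child recording for that landmark
    (a hypothetical mapping for this sketch).
    """
    tris = []
    for triangle in triangles:
        for landmark_index in triangle:
            tris.append(str(child_ids[landmark_index]))
    return tris
```

For example, build_tris_property([(0, 1, 2), (0, 2, 3)], {0: "a", 1: "b", 2: "c", 3: "d"}) gives a six-element array describing two triangles that share the edge between “a” and “c”.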

The last step is to configure some other minor metadata to customize the playback how we want it.

Final Thoughts and Conclusions

Overall it’s been a really fun experience figuring out how to get everything working together. This is a great first project for learning how to bridge the real world with spatial playback in Recolude’s webplayer. If you find yourself experimenting with this, please share your results!

MediaPipe really was a pleasure to work with, and I am definitely going to be trying to force it into future projects. Implementing the “subject-as-vertices” feature within the Recolude webplayer was surprisingly straightforward and opened a lot of doors to new use cases. Now that I can animate any mesh I want, I kinda want to try doing some water playback. Another thing everyone’s been telling me to do is to remap the positions of the face onto an entirely different face mesh, which sounds awesome, but I’m going to have to save that for a different day. Please reach out with any questions, and thanks for reading!