Summary by Adrian Wilkins-Caruana
When people interact in person, they use more than just words to communicate: body language, posture, and facial expressions all carry meaning. When we chat over the phone or via text, misunderstandings can happen because these other communication channels are absent. This issue is also quite frustrating for Meta, which is trying to create the metaverse, a place where people communicate with each other through their digital avatars. I think people will embrace the metaverse, and the metaverse will succeed, only if Meta can figure out a way to capture and display all of these subtle non-verbal cues in a way that feels natural. Luckily for Meta, they seem to have figured out the capture part of this equation.
The researchers at Meta have trained a new vision-foundation model for four important human-centric perception tasks: 2D pose estimation, body-part segmentation, depth prediction, and surface-normal prediction. The figure below shows each of these tasks. The gist of their approach is this: first, they pretrained a vision-foundation model on images of people, and then they fine-tuned the model on the four tasks I just mentioned. They named the resulting foundation model Sapiens.
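To make the pretrain-then-fine-tune idea concrete, here is a minimal PyTorch sketch (not Meta’s code) of a shared, pretrained encoder with one lightweight head per task. The class name, the 1×1-convolution heads, and the embedding size are illustrative assumptions; only the four tasks and their output counts come from the summary above and below.

```python
import torch
import torch.nn as nn

class HumanCentricModel(nn.Module):
    """Hypothetical wrapper: a shared pretrained encoder plus one head per task."""
    def __init__(self, encoder: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.encoder = encoder  # pretrained backbone (Sapiens uses a large ViT)
        # Output channel counts follow the tasks described in this summary;
        # the 1x1-conv heads themselves are an illustrative simplification.
        self.heads = nn.ModuleDict({
            "pose": nn.Conv2d(embed_dim, 308, kernel_size=1),         # keypoint heatmaps
            "segmentation": nn.Conv2d(embed_dim, 28, kernel_size=1),  # body-part logits
            "depth": nn.Conv2d(embed_dim, 1, kernel_size=1),          # per-pixel depth
            "normals": nn.Conv2d(embed_dim, 3, kernel_size=1),        # surface normals
        })

    def forward(self, images: torch.Tensor, task: str) -> torch.Tensor:
        features = self.encoder(images)    # (batch, embed_dim, H', W') feature map
        return self.heads[task](features)  # task-specific prediction

# Stand-in encoder so the sketch runs; the real model is a pretrained ViT.
dummy_encoder = nn.Conv2d(3, 768, kernel_size=16, stride=16)
model = HumanCentricModel(dummy_encoder)
heatmaps = model(torch.randn(1, 3, 224, 224), task="pose")  # shape: (1, 308, 14, 14)
```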
When deciding what kind of data to pretrain Sapiens on, the researchers considered two situations. They could either pretrain it on as much data as they could get their hands on, or they could curate a dataset that only contains images of people. More diverse data could help the model generalize to real-world scenarios, but curating a dataset could help Sapiens become an expert at understanding pictures of people. One of the big claims of their research is that the curated-dataset approach is the right way to go.
Starting with a proprietary dataset of about a billion images, the researchers used a person-detection model to whittle it down to about 300 million images in which the detector confidently found a person. They then trained a human-centric foundation vision transformer (ViT), a model that captures lots of generic knowledge about what humans look like, using the masked-autoencoding approach, which teaches the model to reconstruct a partially masked image. To make this ViT perform tasks like pose estimation and body-part segmentation, the researchers appended different “heads” (ML heads, not human heads) to the end of the model to do these things. They also created several Sapiens variants of different sizes, ranging from 300 million to 2 billion parameters.
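For readers unfamiliar with masked autoencoding, here is a minimal, hedged sketch of the idea (not the paper’s implementation): hide a large random fraction of image patches and score the model only on how well it reconstructs the hidden ones. The masking ratio, patch size, and placeholder decoder output below are assumptions for illustration.

```python
import torch

def random_patch_mask(num_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    """Return a boolean mask that hides `mask_ratio` of the patches."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

# Flattened image patches, e.g. a 224x224 image cut into 196 patches of 16x16x3.
patches = torch.randn(196, 16 * 16 * 3)
mask = random_patch_mask(patches.shape[0])

# In the real setup, an encoder sees only the visible patches and a decoder
# predicts the hidden ones; here a random tensor stands in for that prediction.
reconstruction = torch.randn_like(patches)
loss = ((reconstruction[mask] - patches[mask]) ** 2).mean()  # error on masked patches only
```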
The pose-estimation head predicts a heatmap for each keypoint (second index-finger knuckle, left elbow, etc.): each location in the map holds the model’s estimate of how likely it is that the keypoint lies there. Sapiens predicts 308 of these keypoints, 243 in the face alone! A heatmap can contain high scores at several locations, which lets the model predict multiple people at once, as shown in the fifth example below. Using so many keypoints helps the model capture a lot more detail than existing pose estimators, which top out at 68 facial keypoints. Compared to the next-best pose estimator, the largest Sapiens variant is ~7% more accurate in terms of average precision and recall.
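As a rough illustration of how keypoints can be read out of those heatmaps, the sketch below takes the highest-scoring location in each of the 308 channels. This per-channel argmax is a common decoding strategy, not necessarily the exact one the paper uses, and the function name and map size are made up for the example.

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (num_keypoints, H, W) -> (num_keypoints, 3) rows of (x, y, score)."""
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    scores, indices = flat.max(dim=1)            # peak value in each channel
    ys, xs = indices // width, indices % width   # flat index back to (row, col)
    return torch.stack([xs.float(), ys.float(), scores], dim=1)

keypoints = heatmaps_to_keypoints(torch.rand(308, 64, 64))  # one row per keypoint
```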
The Meta team took a similar heatmap approach for segmenting body parts, where Sapiens predicts one layer for each body part. The researchers collected 100k images annotated with 28 body parts, an increase from the more standard 20-part set: the extended set distinguishes between the upper and lower parts of limbs and includes finer facial details such as the teeth and the upper and lower lips. Again, this approach naturally handles multi-person prediction, like in the first example below. The largest Sapiens variant is ~15% more accurate than a DeepLabV3 baseline (a very capable image-segmentation model) in terms of accuracy and IoU (intersection over union, which measures the similarity between a predicted region and the ground-truth region it’s trying to match).
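For reference, here is what the IoU metric in that comparison computes, sketched for a single body-part class; per-class scores would then be averaged over the 28 classes. The mask shapes and example regions are arbitrary.

```python
import torch

def iou(pred_mask: torch.Tensor, gt_mask: torch.Tensor) -> float:
    """Intersection over union for two boolean masks of the same shape."""
    intersection = (pred_mask & gt_mask).sum().item()
    union = (pred_mask | gt_mask).sum().item()
    return intersection / union if union > 0 else 1.0  # two empty masks match perfectly

pred = torch.zeros(64, 64, dtype=torch.bool); pred[10:40, 10:40] = True
gt = torch.zeros(64, 64, dtype=torch.bool); gt[20:50, 20:50] = True
print(iou(pred, gt))  # ~0.29: the regions overlap but don't line up exactly
```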
I won’t go into the details of Sapiens’ depth- and normal-estimation heads because the researchers followed a similar approach: They extended the current state-of-the-art approach by improving the data quality (as opposed to quantity), and then followed conventional modeling approaches. The researchers showed that Sapiens is more accurate than several baselines for these tasks, though not all of those baselines were human-centric models.
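For a sense of what “conventional modeling approaches” typically means for these two tasks, here is a hedged sketch: per-pixel regression for depth, and unit-length 3-vectors for surface normals trained with a cosine-similarity term. These loss choices are common defaults, not necessarily the ones the paper uses.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    """Per-pixel regression on predicted depth maps of shape (B, 1, H, W)."""
    return F.l1_loss(pred_depth, gt_depth)

def normal_loss(pred_normals: torch.Tensor, gt_normals: torch.Tensor) -> torch.Tensor:
    """Encourage predicted surface normals (B, 3, H, W) to point the same way as the labels."""
    pred_unit = F.normalize(pred_normals, dim=1)  # normals are unit-length vectors
    return (1.0 - F.cosine_similarity(pred_unit, gt_normals, dim=1)).mean()
```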
It’s not surprising that Meta is trying to develop models like Sapiens for human parsing. They are all-in on the metaverse and, in my opinion, it’s critical that participants’ avatars are convincing and lifelike. I think the additional facial detail in the pose and body-part models will be critical to achieving this, though I have doubts that a 2B-parameter model can be scaled to predict a user’s body in real time (as opposed to asynchronously, like Sapiens). But Sapiens is still a huge step toward that goal, and I look forward to seeing how this development improves Meta’s products.