Summary by Adrian Wilkins-Caruana and Tyler Neylon
Instead of holding up your smartphone to video-call your friends in 2d, wouldn’t it be cool if you could put on a VR headset and chat with them in 3d? But there’s a reason this hasn’t become a mainstream way to use VR headsets: real-time 3d capture and reconstruction is really hard. It can be done, but existing methods have limitations that make them unsuitable for real-time capture or impractical for everyday use. To address those issues, today’s paper looks at using the 3d Gaussian splatting method that we discussed recently to capture and render 3d avatars (3d versions of people) in real time.
Here are a couple of pre-existing ways a 3d avatar can be constructed to resemble a specific person: 1) map images from cameras onto a surface that’s been precisely captured using 3d sensors, or 2) train a neural network to represent the 3d scene from images taken at multiple viewpoints (in other words, a NeRF-based method, if you’re familiar with NeRFs). While these methods are useful for many other computer vision tasks, the mapping technique involves difficult custom model construction and may not render clothing movement accurately. And the neural network-based approach isn’t fast enough for real-time 3d reconstruction.
Before this paper, the introduction of Gaussian splatting solved the problem of rendering, in real time, models learned from images of a scene. But the splatting approach was designed for static scenes, not avatars that can move around in 3d. Now researchers from Meta have extended the original splatting approach with some clever tricks. The authors attached 3d Gaussians to a static avatar “cage,” and the system animates the avatar by deforming this cage to match the real-time position of a person. Each component (arms, legs, fingers, etc.) has its deformation applied based on pose data gathered from real-time images of the person. (The pose-gathering method is based on pre-existing work.)
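To get a feel for how attaching splats to a cage works, here’s a minimal NumPy sketch of one plausible attachment scheme, where each Gaussian’s center is expressed in barycentric coordinates of a tetrahedral cage cell. The array shapes and the attachment scheme itself are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np

# Toy cage: tetrahedral cells over the body, with each Gaussian attached to one
# cell via fixed barycentric coordinates (an assumed attachment scheme).
rng = np.random.default_rng(0)
rest_cage_verts = rng.random((200, 3))        # cage vertex positions at rest
cells = rng.integers(0, 200, size=(50, 4))    # 50 tetrahedral cells (vertex indices)
bary = rng.dirichlet(np.ones(4), size=50)     # barycentric coords of each Gaussian center
gaussian_cell = np.arange(50)                 # which cell each Gaussian is attached to

def deform_gaussian_centers(deformed_cage_verts: np.ndarray) -> np.ndarray:
    """Recompute Gaussian centers from a deformed cage.

    Because the barycentric weights stay fixed, moving the cage vertices
    automatically drags the attached Gaussians along with them.
    """
    corners = deformed_cage_verts[cells[gaussian_cell]]  # (50, 4, 3) cell corners
    return np.einsum("gc,gcd->gd", bary, corners)        # (50, 3) new centers

# A pose-driven model would predict the deformed cage; here we just nudge it.
new_centers = deform_gaussian_centers(rest_cage_verts + 0.01)
```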
Each body part has its own model that has been trained to map from input pose data to the deformation needed to adjust the static avatar into a position matching the person. The full deformation has two components: one works in terms of the cage structure (closer to a traditional mesh of 3d points), while the other adjusts each splat’s position and geometry relative to this cage. The cage adjustments capture the large-scale movements, while the splat corrections give the model a way to express small-scale details. The authors argue that using both of these networks (one for cage deformations, and another for splat-in-cage corrections) provides the best reconstruction results.
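Here’s a rough PyTorch sketch of what one body part’s pair of networks might look like: one predicts cage-vertex offsets from the pose (large-scale motion), the other predicts small per-splat corrections relative to the cage. The layer sizes, pose encoding, and output parameterization are assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 128) -> nn.Sequential:
    """A small three-layer MLP, the kind of lightweight model used throughout."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class BodyPartDeformer(nn.Module):
    """Per-body-part deformation: cage offsets plus per-splat corrections."""
    def __init__(self, pose_dim: int, n_cage_verts: int, n_gaussians: int):
        super().__init__()
        # Large-scale motion: offset every cage vertex belonging to this part.
        self.cage_net = mlp(pose_dim, n_cage_verts * 3)
        # Fine detail: per-Gaussian position delta (3), rotation quaternion (4),
        # and scale delta (3), expressed relative to the deformed cage.
        self.splat_net = mlp(pose_dim, n_gaussians * (3 + 4 + 3))
        self.n_cage_verts = n_cage_verts
        self.n_gaussians = n_gaussians

    def forward(self, pose: torch.Tensor):
        cage_offsets = self.cage_net(pose).view(self.n_cage_verts, 3)
        splat_corrections = self.splat_net(pose).view(self.n_gaussians, 10)
        return cage_offsets, splat_corrections

# Usage: one deformer per body part, all driven by the same pose vector.
deformer = BodyPartDeformer(pose_dim=72, n_cage_verts=200, n_gaussians=50)
cage_offsets, splat_corrections = deformer(torch.zeros(72))
```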
This deformation method solves the problem of making the avatar move, but there’s a more subtle issue with dynamic avatars that deformation doesn’t address: the color of a surface can change with movement. For example, a shirt may have wrinkles or creases that affect its shading, or the upper body may cast shadows on the lower body. The original 3d Gaussian splatting paper offered a model to color each splat in a view-dependent way using an idea called spherical harmonics. While spherical harmonics are good at handling colors that change depending on the view angle, they’re not great at handling colors that change based on pose. In this paper, instead of using spherical harmonics, the authors based the color and opacity of each Gaussian on another lightweight neural network, which they trained to map from pose information to color and opacity values.
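Here’s a hedged PyTorch sketch of what such a pose-conditioned appearance network could look like. The idea is simply to replace per-splat spherical harmonics with a small MLP that maps the pose, plus a learned per-Gaussian feature (an assumption on our part about how splats are told apart), to a color and an opacity; the real model’s inputs and outputs may well differ.

```python
import torch
import torch.nn as nn

class AppearanceNet(nn.Module):
    """Maps the pose (plus a learned per-Gaussian feature) to RGB color and opacity."""
    def __init__(self, pose_dim: int, n_gaussians: int, feat_dim: int = 32, hidden: int = 128):
        super().__init__()
        # One learned feature vector per Gaussian so the shared MLP can tell splats apart.
        self.features = nn.Parameter(torch.randn(n_gaussians, feat_dim))
        self.net = nn.Sequential(
            nn.Linear(pose_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 opacity logit
        )

    def forward(self, pose: torch.Tensor):
        pose = pose.expand(self.features.shape[0], -1)       # repeat the pose for every Gaussian
        out = self.net(torch.cat([pose, self.features], dim=-1))
        return torch.sigmoid(out[:, :3]), torch.sigmoid(out[:, 3:4])  # color, opacity in [0, 1]

# Usage: pose_dim=72 is just a placeholder pose-encoding size.
appearance = AppearanceNet(pose_dim=72, n_gaussians=50)
color, opacity = appearance(torch.zeros(72))
```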
Together, the first two neural networks predict the shape and position of the Gaussian splats in 3d space, while the last neural network predicts each splat’s local color and opacity. The outputs of these networks are shown below on the left and right, respectively.
These “shallow” neural networks are just three-layer multilayer perceptrons with 128 neurons each, which is microscopic compared to LLMs! They’re trained with a sophisticated loss function that has three main components: one that optimizes each Gaussian’s color, one that helps ensure that Gaussians in different garments (e.g., shirt vs. pants) stay physically separated, and one that reduces artifacts that can occur if a Gaussian shrinks a lot, becomes flipped, or otherwise changes dramatically.
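To make that three-part loss a bit more concrete, here’s an illustrative, heavily simplified sketch of how such terms could be combined. The specific forms, margins, and weights below are placeholders, not the paper’s definitions, and the regularizer shown only covers one kind of degenerate splat (collapsed scales).

```python
import torch

def total_loss(rendered, target, positions, garment_ids, scales,
               w_color=1.0, w_garment=0.1, w_reg=0.01, margin=0.02, min_scale=1e-4):
    """Illustrative three-part loss: photometric + garment separation + regularization."""
    # 1) Photometric term: the rendered image should match the captured image.
    color_loss = (rendered - target).abs().mean()

    # 2) Garment-separation term: keep Gaussians from different garments apart,
    #    so (say) shirt splats don't drift into the pants region.
    shirt, pants = positions[garment_ids == 0], positions[garment_ids == 1]
    garment_loss = torch.relu(margin - torch.cdist(shirt, pants)).mean()

    # 3) Regularization term: discourage degenerate splats, e.g. ones whose
    #    scales collapse toward zero.
    reg_loss = torch.relu(min_scale - scales).mean()

    return w_color * color_loss + w_garment * garment_loss + w_reg * reg_loss
```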
The authors ran extensive ablation experiments (omitting parts of the system to see the value each contributes) to show why this particular method works best. For example, the image below shows how their method (left) gets worse (in the other columns) when parts are removed. The first column uses the full system. The following columns show, respectively: a simpler deformation model (w/o Cage), a simpler color representation based on spherical harmonics (w/ SH), removal of the garment-separation loss (w/o LGarment), and removal of the loss term that avoids positional artifacts in previously unseen poses (last column, w/o LNeo).
Because this research comes from Meta, it’s pretty obvious that they want to use this technology to enable real-time video calling in VR headsets. You can see just how impressive this method is by watching videos of some live avatar reconstructions on the paper’s web page. When I watched these videos, I found myself forgetting that I was watching a real-time avatar reconstruction and not a carefully pre-rendered 3d animation! There are still some slight visual glitches, like the bottom hem of a shirt jutting up and down, but for the most part, the level of detail is fantastic: people bending down, pointing fingers, and smiling in a way that really feels lifelike. It’s definitely something I’d like to try out!