3D Face Reconstruction with Dense Landmarks

European Conference on Computer Vision 2022

Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Matthew Johnson, Jingjing Shen, Nikola Milosavljevic, Daniel Wilde, Stephan Garbin, Chirag Raman, Jamie Shotton, Toby Sharp, Ivan Stojiljkovic, Thomas J. Cashman, Julien Valentin



Landmarks often play a key role in face analysis, but many aspects of identity or expression cannot be represented by sparse landmarks alone. Thus, in order to reconstruct faces more accurately, landmarks are often combined with additional signals like depth images or techniques like differentiable rendering.

Can we keep things simple by just using more landmarks?

In answer, we present the first method that accurately predicts ten times as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations. By fitting a morphable model to these dense landmarks, we achieve state-of-the-art results for monocular 3D face reconstruction in the wild. We show that dense landmarks are an ideal signal for integrating face shape information across frames by demonstrating accurate and expressive facial performance capture in both monocular and multi-view scenarios. Finally, our method is highly efficient: we can predict dense landmarks and fit our 3D face model at over 150 FPS on a single CPU thread.

1) Landmark regression

We first predict probabilistic dense landmarks L, each with an expected position µ and an uncertainty σ.
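As a rough illustration of what predicting a position µ and a spread σ per landmark can look like during training, the sketch below evaluates a Gaussian negative log-likelihood for 2D landmarks. The function name, the array shapes, and the choice of an isotropic 2D Gaussian are assumptions made for this example and are not stated on this page.

```python
import numpy as np

def gaussian_nll(mu_pred, sigma_pred, mu_true):
    """Per-landmark negative log-likelihood of an isotropic 2D Gaussian.

    The constant log(2*pi) term is dropped because it does not affect training.

    mu_pred:    (N, 2) predicted landmark positions
    sigma_pred: (N,)   predicted standard deviations, must be > 0
    mu_true:    (N, 2) ground-truth landmark positions
    """
    mu_pred = np.asarray(mu_pred, dtype=float)
    sigma_pred = np.asarray(sigma_pred, dtype=float)
    mu_true = np.asarray(mu_true, dtype=float)
    # Squared Euclidean distance between prediction and ground truth.
    sq_err = np.sum((mu_pred - mu_true) ** 2, axis=-1)
    # A large sigma shrinks the error term but is penalised by log(sigma^2).
    return np.log(sigma_pred ** 2) + sq_err / (2.0 * sigma_pred ** 2)
```

Under a loss of this shape, a network can report a larger σ for landmarks it cannot localise well (for example, occluded ones) instead of being forced to commit to a precise position.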

2) Model-fitting

Then, we fit our 3D face model to L, minimizing an energy E by optimizing the model parameters Φ.
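The idea of fitting by energy minimization can be sketched with a toy stand-in for the face model. The example below assumes a hypothetical linear 2D landmark model (`mean + basis @ phi`) and an energy that weights each landmark residual by 1/(2σ²); with those assumptions the minimizer is a weighted linear least-squares solution. A real 3D face model also involves pose, camera projection, and other energy terms, so this is only an illustration of the weighting, not the method used here.

```python
import numpy as np

def fit_linear_model(mean, basis, mu, sigma):
    """Minimise E(phi) = sum_i ||mean_i + basis_i @ phi - mu_i||^2 / (2 sigma_i^2).

    mean:  (N, 2)    landmark positions of the toy model at phi = 0
    basis: (N, 2, K) linear change of each landmark per parameter
    mu:    (N, 2)    predicted landmark positions
    sigma: (N,)      predicted standard deviations, must be > 0
    """
    n, _, k = basis.shape
    # One weight per coordinate: each landmark's x and y share its 1/sigma.
    w = np.repeat(1.0 / sigma, 2)
    A = basis.reshape(2 * n, k) * w[:, None]
    b = (mu - mean).reshape(2 * n) * w
    phi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return phi

def energy(mean, basis, mu, sigma, phi):
    """Evaluate the weighted landmark energy for parameters phi."""
    residual = mean + basis @ phi - mu  # shape (N, 2)
    return float(np.sum(np.sum(residual ** 2, axis=-1) / (2.0 * sigma ** 2)))
```

Because each residual is divided by σ², landmarks predicted with high uncertainty pull on the fit less than confidently predicted ones.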

Synthetic data

While a human might consistently label images with 68 landmarks, manually annotating images with dense landmarks would be impossible. Instead, we rendered 100,000 synthetic training images using our Face Synthetics system. Without the perfect annotations provided by synthetic data, dense landmark prediction would not be possible.
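One way to see why rendered images can carry exact dense annotations: when the 3D mesh and the camera used for rendering are both known, a landmark label can be computed by projecting a mesh vertex into the image. The sketch below assumes a simple pinhole camera and points already expressed in camera space; the function name and parameters are hypothetical and are not part of the Face Synthetics system.

```python
import numpy as np

def project_vertices(vertices, focal_length, principal_point):
    """Project camera-space 3D points to pixel coordinates with a pinhole model.

    vertices:        (N, 3) points in camera space, with z > 0
    focal_length:    focal length in pixels
    principal_point: (2,) image-space principal point in pixels
    """
    vertices = np.asarray(vertices, dtype=float)
    # Perspective divide: x/z and y/z.
    xy = vertices[:, :2] / vertices[:, 2:3]
    return focal_length * xy + np.asarray(principal_point, dtype=float)
```

Labels obtained this way are consistent across images by construction, including for points that a human annotator could not place reliably.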

Multi-view fitting

Dense landmarks are the ideal signal for markerless multi-view facial performance capture.

Real-time monocular fitting

Our approach is also highly efficient, running in real time on low-power CPU-only laptops.


    doi = {10.48550/ARXIV.2204.02776},
    url = {https://arxiv.org/abs/2204.02776},
    author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Hewitt, Charlie and Johnson, Matthew and Shen, Jingjing and Milosavljevic, Nikola and Wilde, Daniel and Garbin, Stephan and Raman, Chirag and Shotton, Jamie and Sharp, Toby and Stojiljkovic, Ivan and Cashman, Tom and Valentin, Julien},
    keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
    title = {3D face reconstruction with dense landmarks},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}