Look Ma, no markers

Holistic performance capture without the hassle

ACM Transactions on Graphics

SIGGRAPH Asia 2024

Charlie Hewitt, Fatemeh Saleh, Sadegh Aliakbarian, Lohit Petikam, Shideh Rezaeifar, Louis Florentin, Zafiirah Hosenie, Thomas J. Cashman, Julien Valentin, Darren Cosker, Tadas Baltrušaitis

Paper · arXiv · Video · Datasets

Abstract

We tackle the problem of highly accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on the face, body or hands in isolation, involve complex and expensive hardware, and require a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually support only a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs and supports varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data together with powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize across diverse datasets.

Holistic Performance Capture

Our approach combines machine-learning models for dense-landmark and parameter prediction with model fitting to provide a robust, accurate and adaptable system. Our method supports registration of the face, body and hands, either in isolation or together in a single take.
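The fitting stage of such a hybrid pipeline can be illustrated with a toy sketch (this is not the paper's actual implementation): a hypothetical linear parametric model is fitted to noisy 2D landmarks, as a dense-landmark predictor might produce, by minimizing a landmark residual plus a small parameter prior with nonlinear least squares.

```python
# Toy sketch of landmark-driven model fitting, assuming a hypothetical
# linear shape model; the paper's actual models and energies differ.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Hypothetical linear model: landmarks(theta) = mean + reshape(basis @ theta)
n_landmarks, n_params = 50, 5
mean = rng.normal(size=(n_landmarks, 2))
basis = rng.normal(size=(n_landmarks * 2, n_params))

def model_landmarks(theta):
    return mean + (basis @ theta).reshape(n_landmarks, 2)

# Stand-in for ML-predicted dense landmarks: ground truth plus noise.
theta_true = rng.normal(size=n_params)
observed = model_landmarks(theta_true) + rng.normal(scale=0.01,
                                                    size=(n_landmarks, 2))

def residuals(theta):
    # Landmark data term plus a small zero-mean prior on the parameters.
    return np.concatenate([
        (model_landmarks(theta) - observed).ravel(),
        0.01 * theta,
    ])

fit = least_squares(residuals, x0=np.zeros(n_params))
```

In a real system the residual would involve camera projection and per-landmark confidences, and the parameters would include pose, shape and expression; the optimization structure, however, is the same.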

Our parametric model captures body and hand pose, body and face shape, and facial expression.

We can also track tongue articulation and eye gaze.

Our method achieves state-of-the-art results on a number of 3D reconstruction benchmarks.

No Hassle

Motion capture shoots typically require specialist hardware, skilled experts and a lot of time to get right. This can make them expensive and challenging to manage in a tight production schedule. Our method aims to eliminate this inconvenience by providing a marker-less, calibration-free solution that can be used with off-the-shelf hardware. This allows for quick and easy capture of high-quality motion data in a variety of environments.

Using just two uncalibrated mobile-phone cameras, we achieve high-quality results in world space.

Our method even works with a single, moving camera in an unconstrained environment with arbitrary clothing.

Synthetic Datasets

Our method is trained exclusively on synthetic data, generated using a conventional computer graphics pipeline. The three datasets used in the paper are available to download here.

SynthBody can be used for tasks such as skeletal tracking and body pose prediction.

SynthFace can be used for tasks such as facial landmark and head pose prediction or face parsing.

SynthHand can be used for tasks such as hand pose prediction or landmark regression.
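As a minimal illustration of the kind of supervised task these datasets enable (using random stand-in data, not the released datasets' actual format), the sketch below fits a linear landmark regressor from image features to 2D landmark coordinates by least squares:

```python
# Toy landmark-regression sketch on synthetic data; feature dimensions and
# the 21-landmark count (e.g. hand joints) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, n_landmarks = 500, 16, 21

# Synthetic supervision: features X and landmark targets Y with known
# linear relationship W_true, plus small label noise.
W_true = rng.normal(size=(n_features, n_landmarks * 2))
X = rng.normal(size=(n_samples, n_features))
Y = X @ W_true + rng.normal(scale=0.01, size=(n_samples, n_landmarks * 2))

# Closed-form least-squares fit of the regressor.
W_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Mean per-landmark Euclidean error of the fitted predictor.
pred = X @ W_fit
err = np.mean(np.linalg.norm(
    (pred - Y).reshape(n_samples, n_landmarks, 2), axis=-1))
```

Real pipelines would replace the linear map with a convolutional network over rendered images, but the synthetic-supervision setup is analogous.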

BibTeX

@article{hewitt2024look,
    title={Look Ma, no markers: holistic performance capture without the hassle},
    author={Hewitt, Charlie and Saleh, Fatemeh and Aliakbarian, Sadegh and Petikam, Lohit and Rezaeifar, Shideh and Florentin, Louis and Hosenie, Zafiirah and Cashman, Thomas J and Valentin, Julien and Cosker, Darren and Baltru\v{s}aitis, Tadas},
    journal={ACM Transactions on Graphics (TOG)},
    volume={43},
    number={6},
    year={2024},
    publisher={ACM New York, NY, USA},
    articleno={235},
    numpages={12},
}