DAViD

Data-efficient and Accurate Vision Models from Synthetic Data

International Conference on Computer Vision (ICCV) 2025

Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xiao-Xian, Antonio Criminisi, Thomas J. Cashman, Tadas Baltrušaitis

Paper · arXiv · Video · Dataset & Models
DAViD also references Michelangelo's David—an iconic symbol of anatomical precision—and the David vs. Goliath story, reflecting our small yet powerful dataset and models.

Abstract

The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and with higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates the accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy.

SynthHuman: Human-centric Synthetic Data

To train our models, we use exclusively synthetic data. Specifically, we use the data generation pipeline of Hewitt et al., incorporating the updated face model of Petikam et al., to create a human-centric synthetic dataset with a high degree of realism and high-fidelity ground-truth annotations. Our SynthHuman dataset contains 300K images of resolution 384×512, covering face, upper-body, and full-body scenarios in equal proportion. We design SynthHuman to be diverse in terms of poses, environments, lighting, and appearances, and not tailored to any specific evaluation set. This allows us to train models that generalize across a range of benchmark datasets, as well as on in-the-wild data.

Along with the rendered RGB image, each sample includes ground-truth annotations for the soft foreground mask, surface normals, and depth, which we use to train our models.
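
For illustration, the sketch below shows how one such sample could be loaded as training tensors. The file naming scheme and on-disk formats here are assumptions made for the example, not the released dataset's actual layout.

# Minimal sketch of loading a SynthHuman-style sample; file names and formats
# are assumptions for illustration, not the released dataset's actual layout.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class SynthHumanLikeDataset(Dataset):
    """Yields (RGB, soft mask, normals, depth) tensors for one sample."""

    def __init__(self, root: str):
        self.root = Path(root)
        # Assumed naming convention: 0000001_img.png, 0000001_mask.png, ...
        self.ids = sorted(p.stem.split("_")[0] for p in self.root.glob("*_img.png"))

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, i: int):
        sid = self.ids[i]
        rgb = np.asarray(Image.open(self.root / f"{sid}_img.png"), dtype=np.float32) / 255.0
        mask = np.asarray(Image.open(self.root / f"{sid}_mask.png"), dtype=np.float32) / 255.0
        normals = np.load(self.root / f"{sid}_normals.npy")   # H x W x 3, unit vectors (assumed)
        depth = np.load(self.root / f"{sid}_depth.npy")       # H x W depth map (assumed)
        return (
            torch.from_numpy(rgb).permute(2, 0, 1),           # 3 x 512 x 384
            torch.from_numpy(mask)[None],                     # 1 x 512 x 384
            torch.from_numpy(normals).permute(2, 0, 1),       # 3 x 512 x 384
            torch.from_numpy(depth)[None],                    # 1 x 512 x 384
        )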


Architecture

We use a single model architecture (varying only the number of output channels) to tackle the three dense prediction tasks. We adapt the dense prediction transformer (DPT) to handle variable input resolutions efficiently. Using a single dataset and a single model architecture also allows us to train one model with three convolutional heads for multi-task learning. This is particularly important for combining soft foreground segmentation with depth and normal estimation, as human-centric tasks require separating the human from the background.
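
As a rough illustration of this multi-head design, the sketch below shows one shared backbone feeding three small convolutional heads whose only difference is the output channel count. It uses a toy encoder-decoder purely to make the structure concrete; it is not the paper's actual DPT adaptation.

# Toy multi-task dense prediction model: shared backbone, three heads for
# depth (1 channel), surface normals (3 channels), and soft segmentation (1 channel).
# This is an illustrative sketch, not the DPT-based architecture used in the paper.
import torch
import torch.nn as nn


def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class MultiTaskDenseNet(nn.Module):
    def __init__(self, width: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(3, width), nn.MaxPool2d(2), conv_block(width, width * 2))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(width * 2, width),
        )
        # One lightweight head per task; only the number of output channels differs.
        self.depth_head = nn.Conv2d(width, 1, kernel_size=1)
        self.normal_head = nn.Conv2d(width, 3, kernel_size=1)
        self.seg_head = nn.Conv2d(width, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> dict:
        feats = self.decoder(self.encoder(x))
        normals = nn.functional.normalize(self.normal_head(feats), dim=1)  # unit-length normals
        return {
            "depth": self.depth_head(feats),
            "normals": normals,
            "soft_mask": torch.sigmoid(self.seg_head(feats)),
        }


# Example: a 384x512 RGB frame produces per-pixel predictions at the same resolution.
out = MultiTaskDenseNet()(torch.randn(1, 3, 512, 384))
print({k: tuple(v.shape) for k, v in out.items()})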

Results

Our human-centric dense prediction model delivers high-quality, detailed results while achieving remarkable efficiency, running orders of magnitude faster than competing methods, with inference speeds as low as 21 milliseconds per frame (the large multi-task model on an NVIDIA A100). It reliably captures a wide range of human characteristics under diverse lighting conditions, preserving fine-grained details such as hair strands and subtle facial features. This demonstrates the model's robustness and accuracy in complex, real-world scenarios.
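
The snippet below shows a generic way to measure per-frame latency for a model like this on a GPU. It is not the paper's exact benchmarking setup, just the common warm-up-then-average pattern with device synchronization.

# Generic GPU latency measurement sketch (not the paper's exact benchmark setup):
# average forward-pass time for a single 384x512 frame after warm-up.
import time

import torch


def measure_latency_ms(model: torch.nn.Module, device: str = "cuda", runs: int = 100) -> float:
    model = model.to(device).eval()
    x = torch.randn(1, 3, 512, 384, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up so one-time initialisation is excluded
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()         # ensure warm-up kernels have finished
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()         # wait for all timed kernels to finish
    return (time.perf_counter() - start) * 1000.0 / runs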

BibTeX

@misc{saleh2025david,
    title={{DAViD}: Data-efficient and Accurate Vision Models from Synthetic Data},
    author={Fatemeh Saleh and Sadegh Aliakbarian and Charlie Hewitt and Lohit Petikam and Xiao-Xian and Antonio Criminisi and Thomas J. Cashman and Tadas Baltrušaitis},
    year={2025},
    eprint={2507.15365},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.15365},
}