DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
International Conference on Computer Vision (ICCV) 2025
To train our models, we use exclusively synthetic data. Specifically, we use the data generation pipeline of Hewitt et al., incorporating the updated face model of Petikam et al., to create a human-centric synthetic dataset with a high degree of realism and high-fidelity ground-truth annotations. Our SynthHuman dataset contains 300K images at a resolution of 384×512, covering face, upper-body, and full-body scenarios in equal proportion. We design SynthHuman to be diverse in poses, environments, lighting, and appearances, and not tailored to any specific evaluation set. This allows us to train models that generalize across a range of benchmark datasets, as well as on in-the-wild data.
Along with the rendered RGB image, each sample includes ground-truth annotations for the soft foreground mask, surface normals, and depth, which we use to train our models.
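To make the per-sample structure concrete, below is a minimal PyTorch loading sketch. The directory layout, file names (`*_rgb.png`, `*_mask.npy`, `*_normals.npy`, `*_depth.npy`), and array formats are illustrative assumptions, not the released SynthHuman format.

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class SyntheticHumanSamples(Dataset):
    """Hypothetical loader for RGB images with dense ground truth.

    Assumes each sample pairs an RGB render with a soft foreground
    mask, per-pixel surface normals, and depth; the file naming
    scheme is an assumption for illustration only.
    """

    def __init__(self, root):
        self.root = root
        self.ids = sorted(
            f[:-len("_rgb.png")]
            for f in os.listdir(root)
            if f.endswith("_rgb.png")
        )

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sid = self.ids[idx]
        rgb = np.asarray(Image.open(os.path.join(self.root, f"{sid}_rgb.png")).convert("RGB"))
        mask = np.load(os.path.join(self.root, f"{sid}_mask.npy"))        # (H, W), soft values in [0, 1]
        normals = np.load(os.path.join(self.root, f"{sid}_normals.npy"))  # (H, W, 3), unit vectors
        depth = np.load(os.path.join(self.root, f"{sid}_depth.npy"))      # (H, W)
        return {
            "rgb": torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0,
            "mask": torch.from_numpy(mask).float().unsqueeze(0),
            "normals": torch.from_numpy(normals).permute(2, 0, 1).float(),
            "depth": torch.from_numpy(depth).float().unsqueeze(0),
        }
```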
We use a single model architecture (with a varying number of output channels) to tackle the three dense prediction tasks. We adapt the dense prediction transformer (DPT) to handle variable input resolutions efficiently. Using a single dataset and a single model architecture allows us to easily train one model with three convolutional heads that performs multi-task learning, as sketched below. This is particularly important for combining soft foreground segmentation with depth and normal estimation, since human-centric tasks require separating the human from the background.
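As a rough illustration of the head design, the sketch below attaches three lightweight convolutional heads to a shared dense feature map. Here `features` stands in for the DPT decoder output, and the channel widths and activations are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Three convolutional heads over shared dense features.

    Output channels per task follow the section: 1 for the soft
    foreground mask, 3 for surface normals, 1 for depth.
    """

    def __init__(self, in_channels=256):
        super().__init__()

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels // 2, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels // 2, out_channels, 1),
            )

        self.mask_head = head(1)    # soft foreground segmentation
        self.normal_head = head(3)  # per-pixel surface normals
        self.depth_head = head(1)   # per-pixel depth

    def forward(self, features):
        # features: (B, C, H, W) shared backbone/decoder output
        mask = torch.sigmoid(self.mask_head(features))                 # soft values in [0, 1]
        normals = F.normalize(self.normal_head(features), dim=1)       # unit-length vectors
        depth = self.depth_head(features)
        return mask, normals, depth
```

The sigmoid keeps the mask soft in [0, 1] and the normalization constrains predicted normals to unit length; both are illustrative choices, and sharing one backbone across the three heads amortizes its cost across tasks.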
Our human-centric dense prediction model delivers high-quality, detailed results while achieving remarkable efficiency, running orders of magnitude faster than competing methods, with inference speeds as low as 21 milliseconds per frame (the large multi-task model on an NVIDIA A100). It reliably captures a wide range of human characteristics under diverse lighting conditions, preserving fine-grained details such as hair strands and subtle facial features. This demonstrates the model's robustness and accuracy in complex, real-world scenarios.
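For context on how a per-frame latency figure like the quoted 21 ms can be measured on a GPU, a minimal timing sketch follows. The model and input shape are placeholders, and the pattern (warm-up iterations plus explicit CUDA synchronization) is standard PyTorch benchmarking practice rather than the authors' exact protocol.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_shape=(1, 3, 512, 384), iters=100, warmup=10):
    """Average per-frame inference latency in milliseconds."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):           # warm-up: allocator, cuDNN autotuning
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure warm-up kernels finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # wait for all queued kernels
    return (time.perf_counter() - start) / iters * 1000.0
```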
@misc{saleh2025david,
  title         = {{DAViD}: Data-efficient and Accurate Vision Models from Synthetic Data},
  author        = {Fatemeh Saleh and Sadegh Aliakbarian and Charlie Hewitt and Lohit Petikam and Xiao-Xian and Antonio Criminisi and Thomas J. Cashman and Tadas Baltrušaitis},
  year          = {2025},
  eprint        = {2507.15365},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2507.15365},
}