Towards efficient Diffusion Transformers! We introduce Region-Adaptive Sampling (RAS), the first diffusion sampling strategy that allows regional variability in sampling ratios. Compared with spatially uniform samplers, this flexibility lets our approach allocate the DiT's processing power to the model's current regions of interest, significantly improving generation quality within the same inference budget. With models such as Lumina-Next-T2I and Stable Diffusion 3, RAS's fast-region noise updating yields over 2x acceleration with negligible loss in image quality. A user study comparing our method with uniform sampling across a variety of generated cases further shows that RAS maintains comparable generation quality at a 1.6x acceleration rate.
Diffusion models (DMs) have proven to be highly effective probabilistic generative models, producing high-quality data across various domains. However, generating samples with DMs involves solving a generative Stochastic or Ordinary Differential Equation (SDE/ODE) in reverse time, which requires multiple sequential forward passes through a large neural network. This sequential processing limits their real-time applicability.
Unlike U-Net-based models, DiTs can flexibly process an arbitrary number of tokens, which opens up new possibilities. This flexibility inspired us to design a new sampling approach capable of assigning different numbers of sampling steps to different regions within an image. Specifically, we observed two key phenomena: (1) during the later stages of diffusion, the regions of focus in adjacent steps exhibit considerable continuity, and (2) in each sampling step, the model tends to focus on specific, semantically meaningful areas of the image.
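The idea can be illustrated with a minimal sketch of one region-adaptive step: rank tokens by how much they are still changing, run the network only on the top fraction ("fast-update" tokens), and reuse the cached output everywhere else. This is not the authors' implementation; the ranking metric, the `ras_step`/`model_fn` names, and the interface are illustrative assumptions.

```python
import numpy as np

def ras_step(tokens, cached_output, model_fn, update_ratio=0.5):
    """One illustrative region-adaptive sampling step (sketch, not the paper's code).

    tokens:        (N, D) latent tokens at the current timestep
    cached_output: (N, D) model output cached from the previous step
    model_fn:      callable that runs the DiT on a subset of tokens
    update_ratio:  fraction of tokens refreshed this step
    """
    # Rank tokens by how much the cached prediction differs from the current
    # state; large-change tokens stand in for the model's "regions of interest"
    # (an assumed proxy metric, chosen here for simplicity).
    change = np.linalg.norm(tokens - cached_output, axis=-1)
    num_fast = max(1, int(update_ratio * tokens.shape[0]))
    fast_idx = np.argsort(change)[-num_fast:]

    # Run the network only on the fast-update region...
    new_output = cached_output.copy()
    new_output[fast_idx] = model_fn(tokens[fast_idx])

    # ...and reuse the cached output for all other tokens.
    return new_output, fast_idx
```

Because only `update_ratio` of the tokens pass through the network each step, the per-step cost drops roughly in proportion, while slow-changing regions coast on cached predictions between refreshes.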
To further evaluate the qualitative performance of RAS, we conducted a human evaluation. We randomly selected 14 prompts from the official research papers and blogs of Stable Diffusion 3 and Lumina, generating two images for each prompt: one using dense inference and the other using RAS, both with the same random seed and default number of timesteps. As shown in Figure 4 (e), 633 out of 1400 votes (45.21%) indicated that the two images were of similar quality. Additionally, 28.29% of votes favored the dense image over the RAS result, while 26.50% preferred RAS over the dense result. These results demonstrate that RAS achieves a significant improvement in throughput (1.625× for Stable Diffusion 3 and 1.561× for Lumina-Next-T2I) without noticeably affecting human preference.
As shown in Figure 4 (c) and (d), RAS consistently pushes the efficiency frontier for rectified-flow-based diffusion, significantly reducing inference time per timestep while maintaining competitive image quality. Compared with merely lowering the total number of timesteps, RAS degrades quality more slowly, especially when timesteps fall below 10. Moreover, in side-by-side comparisons on Stable Diffusion 3 and Lumina-Next-T2I, RAS frequently achieves a Pareto improvement in throughput and image quality: as shown in Table 1, for a given speedup target there almost always exists a RAS configuration that offers a better FID, sFID, or CLIP score than the dense-inference baseline. As a result, RAS provides a broader parameter space for efficiently balancing throughput, quality, and prompt compatibility in large-scale or latency-critical diffusion applications.
@misc{liu2025regionadaptivesamplingdiffusiontransformers,
      title={Region-Adaptive Sampling for Diffusion Transformers},
      author={Ziming Liu and Yifan Yang and Chengruidong Zhang and Yiqi Zhang and Lili Qiu and Yang You and Yuqing Yang},
      year={2025},
      eprint={2502.10389},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.10389},
}