Towards efficient Diffusion Transformers! We introduce Region-Adaptive Sampling (RAS), the first diffusion sampling strategy that allows regional variability in sampling ratios. Compared with spatially uniform samplers, this flexibility lets our approach allocate the DiT's processing power to the model's current regions of interest, significantly improving generation quality within the same inference budget. With models such as Lumina-Next-T2I and Stable Diffusion 3, RAS's fast-region noise updating yields over 2x acceleration with negligible loss in image quality. A user study comparing our method with uniform sampling across a variety of generated cases further shows that RAS maintains comparable generation quality at a 1.6x acceleration rate.
Diffusion models (DMs) have proven to be highly effective probabilistic generative models, producing high-quality data across various domains. However, generating samples with DMs involves solving a generative Stochastic or Ordinary Differential Equation (SDE/ODE) in reverse time, which requires multiple sequential forward passes through a large neural network. This sequential processing limits their real-time applicability.
Unlike U-Net-based models, DiTs can flexibly process an arbitrary number of tokens, which opens up new possibilities. This flexibility inspired us to design a new sampling approach capable of assigning different numbers of sampling steps to different regions within an image. Specifically, we observed two key phenomena: (1) during the later stages of diffusion, the regions of focus in adjacent steps exhibit considerable continuity, and (2) in each sampling step, the model tends to focus on specific, semantically meaningful areas of the image.
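The idea can be illustrated with a minimal sketch of one region-adaptive step: rank tokens by how much they are still changing, run the network only on the top fraction ("fast-update" tokens), and reuse the cached output everywhere else. This is not the authors' implementation; the ranking metric, the `ras_step`/`model_fn` names, and the interface are illustrative assumptions.

```python
import numpy as np

def ras_step(tokens, cached_output, model_fn, update_ratio=0.5):
    """One illustrative region-adaptive sampling step (sketch, not the paper's code).

    tokens:        (N, D) latent tokens at the current timestep
    cached_output: (N, D) model output cached from the previous step
    model_fn:      callable that runs the DiT on a subset of tokens
    update_ratio:  fraction of tokens refreshed this step
    """
    # Rank tokens by how much the cached prediction differs from the current
    # state; large-change tokens stand in for the model's "regions of interest"
    # (an assumed proxy metric, chosen here for simplicity).
    change = np.linalg.norm(tokens - cached_output, axis=-1)
    num_fast = max(1, int(update_ratio * tokens.shape[0]))
    fast_idx = np.argsort(change)[-num_fast:]

    # Run the network only on the fast-update region...
    new_output = cached_output.copy()
    new_output[fast_idx] = model_fn(tokens[fast_idx])

    # ...and reuse the cached output for all other tokens.
    return new_output, fast_idx
```

Because only `update_ratio` of the tokens pass through the network each step, the per-step cost drops roughly in proportion, while slow-changing regions coast on cached predictions between refreshes.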
To further evaluate the qualitative performance of RAS, we conducted a human evaluation. We randomly selected 14 prompts from the official research papers and blogs of Stable Diffusion 3 and Lumina, generating two images for each prompt: one using dense inference and the other using RAS, both with the same random seed and default number of timesteps. As shown in Figure 4 (e), 633 out of 1400 votes (45.21%) indicated that the two images were of similar quality. Additionally, 28.29% of votes favored the dense image over the RAS result, while 26.50% preferred RAS over the dense result. These results demonstrate that RAS achieves a significant improvement in throughput (1.625× for Stable Diffusion 3 and 1.561× for Lumina-Next-T2I) without noticeably affecting human preference.
As shown in Figure 4 (c) and (d), RAS consistently pushes the efficiency frontier for rectified-flow-based diffusion, significantly reducing inference time per timestep while maintaining competitive image quality. Compared with merely lowering the total number of timesteps, RAS degrades quality more slowly, especially when timesteps fall below 10. Moreover, in side-by-side comparisons on Stable Diffusion 3 and Lumina-Next-T2I, RAS frequently achieves a Pareto improvement in throughput and image quality: as shown in Table 1, for a given speedup target there almost always exists a RAS configuration that offers a better FID, sFID, or CLIP score than the dense-inference baseline. As a result, RAS provides a broader parameter space for efficiently balancing throughput, quality, and prompt compatibility in large-scale or latency-critical diffusion applications.
@misc{liu2025regionadaptivesamplingdiffusiontransformers,
      title={Region-Adaptive Sampling for Diffusion Transformers},
      author={Ziming Liu and Yifan Yang and Chengruidong Zhang and Yiqi Zhang and Lili Qiu and Yang You and Yuqing Yang},
      year={2025},
      eprint={2502.10389},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.10389},
}