Accepted at CVPR 2026

DRM: Diffusion-based Reward Model
With Step-wise Guidance

TL;DR. A pre-trained diffusion model is itself a powerful reward model. DRM scores any noisy intermediate latent, enabling dense per-step rewards for RL alignment (Step-GRPO) and a training-free quality booster (Step-wise Sampling).

Jaxon Zhang1*, Binxin Yang2, Hubery Yin2, Chen Li2, Jing LYU2
1Peking University    2WeChat Vision, Tencent Inc.
*Work done during an internship at WeChat Vision, Tencent Inc.
Teaser: DRM provides fine-grained reward along the entire denoising trajectory
Comparison between previous reward models and DRM. Existing reward models treat the generation process as a black box, providing only a single, terminal reward based on the final output. Our DRM offers fine-grained rewards for any noisy latent along the entire denoising trajectory.

Contributions

01 Reward Model

A diffusion model is a reward model.

We introduce DRM, repurposing a pre-trained diffusion model as the evaluative backbone. It inherits a deep understanding of aesthetics and composition, and uniquely scores noisy intermediate latents at any stage of generation.

02 Optimization

Step-GRPO: dense per-step rewards.

A reinforcement learning algorithm that resolves the credit-assignment problem in diffusion alignment. Step-GRPO uses DRM's per-step rewards for far more stable and efficient policy optimization than terminal-reward GRPO.

03 Inference

Step-wise Sampling: training-free.

A plug-and-play inference strategy. At each denoising step we explore multiple SDE branches and let DRM greedily steer the trajectory toward higher-quality outcomes — no fine-tuning required.

Method

1. Diffusion-based Reward Model (DRM)

DRM predicts human preferences by leveraging the intermediate features from the Diffusion Transformer (DiT) of a pre-trained diffusion model. We initialize the backbone from SD3.5-Medium (2.5B parameters) and truncate the last three transformer layers so that the parameter count matches HPSv3-2B for fair comparison.

Given a noisy latent $x_{t}$ at timestep $t$, the truncated DiT produces visual features $f_{v} \in \mathbb{R}^{L \times d}$ that are fed into a lightweight reward head: an MLP projects $f_v$ into a lower-dimensional space, after which the features are reshaped into a 2D map and processed by a small convolutional network, followed by pooling and a final linear projection to produce the preference score $s$.

DRM architecture and training pipeline
Overview of the Diffusion-based Reward Model (DRM). (Left) Training pipeline: DRM takes a pair of preferred and dispreferred images, both corrupted with noise at a specific timestep $t$, and predicts their reward scores. The model is optimized via the Bradley–Terry preference loss. (Right) The detailed architecture of our Reward Output Head.

DRM is trained on triplets $(I^{\text{win}}, I^{\text{lose}}, p)$ from HPDv3, Pick-A-Pic and ImageReward (1.4M pairs total). For a randomly sampled timestep $t$, both images are corrupted via the forward diffusion process and scored:

\[ \mathcal{L}_{\text{DRM}} = -\log \sigma\!\left( s^{\text{win}} - s^{\text{lose}} \right). \]

2. Step-wise GRPO (Step-GRPO)

Standard GRPO treats the multi-step generation process as a black box and applies the terminal reward uniformly to every timestep — an imprecise credit assignment that ignores the varying contribution of each step. Because DRM is initialized from a diffusion backbone, it can score noisy latents at any timestep, enabling dense per-step rewards.

GRPO vs. Step-wise GRPO
GRPO vs. Step-wise GRPO. (Left) Naive GRPO relies on a single terminal reward computed at $t{=}0$ and broadcasts it uniformly to all preceding steps. (Right) Step-GRPO branches out $k$ candidate samples at each timestep via SDE and uses DRM to compute a precise, step-specific reward and advantage.

At each reverse-diffusion step $t$, starting from $x_{t+1}$, we draw $k$ SDE candidates $\{x_t^i\}_{i=1}^k$ and score them with DRM. The immediate advantage for each candidate is normalized within the local group:

\[ \hat A_t^i = \frac{R(x_t^i, c) - \mathrm{mean}\!\left(\{R(x_t^i, c)\}_{i=1}^{k}\right)} {\mathrm{std}\!\left(\{R(x_t^i, c)\}_{i=1}^{k}\right)}. \]
Unlike conventional GRPO, this advantage is a local quantity that scores the decision at the current timestep — yielding a much more direct and fine-grained policy-gradient signal.

3. Step-wise Sampling (Training-free)

DRM also doubles as an inference-time quality booster. Conventional deterministic samplers follow a single fixed trajectory, so any early-step error compounds to the final image. Step-wise Sampling implements an “explore-and-select” mechanism: at each timestep we (1) explore by drawing $k$ SDE successors, (2) score each with DRM, and (3) greedily commit to the highest-scoring branch.

Step-wise Sampling overview
Overview of Step-wise Sampling. At each step $t$, we branch into $k$ candidates via SDE. The DRM scores them and the top-scoring latent continues the trajectory.
\[ x_{t-1} = \arg\max_{x_{t-1}^{i}}\; R(x_{t-1}^{i}, c). \]

Experiments

Reward Modeling Accuracy

We evaluate DRM against leading reward models on standard preference benchmarks. Despite using only 2B parameters, DRM significantly outperforms the similarly-sized VLM-based HPSv3-2B and is competitive with the much larger 7B variant — strong evidence that a diffusion backbone is a more parameter-efficient route to reward modeling.

Model Size ImageReward ↑ PickScore ↑ HPDv2 ↑ HPDv3 ↑
CLIP ViT-H/1457.160.865.148.6
Aesthetic Predictor57.456.876.859.9
ImageReward65.161.174.058.6
PickScore61.670.579.865.6
HPSv265.763.883.365.3
MPS67.563.183.564.3
HPSv3-2B2B57.963.680.866.3
HPSv3-7B7B66.872.885.476.9
DRM (Ours)2B 64.1 73.4 82.2 74.0
Preference prediction accuracy (%). Bold = best, underlined = second-best. DRM achieves the best PickScore and second-best HPDv3 with only 2B parameters.

Step-GRPO: Faster Convergence, Higher Ceiling

Reward curves
Reward curves for various RL algorithms optimized using our DRM. Step-GRPO not only reaches a higher final reward but also converges 2.5× faster (in steps) than the standard GRPO baseline.

We RL-finetune SD3.5-Medium with PickScore, HPSv3 or DRM as reward, all under the same Flow-GRPO recipe; for DRM we additionally evaluate the Step-GRPO algorithm.

Method ImageReward ↑ PickScore ↑ HPSv3 ↑
SD3.5-Medium (base)1.0116.768.95
+ PickScore & GRPO1.1416.949.64
+ HPSv3 & GRPO1.1516.909.71
+ DRM & GRPO1.1416.9510.07
+ DRM & Step-GRPO (Ours) 1.17 17.04 10.28
Performance of SD3.5-Medium fine-tuned with different reward models. DRM + Step-GRPO sets a new state-of-the-art on all three benchmarks.
Reward growth vs. steps and GPU hours
Reward curves with steps (left) and GPU hours (right) on the x-axis. Even with the smallest group size $k{=}2$, Step-GRPO already outperforms GRPO, and converges roughly 3.5× faster in GPU-hours.

Qualitative Comparison

Qualitative comparison across reward models
Qualitative comparison of SD3.5-Medium optimized by various reward models. DRM + Step-GRPO produces noticeably finer detail, fewer artifacts and more coherent structures than competing methods.

Step-wise Sampling: Training-free Plug-and-Play

Step-wise Sampling needs no fine-tuning: simply branch into $k$ SDE candidates per denoising step and use DRM to greedily pick the best one. We sweep $k \in \{1, 2, 4, 6\}$ on the same protocol as Step-GRPO.

$k$ Time (s) ↓ ImageReward ↑ PickScore ↑ HPSv3 ↑ LPIPS ↑
1 (base)2.881.0116.768.950.650
25.631.0816.849.020.661
47.751.1416.819.320.663
69.831.1516.939.490.662
Performance of SD3.5-Medium with Step-wise Sampling (512×512, 50 steps, bfloat16). All preference metrics improve with $k$, and LPIPS confirms diversity is preserved.
Step-wise Sampling qualitative results
Step-wise Sampling enhances both prompt fidelity and aesthetic quality. Comparison of SD3.5-Medium outputs without/with Step-wise Sampling on the same prompts.

More Visual Results

More visual comparisons
Additional visual comparisons across baselines and our full method (DRM + Step-GRPO + Step-wise Sampling).
Supplementary gallery
Supplementary gallery generated by our final model.