Accepted at ICML 2026

Generalized Discrete Diffusion with Self-Correction

SCDD trains a discrete diffusion language model to revise incorrect visible tokens directly, preserving parallel generation without a remasking step.

Linxuan Wang*1, Ziyi Wang*1, Yikun Bai*1, Wei Deng2, Guang Lin†1, Qifan Song†1

1 Purdue University, 2 Morgan Stanley

* Equal contribution. † Equal advising. Correspondence: Wei Deng, weideng056@gmail.com

Demo of Self-Correction Without Remasking

Animation of a real SCDD generation trajectory: a 512-token sequence denoising from all [MASK] to text, with decoded tokens revised in place (without remasking) as generation proceeds.

Unlike masked-only denoising, SCDD can move between non-mask states during generation: a visible mistake can be corrected in place while masked positions continue to denoise in parallel.

Remask-free self-correction

During generation, decoded tokens are revised directly in place rather than being returned to [MASK] first.

Explicit transitions

Two SNR-informed schedulers offer separate control over absorbing-mask noise and uniform-transition noise.

Learned self-correction ability

Self-correction is learned during pretraining rather than added through post-hoc sampler heuristics.

Generator-Level Comparison

SCDD decouples token correction from masking. The result is a remask-free sampler with closed-form backward dynamics and separately controlled SNRs.

Comparison of MDLM, GIDD, and SCDD.
Model Generator \(\displaystyle R_t(\mathbf z_t,\mathbf z_s),\quad \mathbf z_s \neq \mathbf m\) Self-Correction Remask-Free Closed-form Backward Decoupled SNRs
MDLM \(\displaystyle \frac{\gamma_t'}{\gamma_t}\, \mathbf z_t^\top(\mathbf z_s-\mathbf m) \) × × -
GIDD \(\displaystyle \begin{aligned} &\left( \frac{\gamma_t'}{\gamma_t} + \frac{\rho_t'}{\rho_t} \right) \mathbf z_s^\top\mathbf z_t - \mathbf z_t^\top \Bigg[ \textcolor{red}{\gamma_t\frac{\rho_t'}{\rho_t}}\mathbf u + \left( \textcolor{red}{(1-\gamma_t)\frac{\rho_t'}{\rho_t}} + \frac{\gamma_t'}{\gamma_t} \right)\mathbf m \Bigg] \end{aligned} \) × × ×
SCDD \(\displaystyle \begin{aligned} &\left( \frac{\gamma_t'}{\gamma_t} + \frac{\rho_t'}{\rho_t} \right) \mathbf z_s^\top \mathbf z_t - \mathbf z_t^\top \left( \textcolor{blue}{\frac{\rho_t'}{\rho_t}}\mathbf u + \frac{\gamma_t'}{\gamma_t}\mathbf m \right) \end{aligned} \)
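
To make the difference between the GIDD and SCDD rows concrete, the sketch below reads off the coefficients that multiply \(\mathbf u\) and \(\mathbf m\) in the bracketed term of each generator, transcribed directly from the table. The function names and the numeric values used for \(\gamma_t\), \(\gamma_t'/\gamma_t\), and \(\rho_t'/\rho_t\) are illustrative assumptions, not values from the paper.

```python
def gidd_coefficients(gamma, dlog_gamma, dlog_rho):
    """Coefficients on u and m in the GIDD generator row of the table.

    dlog_gamma stands for gamma_t'/gamma_t and dlog_rho for rho_t'/rho_t.
    """
    coef_u = gamma * dlog_rho
    coef_m = (1.0 - gamma) * dlog_rho + dlog_gamma
    return coef_u, coef_m


def scdd_coefficients(gamma, dlog_gamma, dlog_rho):
    """Coefficients on u and m in the SCDD generator row of the table.

    gamma is accepted only to keep the two signatures parallel; it is unused.
    """
    coef_u = dlog_rho
    coef_m = dlog_gamma
    return coef_u, coef_m


# Hypothetical schedule values, chosen only for illustration.
dlog_gamma, dlog_rho = -0.5, -2.0
for gamma in (0.25, 0.75):
    print(
        gamma,
        gidd_coefficients(gamma, dlog_gamma, dlog_rho),
        scdd_coefficients(gamma, dlog_gamma, dlog_rho),
    )
```

In the GIDD row, both coefficients change with \(\gamma_t\) even when the two ratios are held fixed. In the SCDD row, the \(\mathbf u\) coefficient depends only on \(\rho_t'/\rho_t\) and the \(\mathbf m\) coefficient only on \(\gamma_t'/\gamma_t\), which appears to be what the "Decoupled SNRs" column refers to.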

Generative Perplexity

Lower Gen PPL indicates better unconditional text generation quality. SCDD achieves the lowest value in every reported LM1B and OWT sampling-step column except LM1B at 256 steps, where ReMDM-cap (0.01) is lower (96.8 vs. 102.6).
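
As a reminder of what the metric measures, the sketch below computes perplexity from per-token log-probabilities. It assumes the common convention that generative perplexity is the perplexity of generated samples under a separate evaluator language model. This page does not name the evaluator, and the function name and log-probability values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities that an evaluator model
    assigns to each token of the generated text (assumed input).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities for a four-token sample.
sample = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(sample))
```

A sample whose tokens the evaluator finds more predictable has higher log-probabilities and therefore lower perplexity, which is why lower values are read as better generation quality.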

Generative perplexity on LM1B and OWT across sampling steps. Lower is better.
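| Model | LM1B 16 | LM1B 32 | LM1B 64 | LM1B 128 | LM1B 256 | OWT 32 | OWT 64 | OWT 128 | OWT 256 | OWT 512 | OWT 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDLM | 226.0 | 162.6 | 136.7 | 123.0 | 118.6 | 169.9 | 123.6 | 104.7 | 94.8 | 91.9 | 88.5 |
| ReMDM-cap (0.01) | 222.1 | 157.5 | 127.0 | 108.9 | 96.8 | 166.3 | 120.9 | 95.9 | 81.7 | 73.9 | 68.3 |
| ReMDM-confidence | 221.1 | 159.5 | 129.8 | 122.8 | 120.4 | 167.6 | 118.3 | 98.1 | 87.9 | 83.9 | 80.5 |
| GIDD+ (p_u = 0.1) | 171.1 | 146.4 | 134.9 | 131.9 | 128.7 | 82.1 | 71.4 | 66.7 | 65.0 | 64.8 | 63.8 |
| GIDD+ (p_u = 0.2) | 192.7 | 165.5 | 151.9 | 147.3 | 144.8 | 90.5 | 79.0 | 75.1 | 73.2 | 72.0 | 71.2 |
| SCDD (p_u = 0.1, ours) | 159.8 | 133.5 | 119.2 | 113.7 | 108.9 | 78.6 | 71.8 | 67.6 | 66.0 | 63.6 | 61.3 |
| SCDD (p_u = 0.2, ours) | 159.2 | 130.0 | 115.2 | 108.4 | 102.6 | 74.5 | 67.1 | 60.7 | 59.6 | 58.2 | 55.7 |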

OWT models use a context length of 512 to match the GIDD setting reported in the paper.
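
The per-column comparison can be checked mechanically. The sketch below transcribes the Gen PPL values from the table above and reports which model attains the lowest value for each dataset and step count. The dictionary layout and the helper name `best_per_column` are illustrative choices, not part of the paper's code.

```python
# Gen PPL values transcribed from the table above (lower is better).
STEPS = {
    "LM1B": [16, 32, 64, 128, 256],
    "OWT": [32, 64, 128, 256, 512, 1024],
}
GEN_PPL = {
    "MDLM": {"LM1B": [226.0, 162.6, 136.7, 123.0, 118.6],
             "OWT": [169.9, 123.6, 104.7, 94.8, 91.9, 88.5]},
    "ReMDM-cap (0.01)": {"LM1B": [222.1, 157.5, 127.0, 108.9, 96.8],
                         "OWT": [166.3, 120.9, 95.9, 81.7, 73.9, 68.3]},
    "ReMDM-confidence": {"LM1B": [221.1, 159.5, 129.8, 122.8, 120.4],
                         "OWT": [167.6, 118.3, 98.1, 87.9, 83.9, 80.5]},
    "GIDD+ (p_u = 0.1)": {"LM1B": [171.1, 146.4, 134.9, 131.9, 128.7],
                          "OWT": [82.1, 71.4, 66.7, 65.0, 64.8, 63.8]},
    "GIDD+ (p_u = 0.2)": {"LM1B": [192.7, 165.5, 151.9, 147.3, 144.8],
                          "OWT": [90.5, 79.0, 75.1, 73.2, 72.0, 71.2]},
    "SCDD (p_u = 0.1)": {"LM1B": [159.8, 133.5, 119.2, 113.7, 108.9],
                         "OWT": [78.6, 71.8, 67.6, 66.0, 63.6, 61.3]},
    "SCDD (p_u = 0.2)": {"LM1B": [159.2, 130.0, 115.2, 108.4, 102.6],
                         "OWT": [74.5, 67.1, 60.7, 59.6, 58.2, 55.7]},
}


def best_per_column(table, steps):
    """Return {(dataset, n_steps): (model, value)} with the lowest Gen PPL."""
    best = {}
    for dataset, step_list in steps.items():
        for i, n_steps in enumerate(step_list):
            model = min(table, key=lambda name: table[name][dataset][i])
            best[(dataset, n_steps)] = (model, table[model][dataset][i])
    return best


for column, winner in best_per_column(GEN_PPL, STEPS).items():
    print(column, winner)
```

With these values, SCDD (p_u = 0.2) is lowest in ten of the eleven columns. The exception is LM1B at 256 steps, where ReMDM-cap (0.01) reaches 96.8.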

Citation

BibTeX

@article{wang2026generalized,
  title={Generalized Discrete Diffusion with Self-Correction},
  author={Wang, Linxuan and Wang, Ziyi and Bai, Yikun and Deng, Wei and Lin, Guang and Song, Qifan},
  journal={arXiv preprint arXiv:2603.02230},
  year={2026}
}