Remask-free self-correction
During generation, decoded tokens can be revised directly, without first being returned to [MASK].
Accepted at ICML 2026
SCDD trains a discrete diffusion language model to revise incorrect visible tokens directly, preserving parallel generation without a remasking step.
¹Purdue University, ²Morgan Stanley
* Equal contribution. † Equal advising. Correspondence: Wei Deng, weideng056@gmail.com
Two SNR-informed schedulers offer separate control over absorbing-mask noise and uniform-transition noise.
Self-correction is learned during pretraining rather than added through post-hoc sampler heuristics.
SCDD decouples token correction from masking. The result is a remask-free sampler with closed-form backward dynamics and separately controlled SNRs.
| Model | Generator \(\displaystyle R_t(\mathbf z_t,\mathbf z_s),\quad \mathbf z_s \neq \mathbf m\) | Self-Correction | Remask-Free | Closed-form Backward | Decoupled SNRs |
|---|---|---|---|---|---|
| MDLM | \(\displaystyle \frac{\gamma_t'}{\gamma_t}\, \mathbf z_t^\top(\mathbf z_s-\mathbf m) \) | × | × | ✓ | - |
| GIDD | \(\displaystyle \begin{aligned} &\left( \frac{\gamma_t'}{\gamma_t} + \frac{\rho_t'}{\rho_t} \right) \mathbf z_s^\top\mathbf z_t - \mathbf z_t^\top \Bigg[ \textcolor{red}{\gamma_t\frac{\rho_t'}{\rho_t}}\mathbf u + \left( \textcolor{red}{(1-\gamma_t)\frac{\rho_t'}{\rho_t}} + \frac{\gamma_t'}{\gamma_t} \right)\mathbf m \Bigg] \end{aligned} \) | ✓ | × | × | × |
| SCDD | \(\displaystyle \begin{aligned} &\left( \frac{\gamma_t'}{\gamma_t} + \frac{\rho_t'}{\rho_t} \right) \mathbf z_s^\top \mathbf z_t - \mathbf z_t^\top \left( \textcolor{blue}{\frac{\rho_t'}{\rho_t}}\mathbf u + \frac{\gamma_t'}{\gamma_t}\mathbf m \right) \end{aligned} \) | ✓ | ✓ | ✓ | ✓ |
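The three generator expressions in the table can be evaluated numerically on a toy vocabulary. The sketch below is illustrative only and is not the authors' implementation: the shorthand `a` for \(\gamma_t'/\gamma_t\) and `b` for \(\rho_t'/\rho_t\), the four-state vocabulary, the numeric values, and the choice of \(\mathbf u\) as a uniform vector over non-mask tokens are all assumptions made for this example. It checks that each expression sums to zero over \(\mathbf z_t\), and that in the SCDD row the rate into [MASK] depends only on `a`, while in the GIDD row it also involves `b` and \(\gamma_t\).

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def one_hot(index: int, size: int) -> Vector:
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def mdlm_rate(z_t: Vector, z_s: Vector, m: Vector, a: float) -> float:
    """MDLM row of the table; `a` stands for gamma_t' / gamma_t."""
    return a * (dot(z_t, z_s) - dot(z_t, m))


def gidd_rate(z_t: Vector, z_s: Vector, m: Vector, u: Vector,
              a: float, b: float, gamma: float) -> float:
    """GIDD row of the table; `b` stands for rho_t' / rho_t."""
    mixed = gamma * b * dot(z_t, u) + ((1.0 - gamma) * b + a) * dot(z_t, m)
    return (a + b) * dot(z_s, z_t) - mixed


def scdd_rate(z_t: Vector, z_s: Vector, m: Vector, u: Vector,
              a: float, b: float) -> float:
    """SCDD row of the table: u is weighted by b only, m by a only."""
    return (a + b) * dot(z_s, z_t) - (b * dot(z_t, u) + a * dot(z_t, m))


# Toy setup (all values hypothetical): 3 ordinary tokens plus [MASK] at index 3.
SIZE = 4
m = one_hot(3, SIZE)
u = [1.0 / 3, 1.0 / 3, 1.0 / 3, 0.0]  # assumed: uniform over non-mask tokens
a, b, gamma = -0.5, -0.2, 0.7         # arbitrary illustrative values
z_s = one_hot(0, SIZE)                # source token, not the mask
states = [one_hot(i, SIZE) for i in range(SIZE)]

mdlm_column = [mdlm_rate(z_t, z_s, m, a) for z_t in states]
gidd_column = [gidd_rate(z_t, z_s, m, u, a, b, gamma) for z_t in states]
scdd_column = [scdd_rate(z_t, z_s, m, u, a, b) for z_t in states]

for name, column in [("MDLM", mdlm_column), ("GIDD", gidd_column), ("SCDD", scdd_column)]:
    print(name, [round(r, 4) for r in column], "sum =", round(sum(column), 12))
```

In this toy setting the SCDD entry for \(\mathbf z_t = \mathbf m\) equals `-a` whatever value `b` takes, which is one concrete way to read the "Decoupled SNRs" column.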
Lower Gen PPL indicates better unconditional text generation quality. SCDD achieves the best value in every reported LM1B and OWT sampling-step column except LM1B at 256 steps, where ReMDM-cap (0.01) scores 96.8 against 102.6 for SCDD (pu = 0.2).
| Model | LM1B, 16 steps | LM1B, 32 steps | LM1B, 64 steps | LM1B, 128 steps | LM1B, 256 steps | OWT, 32 steps | OWT, 64 steps | OWT, 128 steps | OWT, 256 steps | OWT, 512 steps | OWT, 1024 steps |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MDLM† | 226.0 | 162.6 | 136.7 | 123.0 | 118.6 | 169.9 | 123.6 | 104.7 | 94.8 | 91.9 | 88.5 |
| ReMDM-cap (0.01) | 222.1 | 157.5 | 127.0 | 108.9 | 96.8 | 166.3 | 120.9 | 95.9 | 81.7 | 73.9 | 68.3 |
| ReMDM-confidence | 221.1 | 159.5 | 129.8 | 122.8 | 120.4 | 167.6 | 118.3 | 98.1 | 87.9 | 83.9 | 80.5 |
| GIDD+ (pu = 0.1) | 171.1 | 146.4 | 134.9 | 131.9 | 128.7 | 82.1 | 71.4 | 66.7 | 65.0 | 64.8 | 63.8 |
| GIDD+ (pu = 0.2) | 192.7 | 165.5 | 151.9 | 147.3 | 144.8 | 90.5 | 79.0 | 75.1 | 73.2 | 72.0 | 71.2 |
| SCDD (pu = 0.1, ours) | 159.8 | 133.5 | 119.2 | 113.7 | 108.9 | 78.6 | 71.8 | 67.6 | 66.0 | 63.6 | 61.3 |
| SCDD (pu = 0.2, ours) | 159.2 | 130.0 | 115.2 | 108.4 | 102.6 | 74.5 | 67.1 | 60.7 | 59.6 | 58.2 | 55.7 |
†OWT models use a context length of 512 to match the GIDD setting reported in the paper.
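For readers unfamiliar with the metric, the sketch below shows how a perplexity figure like those in the table is usually computed from per-token log-probabilities. It assumes Gen PPL follows the common definition, the perplexity of generated samples under a separate scoring language model. The page does not name the scoring model or say how samples are aggregated, so the function names and the token-weighted aggregation here are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def corpus_perplexity(samples: Sequence[Sequence[float]]) -> float:
    """Token-weighted perplexity over several generated samples.

    Each inner sequence holds the log-probabilities that a scoring model
    (assumed, not specified by the source) assigns to one sample's tokens.
    """
    total = sum(sum(sample) for sample in samples)
    count = sum(len(sample) for sample in samples)
    if count == 0:
        raise ValueError("need at least one token log-probability")
    return math.exp(-total / count)


# A sample whose tokens each receive probability 0.25 has perplexity 4.
uniform_sample = [math.log(0.25)] * 4
print(perplexity(uniform_sample))
```

Because the exponent is an average negative log-probability, a lower value means the scoring model finds the generated text more predictable.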
Citation
@article{wang2026generalized,
title={Generalized Discrete Diffusion with Self-Correction},
author={Wang, Linxuan and Wang, Ziyi and Bai, Yikun and Deng, Wei and Lin, Guang and Song, Qifan},
journal={arXiv preprint arXiv:2603.02230},
year={2026}
}