TL;DR
Continuous diffusion and flow-based language models such as Embedded Language Flows (ELF) have demonstrated their potential for language generation with a small number of sampling steps: ELF generates length-1024 sequences in 8-32 steps and outperforms prior discrete and continuous diffusion language models that use distillation. However, ELF's performance degrades as sampling steps are further reduced, and few-step or even one-step generation remains a challenge for ELF.
We introduce ELF with progressive distillation (ELF+PD), which distills a pretrained ELF teacher into a student model for few-step language generation. After a five-round distillation curriculum, ELF+PD achieves strong performance across 1-32 sampling steps, outperforming distilled discrete and continuous diffusion language model (DLM) baselines. Our results demonstrate the potential of continuous DLMs with distillation for fast language generation.
1 From many steps to few steps
Diffusion and flow-based language models are appealing because they allow flexible generation: in any order, with any number of steps. In theory that means fast generation; however, in practice these models often require many iterative denoising steps.
Embedded Language Flows (ELF) [1] is a language model that operates in a continuous embedding space and is based on continuous-time Flow Matching. It already pushes generation to a small number of steps, e.g., 8-32, without using distillation. But further reducing the sampling budget, toward a handful of steps or a single one, degrades ELF's performance. Few-step, or even one-step, generation remains challenging for diffusion language models.
Progressive distillation [2] has been shown to substantially reduce the number of sampling steps while preserving generation quality. Motivated by this approach, we introduce ELF with progressive distillation (ELF+PD), distilling a trained ELF model into a student that replaces many denoising steps with a single jump.
2 Background
Embedded Language Flows
ELF is formulated in continuous embedding space using Flow Matching: it performs denoising primarily in this space, and converts clean embeddings back to discrete tokens only at the final step. With a clean embedding \(\mathbf{x}\) and noise \(\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})\), the noisy latent is defined by linear interpolation,
\[ \mathbf{z}_t = t\,\mathbf{x} + (1-t)\,\boldsymbol{\epsilon}, \qquad t \in [0,1], \]
so \(\mathbf{z}_0\) is pure noise and \(\mathbf{z}_1\) is clean data. The instantaneous flow velocity is the time derivative of the path: \(\mathbf{v} = \tfrac{d\mathbf{z}_t}{dt} = \mathbf{x}-\boldsymbol{\epsilon}\). Following prior work [3], ELF predicts the clean embeddings \(\mathbf{x}\) (\(\mathbf{x}\)-prediction) and trains with
\[ \mathcal{L}_{\text{MSE}} = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \big\| \mathbf{v}_\theta(\mathbf{z}_t,t) - \mathbf{v} \big\|^2 = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \frac{1}{(1-t)^2} \big\| \mathbf{x}_\theta(\mathbf{z}_t,t) - \mathbf{x} \big\|^2 , \]
where \(\mathbf{v}_\theta = (\mathbf{x}_\theta-\mathbf{z}_t)/(1-t)\).
ELF uses a shared-weight denoiser and decoder: a single network is trained with two branches. The denoising branch is trained with the MSE loss above to predict clean embeddings, and the decoding branch is trained with a cross-entropy loss to predict discrete tokens. Special mode tokens select which branch is active.
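The equality between the velocity loss and the weighted \(\mathbf{x}\)-prediction loss can be checked numerically. The snippet below is a minimal sketch in plain Python, not ELF's training code: the vectors are random toy data, and `x_pred` is a hypothetical stand-in for the network output \(\mathbf{x}_\theta(\mathbf{z}_t,t)\).

```python
import random

def interpolate(x, eps, t):
    """Noisy latent z_t = t*x + (1-t)*eps (t=0 is pure noise, t=1 is clean data)."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

random.seed(0)
dim, t = 8, 0.3
x = [random.gauss(0.0, 1.0) for _ in range(dim)]     # clean embedding (toy data)
eps = [random.gauss(0.0, 1.0) for _ in range(dim)]   # Gaussian noise
x_pred = [xi + random.gauss(0.0, 0.1) for xi in x]   # stand-in for x_theta(z_t, t)

z_t = interpolate(x, eps, t)
v_true = [xi - ei for xi, ei in zip(x, eps)]                 # v = x - eps
v_pred = [(p - z) / (1.0 - t) for p, z in zip(x_pred, z_t)]  # v_theta = (x_theta - z_t)/(1-t)

v_loss = sq_dist(v_pred, v_true)              # velocity-space squared error
x_loss = sq_dist(x_pred, x) / (1.0 - t) ** 2  # weighted x-space squared error
print(v_loss, x_loss)                         # the two values agree up to rounding
```

The identity holds because \(\mathbf{x}-\mathbf{z}_t = (1-t)(\mathbf{x}-\boldsymbol{\epsilon})\), so the true velocity can itself be written as \((\mathbf{x}-\mathbf{z}_t)/(1-t)\).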
At inference, ELF iteratively transforms noisy samples into clean embeddings by solving the ODE using a numerical (e.g., Euler) solver, and decodes the final embeddings back to discrete tokens by switching to the decoding mode.
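The Euler sampling loop described above can be sketched as follows. This is an illustrative toy, not ELF's sampler: `euler_sample`, the uniform time grid, and the `oracle` denoiser (which always returns a fixed clean embedding) are assumptions made for the example, and the final decoding step to tokens is omitted.

```python
def euler_sample(x_theta, z0, times):
    """Integrate dz/dt = (x_theta(z, t) - z) / (1 - t) with Euler steps.

    `times` is an increasing grid from 0 (noise) to 1 (data); the network is
    never evaluated at t = 1, so the division is always well defined.
    """
    z = list(z0)
    for t, r in zip(times[:-1], times[1:]):
        x_hat = x_theta(z, t)                                    # predicted clean embedding
        v = [(xh - zi) / (1.0 - t) for xh, zi in zip(x_hat, z)]  # implied velocity
        z = [zi + (r - t) * vi for zi, vi in zip(z, v)]          # Euler update
    return z

# Toy denoiser: pretend the data distribution is a single point x_star.
x_star = [0.5, -1.0, 2.0]
oracle = lambda z, t: x_star

num_steps = 8
times = [i / num_steps for i in range(num_steps + 1)]  # uniform grid (an assumption)
z_final = euler_sample(oracle, [1.3, 0.2, -0.7], times)
print(z_final)  # lands on x_star after the last step
```

With \(\mathbf{x}\)-prediction, the last Euler step (ending at \(t=1\)) returns the current prediction itself, which is why the final latent can be handed directly to the decoding branch.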
Why predict x, and why one shared network?
Because \(\mathbf{x}\)-prediction outputs the clean embeddings, this naturally aligns the objective of denoising with the objective of predicting clean discrete tokens at the final step. Therefore, ELF does not require a separate decoder. Instead, it shares weights between the denoiser and decoder, and projects clean embeddings to token logits through a learnable unembedding matrix. The \(\mathbf{x}\)-prediction parameterization is also more effective for high-dimensional data, as shown in [3].
Progressive distillation
Progressive distillation [2] distills a many-step teacher into a fast few-step student by repeatedly halving the number of sampling steps. In each round, the student is trained so that a single sampling step matches two teacher steps. The distilled student then becomes the teacher for the next round, and the procedure repeats. This approach reduces a many-step sampler to a few steps with little loss in generation quality.
3 Distilling ELF
Training objective
We distill a fully trained ELF teacher into a few-step student. Given a time interval \([t,r]\), we move the noisy embedding \(\mathbf{z}_t\) to \(\mathbf{z}_r\) by running the teacher's numerical (e.g., Euler) solver from \(t\) to \(r\). We then convert the resulting displacement \(\mathbf{z}_r-\mathbf{z}_t\) into a new target:
\[ \tilde{\mathbf{x}} = \mathbf{z}_t + \frac{1-t}{\,r-t\,}\,(\mathbf{z}_r-\mathbf{z}_t). \]
We train the student by minimizing
\[ \mathcal{L}_{\text{distill}} = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \big\| \mathbf{x}_{\theta}(\mathbf{z}_t,t) - \tilde{\mathbf{x}} \big\|^2 . \]
In this way, the student's one step approximates the teacher's many steps.
We keep ELF's two-branch setup and simply replace the denoising branch's MSE objective with \(\mathcal{L}_{\text{distill}}\); the decoder branch is unchanged and keeps its cross-entropy loss.
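To make the target \(\tilde{\mathbf{x}}\) concrete, here is a small sketch of the construction. The helper names, the uniform sub-step grid inside \([t,r]\), and the `toy_teacher` function are all hypothetical; the point is only that a student predicting exactly \(\tilde{\mathbf{x}}\) reproduces the teacher's multi-step endpoint \(\mathbf{z}_r\) in one Euler step.

```python
import math

def teacher_rollout(x_theta, z_t, t, r, n_sub):
    """Run n_sub teacher Euler sub-steps from time t to time r (uniform grid assumed)."""
    z = list(z_t)
    ds = (r - t) / n_sub
    for k in range(n_sub):
        s = t + k * ds
        x_hat = x_theta(z, s)
        z = [zi + ds * (xh - zi) / (1.0 - s) for zi, xh in zip(z, x_hat)]
    return z

def distill_target(z_t, z_r, t, r):
    """x_tilde = z_t + (1 - t) / (r - t) * (z_r - z_t)."""
    scale = (1.0 - t) / (r - t)
    return [zt + scale * (zr - zt) for zt, zr in zip(z_t, z_r)]

def one_euler_step(x_hat, z_t, t, r):
    """Single Euler step from t to r using an x-prediction x_hat."""
    return [zi + (r - t) * (xh - zi) / (1.0 - t) for zi, xh in zip(z_t, x_hat)]

# Arbitrary nonlinear stand-in for a trained teacher denoiser.
toy_teacher = lambda z, s: [math.tanh(zi) + 0.1 * s for zi in z]

t, r = 0.2, 0.7
z_t = [1.5, -0.4, 0.9]
z_r = teacher_rollout(toy_teacher, z_t, t, r, n_sub=4)
x_tilde = distill_target(z_t, z_r, t, r)

# A student that outputs x_tilde jumps to the teacher's endpoint in one step.
z_student = one_euler_step(x_tilde, z_t, t, r)
print(z_r, z_student)
```

The distillation loss then regresses the student's output \(\mathbf{x}_\theta(\mathbf{z}_t,t)\) onto `x_tilde` with a squared error.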
Curriculum
Following prior work (Progressive Distillation [2], SDTT [4], Duo [5]), we use a five-round curriculum: halve the number of student steps each round, initializing each student from the previous one (the first round starts from the teacher). We keep the teacher's total budget fixed at 64 steps and spread it evenly across the student's steps: for an \(N\)-step student, each student step is trained to match \(64/N\) teacher ODE sub-steps, so the two step counts always multiply to 64.
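The bookkeeping of the curriculum can be written out as a short sketch. The starting point of 16 student steps is an inference, not something stated explicitly above: it is the value for which five halving rounds end at a one-step student. The function and field names are illustrative.

```python
TEACHER_BUDGET = 64  # total teacher steps, fixed across rounds

def curriculum(first_student_steps=16, rounds=5):
    """List (round, student steps N, teacher sub-steps per student step 64/N).

    first_student_steps=16 is an assumption chosen so that round 5 is one-step.
    """
    plan = []
    n = first_student_steps
    for rnd in range(1, rounds + 1):
        plan.append({
            "round": rnd,
            "student_steps": n,
            "teacher_substeps": TEACHER_BUDGET // n,
        })
        n //= 2  # halve the student's step count for the next round
    return plan

plan = curriculum()
for row in plan:
    print(row)
```

Under this assumption the teacher sub-steps per student step grow as 4, 8, 16, 32, 64 while the student shrinks as 16, 8, 4, 2, 1.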
How to sample the time steps?
Both the student time steps and the teacher's intermediate sub-steps are sampled from the same logit-normal schedule used to train the teacher. This schedule places more probability mass near the noise end (\(t=0\)) and less near the data end (\(t=1\)).
4 Experiments
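For readers unfamiliar with the logit-normal distribution, the following sketch draws time values by passing a Gaussian sample through a sigmoid. The `mean` and `std` values are hypothetical (the schedule's actual parameters are not given here); a negative mean is used only because it skews the mass toward \(t=0\), matching the description above.

```python
import math
import random

def sample_logit_normal(rng, mean=-0.8, std=1.0):
    """Draw t in (0, 1) as sigmoid(u) with u ~ N(mean, std^2).

    mean and std are illustrative assumptions, not the values used by ELF.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
ts = [sample_logit_normal(rng) for _ in range(10_000)]
frac_noisy_half = sum(t < 0.5 for t in ts) / len(ts)
print(min(ts), max(ts), frac_noisy_half)  # most samples fall in the noisier half
```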
Setup
We evaluate model performance by generating 1,000 samples. We report generative perplexity (Gen. PPL, lower is better) under GPT-2-Large and mean unigram entropy (higher is better). The former measures sample quality, while the latter measures diversity. All experiments use the same SDE-inspired sampler, varying the number of sampling steps and the guidance strength (see setup below).
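As a reference for the diversity metric, here is a minimal sketch of mean unigram entropy. It assumes the entropy is computed per sample over that sample's token frequencies, in nats, and then averaged; the log base and tokenizer are not specified above, so treat those choices as assumptions.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (nats, assumed) of the token frequency distribution of one sample."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_unigram_entropy(samples):
    """Average the per-sample unigram entropy over a list of token sequences."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)

repetitive = [7] * 16          # one token repeated: fully degenerate
varied = [1, 2, 3, 4] * 4      # four tokens, equally frequent
print(unigram_entropy(repetitive), unigram_entropy(varied))
print(mean_unigram_entropy([repetitive, varied]))
```

A sample that repeats a single token scores zero, which is why a very low generative perplexity paired with low entropy signals degenerate text rather than high quality.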
Full training & sampling setup
Training. We train on OpenWebText [6] (~9B tokens), packing sequences to length \(L=1024\), one epoch per distillation round. As in ELF, we use training-time classifier-free guidance with self-conditioning: a CFG scale is sampled per example, and the teacher and student are conditioned on the same self-conditioning CFG scale.
Sampling. We use an SDE-inspired sampler with the same logit-normal time schedule from training. For 1-, 2-, 4-, and 8-step generation, we set the noise re-injection scale to \(\gamma=1.5\) and the self-conditioning CFG scale to \(2.5\). For 16- and 32-step generation, we use \(\gamma=2.0\) and a self-conditioning CFG scale of \(2.0\).
How the SDE-inspired sampler works
The SDE-inspired sampler re-injects noise at each step: the current sample is mixed with fresh noise \(\boldsymbol{\epsilon}\) and its time step is moved back to an earlier, noisier time \(t_{\text{back}}=\alpha\,t_i\) with \(\alpha = 1-\gamma\,\Delta t\). This introduces stochasticity into the sampling process and reduces error accumulation. The scale \(\gamma\) controls how much noise is re-injected; \(\gamma=0\) recovers the plain ODE (Euler) step. We use \(\gamma=1.5\) for 1–8 step generation and \(\gamma=2.0\) for 16- and 32-step generation.
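One possible reading of the re-injection step is sketched below. Only \(t_{\text{back}}=\alpha\,t_i\) and \(\alpha=1-\gamma\,\Delta t\) come from the description above; the mixing rule (scale the sample by \(\alpha\), then add just enough fresh noise that the noise coefficient becomes \(1-t_{\text{back}}\)), the clamping of \(\alpha\) at zero, and the interpretation of `dt` as the step size are assumptions made for illustration.

```python
import math
import random

def reinject_noise(z, t, dt, gamma, rng):
    """Move a sample at time t back to the noisier time t_back = alpha * t.

    Assumed mixing rule: z_back = alpha * z + sigma * fresh_noise, where sigma
    lifts the noise coefficient from alpha * (1 - t) to (1 - t_back), so that
    z_back matches the interpolation z_t = t*x + (1-t)*eps at time t_back.
    """
    alpha = max(0.0, 1.0 - gamma * dt)   # clamp at 0 (assumption for large dt)
    t_back = alpha * t
    extra_var = (1.0 - t_back) ** 2 - (alpha * (1.0 - t)) ** 2
    sigma = math.sqrt(max(0.0, extra_var))
    z_back = [alpha * zi + sigma * rng.gauss(0.0, 1.0) for zi in z]
    return z_back, t_back

rng = random.Random(0)
z = [0.4, -1.2, 0.8]

z_ode, t_ode = reinject_noise(z, t=0.5, dt=0.125, gamma=0.0, rng=rng)  # no re-injection
z_sde, t_sde = reinject_noise(z, t=0.5, dt=0.125, gamma=1.5, rng=rng)  # noisier restart
print(t_ode, t_sde)
```

With `gamma=0.0` the sample and its time are returned unchanged, consistent with the plain ODE step; a denoising step would then be taken from `t_back` rather than from `t`.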
Results
Few-step generation. Against representative distilled discrete baselines (MDLM [7] + SDTT [4], Duo [5] + DCD) and the continuous flow-matching model FMLM [8], ELF+PD achieves the lowest generative perplexity at every sampling budget while maintaining reasonable entropy. Benefiting from the data efficiency of ELF, ELF+PD also uses substantially fewer training tokens than the baselines.
Few-step generation: exact numbers
| Steps | MDLM + SDTT PPL ↓ | MDLM + SDTT Entropy ↑ | Duo + DCD PPL ↓ | Duo + DCD Entropy ↑ | FMLM PPL ↓ | FMLM Entropy ↑ | ELF+PD (Ours) PPL ↓ | ELF+PD (Ours) Entropy ↑ |
|---|---|---|---|---|---|---|---|---|
| 1 | 1260.60 | 5.26 | 5743.90 | 6.02 | 168.30 | 5.17 | 136.10 | 5.26 |
| 2 | 877.22 | 5.34 | 891.16 | 5.41 | 133.29 | 5.25 | 68.25 | 5.24 |
| 4 | 339.73 | 5.38 | 250.86 | 5.37 | 111.31 | 5.26 | 34.33 | 5.16 |
| 8 | 112.66 | 5.41 | 118.21 | 5.41 | 86.50 | 5.36 | 23.18 | 5.07 |
| 16 | 57.74 | 5.39 | 78.74 | 5.43 | 63.63 | 5.29 | 22.12 | 5.06 |
| 32 | 40.41 | 5.34 | 63.98 | 5.40 | 45.09 | 5.25 | 21.32 | 5.04 |
Effect of the distillation curriculum. Early-round models perform well only at larger sampling budgets and collapse to degenerate outputs at small budgets: the very low one-step perplexities of rounds r1–r3 come with collapsed entropy (1.72, 0.64, and 3.60, starred in the table) and do not indicate good samples. Later-round models substantially improve 1–4 step generation while maintaining reasonable entropy. After the final round, the one-step student achieves the best non-degenerate generative perplexity at 1, 2, 4, and 8 steps.
| Round | 1 step PPL | 1 step Entropy | 2 steps PPL | 2 steps Entropy | 4 steps PPL | 4 steps Entropy | 8 steps PPL | 8 steps Entropy |
|---|---|---|---|---|---|---|---|---|
| r1 | 4.9 | 1.72* | 119.4 | 5.10 | 143.4 | 5.37 | 58.0 | 5.26 |
| r2 | 1.7 | 0.64* | 171.1 | 5.24 | 128.6 | 5.40 | 36.6 | 5.24 |
| r3 | 30.7 | 3.60* | 153.1 | 5.44 | 69.0 | 5.37 | 27.9 | 5.20 |
| r4 | 165.9 | 5.35 | 92.2 | 5.36 | 46.1 | 5.28 | 27.3 | 5.19 |
| r5 | 136.1 | 5.26 | 68.2 | 5.24 | 34.3 | 5.16 | 23.2 | 5.07 |
5 What the samples look like
Examples from the final one-step student, at each sampling budget using the same sampling configuration.
References
- [1] K. Hu, L. Qiu, et al. ELF: Embedded Language Flows. arXiv:2605.10938.
- [2] T. Salimans & J. Ho. Progressive Distillation for Fast Sampling of Diffusion Models. ICLR 2022. arXiv:2202.00512.
- [3] T. Li et al. Back to Basics: Let Denoising Generative Models Denoise. arXiv:2511.13720.
- [4] J. Deschenaux & C. Gulcehre. Beyond Autoregression: Fast LLMs via Self-Distillation Through Time (SDTT). ICLR 2025. arXiv:2410.21035.
- [5] S. Sahoo et al. The Diffusion Duality (Duo). ICML 2025. arXiv:2506.10892.
- [6] A. Gokaslan & V. Cohen. OpenWebText Corpus. skylion007.github.io/OpenWebTextCorpus, 2019.
- [7] S. Sahoo et al. Simple and Effective Masked Diffusion Language Models (MDLM). NeurIPS 2024. arXiv:2406.07524.
- [8] Lee et al. Flow Map Language Models: One-step Language Modeling via Continuous Denoising (FMLM). arXiv:2602.16813.