Progressive Distillation of ELF: Few-step generation for embedded language flows

Linlu Qiu*, Keya Hu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He
MIT  ·  * equal contribution; order decided by a coin flip

TL;DR

Continuous diffusion and flow-based language models such as Embedded Language Flows (ELF) have demonstrated their potential for language generation with a small number of sampling steps: ELF generates length-1024 sequences in 8-32 steps and outperforms prior discrete and continuous diffusion language models that use distillation. However, ELF's performance degrades as sampling steps are further reduced, and few-step or even one-step generation remains a challenge for ELF.

We introduce ELF with progressive distillation (ELF+PD), which distills a pretrained ELF teacher into a student model for few-step language generation. After a five-round distillation curriculum, ELF+PD achieves strong performance across 1-32 sampling steps, outperforming distilled discrete and continuous diffusion language model (DLM) baselines. Our results demonstrate the potential of continuous DLMs with distillation for fast language generation.

1 From many steps to few steps

Diffusion and flow-based language models are appealing because they allow flexible generation: in any order, with any number of steps. In theory that means fast generation; however, in practice these models often require many iterative denoising steps.

Embedded Language Flows (ELF) [1] is a language model that operates in a continuous embedding space and is based on continuous-time Flow Matching. It already pushes generation down to a small number of steps (e.g., 8-32) without distillation. But further reducing the sampling budget, toward a handful of steps or a single one, degrades ELF's performance. Few-step, or even one-step, generation remains challenging for diffusion language models.

Progressive distillation [2] has been shown to substantially reduce the number of sampling steps while preserving generation quality. Motivated by this approach, we introduce ELF with progressive distillation (ELF+PD), distilling a trained ELF model into a student that replaces many denoising steps with a single jump.

2 Background

Embedded Language Flows

ELF is formulated in continuous embedding space using Flow Matching: it performs denoising primarily in this space, and converts clean embeddings back to discrete tokens only at the final step. With a clean embedding \(\mathbf{x}\) and noise \(\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})\), the noisy latent is defined by linear interpolation,

\[ \mathbf{z}_t = t\,\mathbf{x} + (1-t)\,\boldsymbol{\epsilon}, \qquad t \in [0,1], \]

so \(\mathbf{z}_0\) is pure noise and \(\mathbf{z}_1\) is clean data. The instantaneous flow velocity is the time derivative of the path, \(\mathbf{v} = \tfrac{d\mathbf{z}_t}{dt} = \mathbf{x}-\boldsymbol{\epsilon}\). Following prior work [3], ELF predicts the clean embeddings \(\mathbf{x}\) (\(\mathbf{x}\)-prediction) and trains with

\[ \mathcal{L}_{\text{MSE}} = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \big\| \mathbf{v}_\theta(\mathbf{z}_t,t) - \mathbf{v} \big\|^2 = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \frac{1}{(1-t)^2} \big\| \mathbf{x}_\theta(\mathbf{z}_t,t) - \mathbf{x} \big\|^2 , \]

where \(\mathbf{v}_\theta = (\mathbf{x}_\theta-\mathbf{z}_t)/(1-t)\).
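The equality between the two forms of the loss follows directly from the interpolation. As a short worked step, using only the definitions above:

\[ \mathbf{x}-\mathbf{z}_t = (1-t)\,(\mathbf{x}-\boldsymbol{\epsilon}) \;\;\Rightarrow\;\; \mathbf{v} = \frac{\mathbf{x}-\mathbf{z}_t}{1-t}, \qquad \mathbf{v}_\theta - \mathbf{v} = \frac{\mathbf{x}_\theta-\mathbf{z}_t}{1-t} - \frac{\mathbf{x}-\mathbf{z}_t}{1-t} = \frac{\mathbf{x}_\theta-\mathbf{x}}{1-t}. \]

Taking the squared norm gives the \(1/(1-t)^2\) weighting on the \(\mathbf{x}\)-prediction error.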

ELF uses a shared-weight denoiser and decoder: a single network is trained with two branches. The denoising branch is trained with the MSE loss above to predict clean embeddings, and the decoding branch is trained with a cross-entropy loss to predict discrete tokens. Special mode tokens control which mode the network runs in.

At inference, ELF iteratively transforms noisy samples into clean embeddings by solving the ODE using a numerical (e.g., Euler) solver, and decodes the final embeddings back to discrete tokens by switching to the decoding mode.
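The denoising loop can be illustrated with a minimal sketch. It assumes a uniform time grid and a toy denoiser that always returns a fixed clean embedding; both are simplifications for illustration, and the names `euler_sample` and `oracle` are hypothetical. The final decoding to tokens is omitted.

```python
import random

def euler_sample(x_pred, z0, num_steps):
    """Euler integration of dz/dt = (x_pred(z, t) - z) / (1 - t) from t=0 to t=1."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x_hat = x_pred(z, t)
        # Velocity implied by x-prediction: v = (x_hat - z) / (1 - t).
        z = [zi + dt * (xi - zi) / (1.0 - t) for zi, xi in zip(z, x_hat)]
    return z

# Toy stand-in for the network (assumption): always predicts the same clean embedding.
target = [0.5, -1.0, 2.0]
oracle = lambda z, t: target

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in target]  # z_0 is pure noise
z1 = euler_sample(oracle, z0, num_steps=8)
print(z1)
```

With a perfect denoiser the last Euler step lands on the predicted clean embedding, since \(\Delta t = 1-t\) at that step. A learned denoiser would only approximate this.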

Why predict x, and why one shared network?

Because \(\mathbf{x}\)-prediction outputs the clean embeddings, the denoising objective naturally aligns with the objective of predicting clean discrete tokens at the final step. ELF therefore does not require a separate decoder: it shares weights between the denoiser and decoder, and projects clean embeddings to token logits through a learnable unembedding matrix. The \(\mathbf{x}\)-prediction parameterization is also more effective for high-dimensional data, as shown in [3].

Progressive distillation

Progressive distillation [2] distills a many-step teacher into a fast few-step student by repeatedly halving the number of sampling steps. In each round, the student is trained so that a single sampling step matches two teacher steps. The distilled student then becomes the teacher for the next round, and the procedure repeats. This approach effectively reduces a many-step sampler down to a few steps without losing generation quality.

3 Distilling ELF

Training objective

We distill a fully trained ELF teacher into a few-step student. Given a time interval \([t,r]\), we move the noisy embedding \(\mathbf{z}_t\) to \(\mathbf{z}_r\) by running the teacher's numerical (e.g., Euler) solver from \(t\) to \(r\). We then convert the resulting displacement \(\mathbf{z}_r-\mathbf{z}_t\) into a new target:

\[ \tilde{\mathbf{x}} = \mathbf{z}_t + \frac{1-t}{\,r-t\,}\,(\mathbf{z}_r-\mathbf{z}_t). \]
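One way to read this target: it is the clean-embedding prediction for which a single Euler step from \(t\) to \(r\) lands exactly on the teacher's \(\mathbf{z}_r\). Substituting \(\tilde{\mathbf{x}}\) into the velocity \((\tilde{\mathbf{x}}-\mathbf{z}_t)/(1-t)\) gives

\[ \mathbf{z}_t + (r-t)\,\frac{\tilde{\mathbf{x}}-\mathbf{z}_t}{1-t} = \mathbf{z}_t + (r-t)\cdot\frac{1}{1-t}\cdot\frac{1-t}{r-t}\,(\mathbf{z}_r-\mathbf{z}_t) = \mathbf{z}_r. \]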

We train the student by minimizing

\[ \mathcal{L}_{\text{distill}} = \mathbb{E}_{t,\mathbf{x},\boldsymbol{\epsilon}} \big\| \mathbf{x}_{\theta}(\mathbf{z}_t,t) - \tilde{\mathbf{x}} \big\|^2 . \]

In this way, a single student step approximates the teacher's multi-step trajectory from \(t\) to \(r\).
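The target construction can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the training code: the teacher is a stand-in that always predicts a fixed clean embedding, the sub-steps are uniform, and the helper names (`teacher_rollout`, `distill_target`, `distill_loss`) are hypothetical.

```python
def teacher_rollout(x_pred, z_t, t, r, sub_steps):
    """Run the teacher's Euler solver from time t to time r in equal sub-steps."""
    z = list(z_t)
    dt = (r - t) / sub_steps
    for k in range(sub_steps):
        s = t + k * dt
        x_hat = x_pred(z, s)
        z = [zi + dt * (xi - zi) / (1.0 - s) for zi, xi in zip(z, x_hat)]
    return z

def distill_target(z_t, z_r, t, r):
    """x_tilde = z_t + (1 - t) / (r - t) * (z_r - z_t)."""
    scale = (1.0 - t) / (r - t)
    return [a + scale * (b - a) for a, b in zip(z_t, z_r)]

def distill_loss(x_student, x_tilde):
    """Squared error between the student's prediction and the target (one example)."""
    return sum((a - b) ** 2 for a, b in zip(x_student, x_tilde))

# Toy teacher (assumption): always predicts the same clean embedding.
clean = [1.0, -2.0]
teacher = lambda z, s: clean

z_t, t, r = [0.3, 0.7], 0.25, 0.5
z_r = teacher_rollout(teacher, z_t, t, r, sub_steps=4)
x_tilde = distill_target(z_t, z_r, t, r)

# One student Euler step with prediction x_tilde reproduces the teacher's endpoint.
student_step = [a + (r - t) * (b - a) / (1.0 - t) for a, b in zip(z_t, x_tilde)]
print(x_tilde, z_r, student_step)
```

For this toy teacher the target coincides with the teacher's own clean prediction. With a real teacher whose prediction changes along the trajectory, the target instead summarizes the net displacement over the interval.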

We keep ELF's two-branch setup and simply replace the denoising branch's MSE objective with \(\mathcal{L}_{\text{distill}}\); the decoding branch is unchanged and keeps its cross-entropy loss.

Curriculum

Following prior work (Progressive Distillation [2], SDTT [4], Duo [5]), we use a five-round curriculum that halves the number of student steps each round and initializes each student from the previous one (the first round starts from the teacher). We keep the teacher's total budget fixed at 64 steps and spread it evenly across the student's steps: each step of an \(N\)-step student is supervised by \(64/N\) teacher ODE sub-steps, so the two counts always multiply to 64.
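The step counts per round can be tabulated with a short sketch. It takes the 64-step teacher budget and the 16-step first round from the description of the curriculum; the function name `curriculum` is hypothetical.

```python
TEACHER_BUDGET = 64  # total teacher steps, held fixed across rounds

def curriculum(first_student_steps=16, rounds=5):
    """Return (round name, student steps N, teacher sub-steps per student step)."""
    schedule = []
    n = first_student_steps
    for rnd in range(1, rounds + 1):
        schedule.append((f"r{rnd}", n, TEACHER_BUDGET // n))
        n //= 2  # halve the student's step count for the next round
    return schedule

schedule = curriculum()
for name, n, k in schedule:
    print(f"{name}: {n:2d} student steps x {k:2d} teacher sub-steps")
```

Each row multiplies to 64, so the final round trains a one-step student against a 64-step teacher rollout.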

[Figure: How one distillation step works (top: along the path from noise to data, the teacher moves from z_t to z_r in many small ODE steps and the student in one big jump) and the five-round curriculum (bottom: teacher 64 steps; student r1 16 steps, r2 8 steps, r3 4 steps, r4 2 steps, r5 1 step).]
Distilling a teacher trajectory into a few-step student. (Top) the teacher reaches the data through many small denoising steps; the student is trained to make that whole jump in one step, matching the teacher's trajectory. (Bottom) a five-round curriculum halves the student's step count each round (16 → 8 → 4 → 2 → 1) while the teacher's 64-step budget (top bar) stays fixed, so each student step spans \(64/N\) teacher steps.

How to sample the time steps?

Both the student time steps and the teacher's intermediate sub-steps are sampled from the same logit-normal schedule used to train the teacher. This schedule places more probability mass near the noise end (\(t=0\)) and less near the data end (\(t=1\)).
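A logit-normal time step is the sigmoid of a Gaussian draw. The sketch below uses the shifted mean of −1.5 given for the density plot; the standard deviation of 1.0 is an assumption made for illustration, and the function name `sample_logit_normal` is hypothetical.

```python
import math
import random

def sample_logit_normal(rng, mean=-1.5, std=1.0):
    """Draw t in (0, 1) as sigmoid(u) with u ~ N(mean, std^2). std=1.0 is assumed."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
ts = [sample_logit_normal(rng) for _ in range(10000)]

# Fraction of draws in the noisier half of the interval (t < 0.5).
frac_noise_half = sum(t < 0.5 for t in ts) / len(ts)
print(f"fraction with t < 0.5: {frac_noise_half:.3f}")
```

A negative mean shifts most of the draws toward \(t=0\), which matches the emphasis on the noise end described above.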

[Figure: Logit-normal time-step density (shifted mean −1.5). Horizontal axis: t from 0 (noise) to 1 (data); vertical axis: density.]

4 Experiments

Setup

We evaluate model performance by generating 1,000 samples. We report generative perplexity (Gen. PPL, lower is better) under GPT-2-Large and mean unigram entropy (higher is better). The former measures sample quality, while the latter measures diversity. All experiments use the same SDE-inspired sampler, varying the number of sampling steps and the guidance strength (see setup below).
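As a rough illustration of the diversity metric, the sketch below computes the entropy of each sample's unigram distribution and averages over samples. The natural-log base and the use of generic token lists (rather than a specific tokenizer) are assumptions; the evaluation code may differ in both respects.

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Entropy (in nats, by assumption) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def mean_unigram_entropy(samples):
    """Average the per-sample unigram entropy over a list of token sequences."""
    return sum(unigram_entropy(s) for s in samples) / len(samples)

varied = ["a", "b", "c", "d"]      # every token distinct
collapsed = ["the", "the", "the"]  # a degenerate, repetitive sample
print(unigram_entropy(varied), unigram_entropy(collapsed))
print(mean_unigram_entropy([varied, collapsed]))
```

A sample that repeats a few tokens scores low, which is why a low entropy alongside a low perplexity signals degenerate output rather than good text.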

Full training & sampling setup

Training. We train on OpenWebText [6] (~9B tokens), packing sequences to length \(L=1024\), one epoch per distillation round. As in ELF, we use training-time classifier-free guidance with self-conditioning: a CFG scale is sampled per example, and the teacher and student are conditioned on the same self-conditioning CFG scale.

Sampling. We use an SDE-inspired sampler with the same logit-normal time schedule from training. For 1-, 2-, 4-, and 8-step generation, we set the noise re-injection scale to \(\gamma=1.5\) and the self-conditioning CFG scale to \(2.5\). For 16- and 32-step generation, we use \(\gamma=2.0\) and a self-conditioning CFG scale of \(2.0\).

How the SDE-inspired sampler works

The SDE-inspired sampler re-injects noise at each step: the current sample is mixed with fresh noise \(\boldsymbol{\epsilon}\) and its time step is moved back to an earlier, noisier time \(t_{\text{back}}=\alpha\,t_i\) with \(\alpha = 1-\gamma\,\Delta t\). The scale \(\gamma\) controls how much noise is re-injected; \(\gamma=0\) recovers the plain ODE (Euler) step. We use \(\gamma=1.5\) for 1–8 step generation and \(\gamma=2.0\) for 16- and 32-step generation. This introduces stochasticity into the sampling process and reduces error accumulation.
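To make the time bookkeeping concrete, here is a minimal sketch that evaluates \(t_{\text{back}}=\alpha\,t_i\) with \(\alpha = 1-\gamma\,\Delta t\) along a step grid. It assumes a uniform grid with \(\Delta t = 1/N\) purely for illustration (the sampler itself uses the logit-normal schedule), tracks times only, and leaves out the mixing of the sample with fresh noise. The function name `step_back_times` is hypothetical.

```python
def step_back_times(num_steps, gamma):
    """Return (t_i, t_back, t_next) per step, with t_back = (1 - gamma * dt) * t_i."""
    dt = 1.0 / num_steps  # uniform grid: an illustrative assumption
    alpha = 1.0 - gamma * dt
    rows = []
    for i in range(num_steps):
        t_i = i * dt
        rows.append((t_i, alpha * t_i, t_i + dt))
    return rows

for t_i, t_back, t_next in step_back_times(num_steps=4, gamma=1.5):
    print(f"t_i={t_i:.3f}  t_back={t_back:.3f}  t_next={t_next:.3f}")
```

Setting `gamma=0` gives `t_back == t_i` at every step, which corresponds to the plain ODE case mentioned above.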

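[Figure: time axis from 0 (noise) to 1 (data) with marks for t_i, t_back, and t_{i+1}; the step interval Δt and the step-back γΔt are indicated.]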
One step of the SDE-inspired sampler. Sampling integrates from noise (\(t=0\)) to data (\(t=1\)), advancing from \(t_i\) to \(t_{i+1}\) over an interval \(\Delta t\). Before denoising, the sampler re-injects noise and steps back from \(t_{i+1}\) by \(\gamma\,\Delta t\) to an earlier, noisier time \(t_{\text{back}}\) (the curved arrow), then denoises from there. The scale \(\gamma\) sets how much noise is re-injected; \(\gamma=0\) recovers a plain ODE step. We use \(\gamma=1.5\) for 1–8 step generation and \(\gamma=2.0\) for 16- and 32-step generation.

Results

[Figure: ELF+PD vs. distilled baselines: (a) perplexity vs. sampling steps, (b) training-token budget.]
ELF with progressive distillation (ELF+PD) vs. distilled DLM baselines on OpenWebText. (a) Generative perplexity against sampling steps: ELF+PD (orange) outperforms MDLM + SDTT, Duo + DCD, and FMLM at every setting from 1 to 32 steps. Per-point labels indicate entropy. (b) Estimated training tokens: ELF+PD uses only 90B tokens (2.0× the base model training), while the baselines require 550–577B training tokens (over 12×).

Few-step generation. Against representative distilled discrete baselines (MDLM [7] + SDTT [4], Duo [5] + DCD) and the continuous flow-matching model FMLM [8], ELF+PD achieves the lowest generative perplexity at every sampling budget while maintaining reasonable entropy. Benefiting from the data efficiency of ELF, ELF+PD also uses substantially fewer training tokens than the baselines.
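As a rough consistency check on the token budget, assuming each round's epoch is one pass over the ~9B-token corpus and that the 2.0× figure is relative to the base ELF training budget:

\[ 5 \times {\sim}9\text{B} \approx 45\text{B (distillation)}, \qquad \frac{90\text{B}}{2.0} = 45\text{B (implied base training)}, \qquad 45\text{B} + 45\text{B} = 90\text{B}. \]

Under the same reading, 550–577B tokens correspond to roughly 12.2–12.8× the implied base budget.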

Few-step generation: exact numbers
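| Steps | MDLM + SDTT PPL | MDLM + SDTT Entropy | Duo + DCD PPL | Duo + DCD Entropy | FMLM PPL | FMLM Entropy | ELF+PD (Ours) PPL | ELF+PD (Ours) Entropy |
|---|---|---|---|---|---|---|---|---|
| 1 | 1260.60 | 5.26 | 5743.90 | 6.02 | 168.30 | 5.17 | 136.10 | 5.26 |
| 2 | 877.22 | 5.34 | 891.16 | 5.41 | 133.29 | 5.25 | 68.25 | 5.24 |
| 4 | 339.73 | 5.38 | 250.86 | 5.37 | 111.31 | 5.26 | 34.33 | 5.16 |
| 8 | 112.66 | 5.41 | 118.21 | 5.41 | 86.50 | 5.36 | 23.18 | 5.07 |
| 16 | 57.74 | 5.39 | 78.74 | 5.43 | 63.63 | 5.29 | 22.12 | 5.06 |
| 32 | 40.41 | 5.34 | 63.98 | 5.40 | 45.09 | 5.25 | 21.32 | 5.04 |
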
Few-step unconditional generation on OpenWebText. We compare ELF+PD after five-round distillation with distilled discrete and continuous diffusion language model baselines.

Effect of the distillation curriculum. Early-round models perform well only at larger sampling budgets and collapse to degenerate outputs at small budgets, whereas later-round models substantially improve 1–4 step generation while maintaining reasonable entropy. After the final round, the one-step student achieves the best 1-, 2-, 4-, and 8-step generative perplexity among non-degenerate results (the lower 1-step perplexities of rounds r1–r3 come with collapsed entropy).

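| Round | 1 step PPL | 1 step Entropy | 2 steps PPL | 2 steps Entropy | 4 steps PPL | 4 steps Entropy | 8 steps PPL | 8 steps Entropy |
|---|---|---|---|---|---|---|---|---|
| r1 | 4.9 | 1.72* | 119.4 | 5.10 | 143.4 | 5.37 | 58.0 | 5.26 |
| r2 | 1.7 | 0.64* | 171.1 | 5.24 | 128.6 | 5.40 | 36.6 | 5.24 |
| r3 | 30.7 | 3.60* | 153.1 | 5.44 | 69.0 | 5.37 | 27.9 | 5.20 |
| r4 | 165.9 | 5.35 | 92.2 | 5.36 | 46.1 | 5.28 | 27.3 | 5.19 |
| r5 | 136.1 | 5.26 | 68.2 | 5.24 | 34.3 | 5.16 | 23.2 | 5.07 |
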
Results across distillation rounds and sampling steps. * indicates degenerate results (entropy below 5.0).

5 What the samples look like

Examples from the final one-step student, at each sampling budget using the same sampling configuration.

1 step · entropy 5.28 · Gen. PPL 132.89
Naturally Bloomberg has introduced a new Black Black S on the Laptop, which which the widely seems it it sound. — Ahan Geet (@Just AG) Uhr)) The The Black S Sport also well, Samsung has that it will likely to include its Hot Roller set.. Bull and want to are on the first Smartphone for A Mr Bols has came from his Apple web company ago ago ago ago, told my website earlier Tuesday that I will be waiting to see a new speedset soon. And he know if Apple will consider add with the mod and the software that the refurbished mini- Sculble. In the addition to this tablet, it offers a lot more quickly. Obviously, the first latest version of iPhone's the Sure Heartn't Version, although it's not considered the boosted design level of the OS many havesupposedly launched in Fia but, for However, there is unclear, but not unclear whether the is wills up for how the game will as soon as Software, translate. We'll a some people on to various pictures about, and I'll have a spring next week and watch a review on Samsung's iPhone Stream soon…
2 steps · entropy 5.47 · Gen. PPL 64.76
“I think that we have always had the responsibility to be able to make our environment more accessible to any users. WipWe have been truly one of the best gaming companies in the world. We’s been quite large in the years. Over the years, the company known, Wi-Breth has given them with a ideas on how to develop their own software. Despite the custom design design Dearms was unable to prove the company they wanted to become a portable programmeer, but then it took a lot of time that was quite ambitious and a little bit tedious. A littleerson, Vice manager of programming management at WipBreth and was someone who was helped with the development of a program known as OpenPreadrder. “We wanted to create a better, clean,, efficient solution, and then we could do it, that wasns not exactly what we wanted to,” Wiet said. “Our goal was to create a robust free loopor that allows live programs easily to your Windows devices. Now, this is a huge step because it’s fast and high-a programable, and because you don’t have to create full programs to all of your Windows devices, you definitely don’t have to beable to create everything for your Windows 10 devices, which you can’t really do. This will be a big step to your your own laptops, but it a lot more cleaner and more fun for the whole environment.” Conclusion: The entire version of Wi-Breth wastracted much attention by the developers and were hoping to develop their. decision: By the first time when after re got back into the project, they devised very OpenCrunter Engine and I am happy that they decided in they, they were able to create this tool. It is an extremely innovative, and would bring every functionality in the future. As very simple, it doesn’t have great a lot tweak support, but there is not much that can be required. WieBreal has incredible portability for use every any Windows Windows 10 devices. It works well with a large looper function, and can can be used all around the world in the future. 
“I think the project will have be completed but it’s not really a little difficult, at this point we hope to get it done again. We’ll have a to more more details in the future…
4 steps · entropy 5.32 · Gen. PPL 30.76
"I'm confident that I'll be the best player on the team, and I don't know how if you'll get him back as as long as he can. "He's amazing. I've scored one of 15 games of the season, and he played some games and did pretty well. I just take a lot of time to get back. He's a great young winger, and he doesn't have a lot of time left with him. He's got a great team, so he's a really good strikeer and he's got a lot of experience on team. But's squad is very good. He's a really, good guy." Spritley said that Reilly's being on the field this Saturday. "I don't know how he will be vs. Blues in Saturday, but I don't know. I don't know that. I just don't think he'll do anything. That's possible, though. I'm confident that." The U.S. Food and Drug Drug Agency is warning that there could be about a half billion people' use of eadain drug. The drug, known as Common Deperdomidicent's Disease (CCD), has become a popular of more than 1 million people over the past 10 years and has a range conditions such as hepat disease, cancer, obesity, and traumatic heart disease. However, eadain drug is already under development in many countries over the past decade. And, in a trend that's new, researchers are confident that eaedain drug can potentially create danger to many organisms and other medical diseases, such as heart cancer. In a study estimate, about 50% of people using eaddain drug comes from other parts of the human body. Investigators believe eadain drug is that could reduce danger to life, but is not expected to lead to death. "We're expecting an up to two billion billion people of use over the next 10 or years,"…
8 steps · entropy 5.24 · Gen. PPL 19.22
Now let us take an deeper look at the decline in basic education programs over the last couple of decades. During the 1990s, when the basic industry was weakening, low-income people were declining. They didn’t have a good chance to learn it. But by the 1990s, basic education was weaker—low-income people were very lacking of money. They didn’t have the resources to understand the fundamental education program, which is a lack of adequate understanding of the program. What we’ve seen is the fact that the program is a political problem. The poor understanding of the fundamental program is a political problem for the low-income people. The basic program isn’t a politically political problem. The problem is that many people that education isn’t a political problem. It is not a political policy problem. They’re far less aware of how the fundamental program is being conducted. As we’ve seen, low-income people don’t get access to basic education. They don’t have enough scientific expertise and have enough of political experience. I’m a professor of politics in Minnesota and I am a professor of political science. I also talk about the fact that the majority of low-income people (65%) don’t have the adequate money to learn to understand the basic program—which, I know, hasn’t happened. We need to become more aware that basic education for low-income people is declining. This is really an important part of making sure we have more understanding what’s behind these declines. When it comes to basic education, you should become more aware that the program isn’t based on knowledge of what’s happening and how it’s happening, so you can move on step further and get a little bit more sleep…

References

  1. K. Hu, L. Qiu, et al. ELF: Embedded Language Flows. arXiv:2605.10938.
  2. T. Salimans & J. Ho. Progressive Distillation for Fast Sampling of Diffusion Models. ICLR 2022. arXiv:2202.00512.
  3. T. Li et al. Back to Basics: Let Denoising Generative Models Denoise. arXiv:2511.13720.
  4. J. Deschenaux & C. Gulcehre. Beyond Autoregression: Fast LLMs via Self-Distillation Through Time (SDTT). ICLR 2025. arXiv:2410.21035.
  5. S. Sahoo et al. The Diffusion Duality (Duo). ICML 2025. arXiv:2506.10892.
  6. A. Gokaslan & V. Cohen. OpenWebText Corpus. skylion007.github.io/OpenWebTextCorpus, 2019.
  7. S. Sahoo et al. Simple and Effective Masked Diffusion Language Models (MDLM). NeurIPS 2024. arXiv:2406.07524.
  8. Lee et al. Flow Map Language Models: One-step Language Modeling via Continuous Denoising (FMLM). arXiv:2602.16813.