F2D2: Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

1Carnegie Mellon University, 2Peking University

*Equal contribution    Work done at CMU

ICLR 2026

Flow-based generative models need 100–1000 NFEs just to compute a single likelihood, making downstream applications prohibitively expensive. F2D2 cuts this to 1–8 NFEs for sampling and likelihood evaluation simultaneously. As a result, a few-step model with self-guidance outperforms a 1024-step baseline.

Abstract

Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models – diffusion and flow-based models – still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single flow map. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's ability to achieve accurate log-likelihoods with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow to outperform a 1024-step flow matching model with only a single additional backward NFE.

Method

In continuous normalizing flows (CNFs), sampling and likelihood computation are governed by two coupled ODEs that share the same underlying velocity field vθ:

  • Sampling: dxt/dt = vθ(xt, t)
  • Likelihood: d(log pt)/dt = −div(vθ(xt, t))
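To make the coupling concrete, here is a minimal sketch of Euler integration of the two ODEs. It assumes a toy linear field v(x, t) = −x with an analytically known divergence as a stand-in for a trained network; `velocity`, `divergence`, and `log_likelihood` are illustrative names, not the paper's API. The point is that every step of the likelihood integral costs one evaluation of both quantities, which is why naive likelihood computation needs hundreds of NFEs.

```python
import math

D = 2  # data dimension of the toy example

def velocity(x, t):
    # Toy linear field v(x, t) = -x, a stand-in for a trained model.
    return [-xi for xi in x]

def divergence(x, t):
    # Exact divergence of the linear field: div(v) = -D.
    return -float(D)

def log_prior(x):
    # Standard-normal prior density at t = 1.
    return -0.5 * sum(xi * xi for xi in x) - 0.5 * D * math.log(2 * math.pi)

def log_likelihood(x, n_steps=1000):
    """Euler-integrate dx/dt = v and d(log p)/dt = -div(v) from data
    (t = 0) to the prior (t = 1); one NFE per step, so an accurate
    likelihood needs many steps."""
    z, dt = 0.0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        z += divergence(x, t) * dt  # accumulates int_0^1 div(v) dt, so that
        x = [xi + vi * dt for xi, vi in zip(x, velocity(x, t))]
    return log_prior(x) + z         # log p_0(x_0) = log p_1(x_1) + int div dt
```

With the toy field the many-step estimate matches the closed-form pushforward density; a real model replaces `velocity` and `divergence` with network evaluations, each costing one NFE.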

This shared structure means both can be distilled together. F2D2 learns a joint flow map ΦY over the joint state yt = (xt, zt), with a linear parametrization for each subsystem: ΦX(x̂t, t, s) = x̂t + (s−t) uθ(x̂t, t, s) for sampling, and ΦZ(xt, t, s) = zt + (s−t) Dθ(xt, t, s) for the cumulative log-likelihood. A single shared backbone fθ outputs two heads — uθ for the average sampling velocity and Dθ for the average divergence — trained with four complementary losses:

  • LVM (velocity matching): the standard flow matching loss or teacher velocity-matching loss, which enforces that the flow map recovers the instantaneous velocity on the diagonal: uθ(x, t, t) = v(x, t).
  • Lu (sampling flow map condition): enforces one of the three flow map conditions (semigroup, Eulerian, or Lagrangian) on the sampling head, depending on the instantiation.
  • Ldiv (divergence matching): mirrors LVM for the divergence head, enforcing that it recovers the instantaneous divergence on the diagonal: Dθ(x, t, t) = div(uθ(x, t, t)).
  • LD (likelihood flow map condition): mirrors Lu for the divergence head, ensuring the joint map ΦY satisfies the requisite flow map conditions.
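As a hedged sketch of what these losses enforce, the toy below uses a linear field v(x, t) = −x whose exact flow map is known in closed form. The `u_head` and `d_head` functions play the role of trained heads, and the semigroup variant stands in for Lu; all names are illustrative, not the paper's implementation.

```python
import math

D = 2  # dimension of the toy example

def teacher_v(x, t):      # toy teacher velocity field v(x, t) = -x
    return [-xi for xi in x]

def teacher_div(x, t):    # its exact divergence
    return -float(D)

# Closed-form "student" heads for this toy (the s -> t limit handled
# explicitly); a trained network would only approximate these.
def u_head(x, t, s):
    c = -1.0 if s == t else (math.exp(-(s - t)) - 1.0) / (s - t)
    return [c * xi for xi in x]

def d_head(x, t, s):
    return -float(D)      # average divergence over [t, s]

def phi_x(x, t, s):       # linear flow-map parametrization
    return [xi + (s - t) * ui for xi, ui in zip(x, u_head(x, t, s))]

def loss_vm(x, t):
    """L_VM: on the diagonal, u(x, t, t) = v(x, t)."""
    return sum((u - v) ** 2 for u, v in zip(u_head(x, t, t), teacher_v(x, t)))

def loss_div(x, t):
    """L_div: on the diagonal, D(x, t, t) = div(v(x, t))."""
    return (d_head(x, t, t) - teacher_div(x, t)) ** 2

def loss_semigroup(x, t, r, s):
    """L_u (semigroup form): Phi(., r, s) o Phi(., t, r) = Phi(., t, s)."""
    two_hop = phi_x(phi_x(x, t, r), r, s)
    one_hop = phi_x(x, t, s)
    return sum((a - b) ** 2 for a, b in zip(two_hop, one_hop))
```

Because the heads here are exact, all three losses evaluate to (numerically) zero; during training they are minimized over sampled x, t, r, s.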

F2D2 is modular: it extends any existing few-step flow map model by adding only a lightweight divergence head. We provide instantiations for all three flow map conditions:

  • Shortcut-F2D2 — built on Shortcut Models, using the semigroup property to enforce consistency across time intervals.
  • MeanFlow-F2D2 — built on MeanFlow, leveraging the MeanFlow identity, which solves the Eulerian equation.
  • LSD-F2D2 — built on Lagrangian Self-Distillation (LSD), using the Lagrangian equation.
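Putting the pieces together, here is a sketch of few-step joint evaluation under the same kind of toy assumption (closed-form heads for the linear field v(x, t) = −x; `joint_few_step_nll` is an illustrative name, not the paper's API). One application of the joint map already reproduces the likelihood that the undistilled ODE needs hundreds of Euler steps to approximate.

```python
import math

D = 2  # dimension of the toy example

# Toy closed-form joint flow map for v(x, t) = -x, standing in for the
# trained heads u_theta and D_theta.
def u_head(x, t, s):
    c = -1.0 if s == t else (math.exp(-(s - t)) - 1.0) / (s - t)
    return [c * xi for xi in x]

def d_head(x, t, s):
    return -float(D)  # average divergence over [t, s]

def log_prior(x):
    return -0.5 * sum(xi * xi for xi in x) - 0.5 * D * math.log(2 * math.pi)

def joint_few_step_nll(x, n_steps=1):
    """Jump the joint state y = (x, z) with the linear flow map:
    x <- x + (s - t) * u(x, t, s),  z <- z + (s - t) * D(x, t, s)."""
    z = 0.0
    for k in range(n_steps):
        t, s = k / n_steps, (k + 1) / n_steps
        z += (s - t) * d_head(x, t, s)
        x = [xi + (s - t) * ui for xi, ui in zip(x, u_head(x, t, s))]
    return -(log_prior(x) + z)  # NLL in nats
```

In this toy the 1-step and 4-step results agree exactly because the heads satisfy the semigroup property; a trained model satisfies it only approximately, which is what the consistency losses control.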

Sample Quality

CIFAR-10 sample quality comparison

CIFAR-10 samples across 1–8 NFEs. F2D2 variants maintain visual quality at all step counts, while baselines degrade substantially with fewer steps.

Quantitative Results

CIFAR-10

NLL (in BPD) and FID on the CIFAR-10 dataset with different numbers of Euler steps. The flow matching model here, which reaches an NLL of 3.12 BPD at 1024 steps and an FID of 2.60 at 200 steps, is also the teacher we use for Shortcut-Distill. For NLL, closer to the teacher's 3.12 BPD is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -9.93 / 20.63 | -24.01 / 64.27 | -52.85 / 146.24 | -111.19 / 313.54
Shortcut Model | -12.07 / 7.10 | -28.03 / 9.63 | -60.01 / 16.04 | -124.15 / 27.28
Shortcut-Distill (Ours) | -11.42 / 5.01 | -26.82 / 5.41 | -57.72 / 7.13 | -119.42 / 12.75
MeanFlow | -9.00 / 4.34 | -21.26 / 5.14 | -46.63 / 2.84 | -97.59 / 2.80
Shortcut-F2D2 (Ours) | 3.07 / 8.78 | 3.26 / 10.21 | 2.73 / 15.58 | 0.20 / 27.35
Shortcut-Distill-F2D2 (Ours) | 3.12 / 5.68 | 2.87 / 5.96 | 2.38 / 7.35 | 1.62 / 13.76
MeanFlow-F2D2 (Ours) | 2.38 / 3.78 | 1.34 / 4.37 | 1.63 / 2.59 | 3.51 / 3.02

ImageNet 64×64

Negative log-likelihood (NLL, in BPD) and FID on the ImageNet 64×64 dataset with different numbers of Euler steps. The flow matching model here, which reaches an NLL of 3.34 BPD at 1024 steps and an FID of 13.09 at 200 steps, is also the teacher we use for Shortcut-Distill. For NLL, closer to the teacher's 3.34 BPD is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -6.41 / 31.60 | -15.87 / 68.55 | -35.23 / 170.00 | -74.54 / 363.39
Shortcut-Distill (Ours) | -9.03 / 19.47 | -22.30 / 21.73 | -49.01 / 28.12 | -102.07 / 42.72
Shortcut-Distill-F2D2 (Ours) | 3.51 / 21.91 | 3.94 / 24.05 | 3.97 / 29.83 | 1.54 / 44.02

CelebA-64

Negative log-likelihood (NLL, in BPD) and FID on the CelebA-64 dataset with different numbers of Euler steps. The flow matching model here reaches 1.75 BPD at 1024 steps and an FID of 2.48 at 200 steps. For NLL, closer to the flow matching estimate is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -6.88 / 30.60 | -16.39 / 58.14 | -36.46 / 120.65 | -77.51 / 181.23
LSD | -6.78 / 3.33 | -14.89 / 4.04 | -32.72 / 6.32 | -69.83 / 12.96
LSD-F2D2 (Ours) | 1.64 / 2.41 | 1.75 / 2.75 | 1.73 / 3.86 | 1.64 / 6.94

2D Density Estimation

2D checkerboard density estimation

Density estimation on a 2D checkerboard. LSD-F2D2 accurately recovers the target density distribution with only 1 NFE, preserving spatial structure and density values.

Application: Self-Guidance

Efficient likelihood computation unlocks new inference-time strategies. We propose a lightweight self-guidance method that optimizes the initial noise x0 to maximize likelihood before running sampling — requiring only a single additional backward pass through the divergence head. This guidance dramatically improves sample quality at no architectural cost.
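A minimal sketch of the idea (the name `self_guidance_step` is hypothetical, and finite differences stand in for the single backward pass through the divergence head; the actual method differentiates the few-step likelihood directly):

```python
def self_guidance_step(x0, log_likelihood, lr=0.5, eps=1e-4):
    """One gradient-ascent step on the model log-likelihood as a function
    of the initial noise. Hypothetical sketch: central finite differences
    replace the backward pass so the example stays self-contained."""
    grad = []
    for i in range(len(x0)):
        up, down = list(x0), list(x0)
        up[i] += eps
        down[i] -= eps
        grad.append((log_likelihood(up) - log_likelihood(down)) / (2 * eps))
    # Move the noise toward higher model likelihood before sampling.
    return [xi + lr * gi for xi, gi in zip(x0, grad)]
```

Sampling then proceeds from the adjusted noise with the usual few-step flow map, so the only overhead is the gradient computation itself.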

Key result: a 2-step MeanFlow-F2D2 model with self-guidance surpasses a 1024-step standard flow matching model on CIFAR-10, using orders of magnitude fewer NFEs.

Self-guidance results

FID vs. NFEs for MeanFlow-F2D2 with and without self-guidance. Guidance consistently improves quality, and the 2-step guided model outperforms the 1024-step baseline.

BibTeX


@inproceedings{ai2026joint,
  title={Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models},
  author={Xinyue Ai and Yutong He and Albert Gu and Ruslan Salakhutdinov and J Zico Kolter and Nicholas Matthew Boffi and Max Simchowitz},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=8uZ5UdIul2}
}