F2D2: Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

1Carnegie Mellon University, 2Peking University

*Equal contribution    Work done at CMU

ICLR 2026

Flow-based generative models need 100–1000 NFEs just to compute a single likelihood, making downstream applications prohibitively expensive. F2D2 cuts this to 1–8 NFEs for sampling and likelihood evaluation simultaneously. As a result, a few-step model with self-guidance outperforms a 1024-step baseline.

Abstract

Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models – diffusion and flow-based models – still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single flow map. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's ability to achieve accurate log-likelihoods with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow to outperform a 1024-step flow matching model with only a single additional backward NFE.

Method

In continuous normalizing flows (CNFs), sampling and likelihood computation are governed by two coupled ODEs that share the same underlying velocity field vθ:

  • Sampling: dxt/dt = vθ(xt, t)
  • Likelihood: d(log pt)/dt = −div(vθ(xt, t))
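To make the coupling concrete, here is a minimal sketch of Euler integration of the two ODEs. It assumes a toy linear field v(x, t) = −x with an analytically known divergence as a stand-in for a trained network; `velocity`, `divergence`, and `log_likelihood` are illustrative names, not the paper's API. The point is that every step of the likelihood integral costs one evaluation of both quantities, which is why naive likelihood computation needs hundreds of NFEs.

```python
import math

D = 2  # data dimension of the toy example

def velocity(x, t):
    # Toy linear field v(x, t) = -x, a stand-in for a trained model.
    return [-xi for xi in x]

def divergence(x, t):
    # Exact divergence of the linear field: div(v) = -D.
    return -float(D)

def log_prior(x):
    # Standard-normal prior density at t = 1.
    return -0.5 * sum(xi * xi for xi in x) - 0.5 * D * math.log(2 * math.pi)

def log_likelihood(x, n_steps=1000):
    """Euler-integrate dx/dt = v and d(log p)/dt = -div(v) from data
    (t = 0) to the prior (t = 1); one NFE per step, so an accurate
    likelihood needs many steps."""
    z, dt = 0.0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        z += divergence(x, t) * dt  # accumulates int_0^1 div(v) dt, so that
        x = [xi + vi * dt for xi, vi in zip(x, velocity(x, t))]
    return log_prior(x) + z         # log p_0(x_0) = log p_1(x_1) + int div dt
```

With the toy field the many-step estimate matches the closed-form pushforward density; a real model replaces `velocity` and `divergence` with network evaluations, each costing one NFE.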

This shared structure means both can be distilled together. F2D2 learns a joint flow map ΦY over the joint state yt = (xt, zt), with a linear parametrization for each subsystem: ΦX(x̂t, t, s) = x̂t + (s−t) uθ(x̂t, t, s) for sampling, and ΦZ(xt, t, s) = zt + (s−t) Dθ(xt, t, s) for the cumulative log-likelihood. A single shared backbone fθ outputs two heads — uθ for the average sampling velocity and Dθ for the average divergence — trained with four complementary losses:

  • LVM (velocity matching): the standard flow matching loss or teacher velocity-matching loss, which enforces that the flow map recovers the instantaneous velocity on the diagonal: uθ(x, t, t) = v(x, t).
  • Lu (sampling flow map condition): enforces one of the three flow map conditions (semigroup, Eulerian, or Lagrangian) on the sampling head, depending on the instantiation.
  • Ldiv (divergence matching): mirrors LVM for the divergence head, enforcing that it recovers the instantaneous divergence on the diagonal: Dθ(x, t, t) = div(uθ(x, t, t)).
  • LD (likelihood flow map condition): mirrors Lu for the divergence head, ensuring the joint map ΦY satisfies the requisite flow map conditions.
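As a hedged sketch of what these losses enforce, the toy below uses a linear field v(x, t) = −x whose exact flow map is known in closed form. The `u_head` and `d_head` functions play the role of trained heads, and the semigroup variant stands in for Lu; all names are illustrative, not the paper's implementation.

```python
import math

D = 2  # dimension of the toy example

def teacher_v(x, t):      # toy teacher velocity field v(x, t) = -x
    return [-xi for xi in x]

def teacher_div(x, t):    # its exact divergence
    return -float(D)

# Closed-form "student" heads for this toy (the s -> t limit handled
# explicitly); a trained network would only approximate these.
def u_head(x, t, s):
    c = -1.0 if s == t else (math.exp(-(s - t)) - 1.0) / (s - t)
    return [c * xi for xi in x]

def d_head(x, t, s):
    return -float(D)      # average divergence over [t, s]

def phi_x(x, t, s):       # linear flow-map parametrization
    return [xi + (s - t) * ui for xi, ui in zip(x, u_head(x, t, s))]

def loss_vm(x, t):
    """L_VM: on the diagonal, u(x, t, t) = v(x, t)."""
    return sum((u - v) ** 2 for u, v in zip(u_head(x, t, t), teacher_v(x, t)))

def loss_div(x, t):
    """L_div: on the diagonal, D(x, t, t) = div(v(x, t))."""
    return (d_head(x, t, t) - teacher_div(x, t)) ** 2

def loss_semigroup(x, t, r, s):
    """L_u (semigroup form): Phi(., r, s) o Phi(., t, r) = Phi(., t, s)."""
    two_hop = phi_x(phi_x(x, t, r), r, s)
    one_hop = phi_x(x, t, s)
    return sum((a - b) ** 2 for a, b in zip(two_hop, one_hop))
```

Because the heads here are exact, all three losses evaluate to (numerically) zero; during training they are minimized over sampled x, t, r, s.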

F2D2 is modular: it extends any existing few-step flow map model by adding only a lightweight divergence head. We provide instantiations for all three flow map conditions:

  • Shortcut-F2D2 — built on Shortcut Models, using the semigroup property to enforce consistency across time intervals.
  • MeanFlow-F2D2 — built on MeanFlow, leveraging the MeanFlow identity, which solves the Eulerian equation.
  • LSD-F2D2 — built on Lagrangian Self-Distillation (LSD), using the Lagrangian equation.
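Putting the pieces together, here is a sketch of few-step joint evaluation under the same kind of toy assumption (closed-form heads for the linear field v(x, t) = −x; `joint_few_step_nll` is an illustrative name, not the paper's API). One application of the joint map already reproduces the likelihood that the undistilled ODE needs hundreds of Euler steps to approximate.

```python
import math

D = 2  # dimension of the toy example

# Toy closed-form joint flow map for v(x, t) = -x, standing in for the
# trained heads u_theta and D_theta.
def u_head(x, t, s):
    c = -1.0 if s == t else (math.exp(-(s - t)) - 1.0) / (s - t)
    return [c * xi for xi in x]

def d_head(x, t, s):
    return -float(D)  # average divergence over [t, s]

def log_prior(x):
    return -0.5 * sum(xi * xi for xi in x) - 0.5 * D * math.log(2 * math.pi)

def joint_few_step_nll(x, n_steps=1):
    """Jump the joint state y = (x, z) with the linear flow map:
    x <- x + (s - t) * u(x, t, s),  z <- z + (s - t) * D(x, t, s)."""
    z = 0.0
    for k in range(n_steps):
        t, s = k / n_steps, (k + 1) / n_steps
        z += (s - t) * d_head(x, t, s)
        x = [xi + (s - t) * ui for xi, ui in zip(x, u_head(x, t, s))]
    return -(log_prior(x) + z)  # NLL in nats
```

In this toy the 1-step and 4-step results agree exactly because the heads satisfy the semigroup property; a trained model satisfies it only approximately, which is what the consistency losses control.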

Sample Quality

CIFAR-10 sample quality comparison

CIFAR-10 samples across 1–8 NFEs. F2D2 variants maintain visual quality at all step counts, while baselines degrade substantially with fewer steps.

Quantitative Results

CIFAR-10

NLL (in BPD) and FID on the CIFAR-10 dataset with different numbers of Euler steps. The flow matching model here, which reaches an NLL of 3.12 BPD at 1024 steps and an FID of 2.60 at 200 steps, is also the teacher we use for Shortcut-Distill. For NLL, closer to the teacher's 3.12 BPD is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -9.93 / 20.63 | -24.01 / 64.27 | -52.85 / 146.24 | -111.19 / 313.54
Shortcut Model | -12.07 / 7.10 | -28.03 / 9.63 | -60.01 / 16.04 | -124.15 / 27.28
Shortcut-Distill (Ours) | -11.42 / 5.01 | -26.82 / 5.41 | -57.72 / 7.13 | -119.42 / 12.75
MeanFlow | -9.00 / 4.34 | -21.26 / 5.14 | -46.63 / 2.84 | -97.59 / 2.80
Shortcut-F2D2 (Ours) | 3.07 / 8.78 | 3.26 / 10.21 | 2.73 / 15.58 | 0.20 / 27.35
Shortcut-Distill-F2D2 (Ours) | 3.12 / 5.68 | 2.87 / 5.96 | 2.38 / 7.35 | 1.62 / 13.76
MeanFlow-F2D2 (Ours) | 2.38 / 3.78 | 1.34 / 4.37 | 1.63 / 2.59 | 3.51 / 3.02

ImageNet 64×64

Negative log-likelihood (NLL, in BPD) and FID on the ImageNet 64×64 dataset with different numbers of Euler steps. The flow matching model here, which reaches an NLL of 3.34 BPD at 1024 steps and an FID of 13.09 at 200 steps, is also the teacher we use for Shortcut-Distill. For NLL, closer to the teacher's 3.34 BPD is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -6.41 / 31.60 | -15.87 / 68.55 | -35.23 / 170.00 | -74.54 / 363.39
Shortcut-Distill (Ours) | -9.03 / 19.47 | -22.30 / 21.73 | -49.01 / 28.12 | -102.07 / 42.72
Shortcut-Distill-F2D2 (Ours) | 3.51 / 21.91 | 3.94 / 24.05 | 3.97 / 29.83 | 1.54 / 44.02

CelebA-64

Negative log-likelihood (NLL, in BPD) and FID on the CelebA-64 dataset with different numbers of Euler steps. The flow matching model here reaches 1.75 BPD at 1024 steps and an FID of 2.48 at 200 steps. For NLL, closer to the flow matching estimate is better; for FID, lower is better. Large negative NLL values are invalid likelihood predictions.

Method | 8 Steps NLL / FID | 4 Steps NLL / FID | 2 Steps NLL / FID | 1 Step NLL / FID
Flow Matching | -6.88 / 30.60 | -16.39 / 58.14 | -36.46 / 120.65 | -77.51 / 181.23
LSD | -6.78 / 3.33 | -14.89 / 4.04 | -32.72 / 6.32 | -69.83 / 12.96
LSD-F2D2 (Ours) | 1.64 / 2.41 | 1.75 / 2.75 | 1.73 / 3.86 | 1.64 / 6.94

2D Density Estimation

2D checkerboard density estimation

Density estimation on a 2D checkerboard. LSD-F2D2 accurately recovers the target density distribution with only 1 NFE, preserving spatial structure and density values.

Application: Self-Guidance

Efficient likelihood computation unlocks new inference-time strategies. We propose a lightweight self-guidance method that optimizes the initial noise x0 to maximize likelihood before running sampling — requiring only a single additional backward pass through the divergence head. This guidance dramatically improves sample quality at no architectural cost.
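A minimal sketch of the idea (the name `self_guidance_step` is hypothetical, and finite differences stand in for the single backward pass through the divergence head; the actual method differentiates the few-step likelihood directly):

```python
def self_guidance_step(x0, log_likelihood, lr=0.5, eps=1e-4):
    """One gradient-ascent step on the model log-likelihood as a function
    of the initial noise. Hypothetical sketch: central finite differences
    replace the backward pass so the example stays self-contained."""
    grad = []
    for i in range(len(x0)):
        up, down = list(x0), list(x0)
        up[i] += eps
        down[i] -= eps
        grad.append((log_likelihood(up) - log_likelihood(down)) / (2 * eps))
    # Move the noise toward higher model likelihood before sampling.
    return [xi + lr * gi for xi, gi in zip(x0, grad)]
```

Sampling then proceeds from the adjusted noise with the usual few-step flow map, so the only overhead is the gradient computation itself.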

Key result: a 2-step MeanFlow-F2D2 model with self-guidance surpasses a 1024-step standard flow matching model on CIFAR-10, using orders of magnitude fewer NFEs.

Self-guidance results

FID vs. NFEs for MeanFlow-F2D2 with and without self-guidance. Guidance consistently improves quality, and the 2-step guided model outperforms the 1024-step baseline.

BibTeX


@inproceedings{ai2026joint,
  title={Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models},
  author={Xinyue Ai and Yutong He and Albert Gu and Ruslan Salakhutdinov and J Zico Kolter and Nicholas Matthew Boffi and Max Simchowitz},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=8uZ5UdIul2}
}