Stage-aware multi-model sampling for fast video generation

FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Large model where capacity matters (early/late), small model where it does not (middle). Up to 1.65× faster inference with 57.35% fewer FLOPs while preserving fidelity and temporal coherence.

Yonsei University

Overview

FlowBlending allocates model capacity by stage. It uses the large model to establish global semantics early and to refine details late, while delegating the intermediate steps, where the velocity divergence is minimal, to a smaller model.
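The allocation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a flow-based sampler integrated with plain Euler steps from t = 0 to t = 1, and the names `stage_schedule` and `blended_euler_sample`, the time convention, and the list-of-floats state are all hypothetical. The default boundaries (large model for the first 40% and the last 20% of steps) follow the settings marked OURS in the supplementary ablations.

```python
from typing import Callable, List

Vector = List[float]
# A velocity model maps (state, time) to a velocity of the same length (assumed interface).
VelocityFn = Callable[[Vector, float], Vector]


def stage_schedule(num_steps: int, early_frac: float = 0.4, late_frac: float = 0.2) -> List[str]:
    """Label each sampling step 'L' (large model) or 'S' (small model)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if early_frac < 0 or late_frac < 0 or early_frac + late_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n_early = round(num_steps * early_frac)
    # Clamp so that rounding can never produce more labels than steps.
    n_late = min(round(num_steps * late_frac), num_steps - n_early)
    n_mid = num_steps - n_early - n_late
    return ["L"] * n_early + ["S"] * n_mid + ["L"] * n_late


def blended_euler_sample(x0: Vector, large: VelocityFn, small: VelocityFn,
                         schedule: List[str]) -> Vector:
    """Euler-integrate dx/dt = v(x, t), choosing the velocity model per step."""
    x = list(x0)
    dt = 1.0 / len(schedule)
    for i, label in enumerate(schedule):
        t = i * dt
        model = large if label == "L" else small
        v = model(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print("".join(stage_schedule(10)))
```

With ten steps and the default fractions, the schedule is four large-model steps, four small-model steps, and two large-model steps, which corresponds to the LSL pattern used throughout this page.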

Abstract

In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage.

Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions.
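To make the velocity-divergence proxy concrete, the sketch below measures how far apart the two models' velocity predictions are at each step. The page does not state the exact metric or which trajectory is followed, so the choices here are assumptions: mean squared difference as the distance, the large model's Euler trajectory as the reference path, and a fixed threshold for flagging steps. All function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def velocity_divergence(v_large: Vector, v_small: Vector) -> float:
    """Mean squared difference between two velocity predictions (assumed metric)."""
    if not v_large or len(v_large) != len(v_small):
        raise ValueError("velocities must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(v_large, v_small)) / len(v_large)


def divergence_profile(x0: Vector, large: VelocityFn, small: VelocityFn,
                       num_steps: int) -> List[float]:
    """Follow the large model's Euler trajectory and record the divergence per step."""
    x = list(x0)
    dt = 1.0 / num_steps
    profile = []
    for i in range(num_steps):
        t = i * dt
        v_l = large(x, t)
        v_s = small(x, t)  # evaluated at the same state and time as the large model
        profile.append(velocity_divergence(v_l, v_s))
        x = [xi + dt * vi for xi, vi in zip(x, v_l)]
    return profile


def low_divergence_steps(profile: List[float], threshold: float) -> List[int]:
    """Indices of steps where the small model is a candidate replacement."""
    return [i for i, d in enumerate(profile) if d <= threshold]
```

Under these assumptions, a contiguous run of low-divergence indices in the middle of the profile would suggest where the small model can take over, and the ends of that run would be candidate stage boundaries.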

Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65× faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling additional speedup.

Qualitative Results

Videos are generated from the same initial noise and text prompt under different model allocation strategies. FlowBlending uses the large model in the early/late stages and the small model in the intermediate stage. FlowBlending produces results that are visually comparable to the large-only baseline despite being faster.

Prompt: Cinematic shot of Van Gogh's selfie, Van Gogh style.

Small model: 3129 TFLOPs
Large model: 29950 TFLOPs
FlowBlending: 19222 TFLOPs

Prompt: A teddy bear washing the dishes.

Small model: 3129 TFLOPs
Large model: 29950 TFLOPs
FlowBlending: 19222 TFLOPs

Prompt: A polar bear is playing guitar.

Small model: 3129 TFLOPs
Large model: 29950 TFLOPs
FlowBlending: 19222 TFLOPs

Schedule Comparison

Early-Stage Comparison (LLL vs. LSS vs. SSS vs. SLL)

LSS (large only in early steps) closely matches LLL (large-only) in structure and motion, while SSS (small-only) exhibits temporal inconsistency and semantic misalignment. SLL (small only in early steps) likewise produces structure and motion patterns highly similar to SSS. This shows that the early stages are crucial for establishing global semantic and structural attributes.

Prompt: A jellyfish floating through the ocean, with bioluminescent tentacles.

LLL: 29950 TFLOPs
LSS: 13858 TFLOPs
SSS: 3129 TFLOPs
SLL: 24586 TFLOPs

Late-Stage Comparison (LLL vs. LSS vs. LSL)

LSS preserves global structure similar to LLL but exhibits some artifacts. Reintroducing the large model only during the late stage (LSL) restores detail and reduces flicker, demonstrating that the late denoising stage is capacity-sensitive. Notably, the LSL schedule attains quality nearly indistinguishable from LLL while retaining the efficiency benefits of using the small model for most of the trajectory. Please zoom in to view the figures in detail.

Prompt: an elephant spraying itself with water using its trunk to cool down

LLL: 29950 TFLOPs
LSS: 13858 TFLOPs
LSL: 19222 TFLOPs
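
The FLOPs figures listed for these schedules are consistent with a simple linear blend of the large-only and small-only totals. The sketch below checks that arithmetic under one assumption: every step of a given model costs the same, so a schedule's cost depends only on the fraction of steps run by the large model. The fractions 0.4 (LSS) and 0.6 (LSL) follow the 40% and 20% boundaries marked OURS in the supplementary ablations; the 0.8 used for SLL is inferred from the reported number and is not stated on this page. The function names are hypothetical.

```python
def blended_tflops(large_total: float, small_total: float, large_fraction: float) -> float:
    """Estimate a schedule's cost, assuming uniform per-step cost for each model."""
    if not 0.0 <= large_fraction <= 1.0:
        raise ValueError("large_fraction must be in [0, 1]")
    return large_fraction * large_total + (1.0 - large_fraction) * small_total


def flops_reduction(baseline: float, candidate: float) -> float:
    """Fraction of FLOPs saved relative to the baseline."""
    return 1.0 - candidate / baseline


LLL_TFLOPS = 29950.0  # large-only total reported above
SSS_TFLOPS = 3129.0   # small-only total reported above

lss = blended_tflops(LLL_TFLOPS, SSS_TFLOPS, 0.4)  # about 13857.4; reported 13858
lsl = blended_tflops(LLL_TFLOPS, SSS_TFLOPS, 0.6)  # about 19221.6; reported 19222
sll = blended_tflops(LLL_TFLOPS, SSS_TFLOPS, 0.8)  # about 24585.8; reported 24586

if __name__ == "__main__":
    print(f"LSS {lss:.1f}  LSL {lsl:.1f}  SLL {sll:.1f}")
    print(f"LSL saves {flops_reduction(LLL_TFLOPS, 19222.0):.1%} versus LLL")
```

For the figures in this comparison, LSL uses roughly 35.8% fewer FLOPs than LLL. That number applies to this listed configuration only.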

Compatibility

FlowBlending schedules remain compatible with existing acceleration techniques. Below we show examples with a DPM++ solver and a step-distilled small model.

DPM++ Solver

Ours (LSL) is compatible with DPM++ solvers, producing videos similar to those generated with only the large model (LLL). Please zoom in to view the figures in detail.

Prompt: … a person's hand holding a fork and cutting into a pastry …

LLL: 1748 TFLOPs
LSL: 928 TFLOPs
SSS: 257 TFLOPs

Prompt: … a close-up of a grill with a piece of salmon on it …

LLL: 1748 TFLOPs
LSL: 928 TFLOPs
SSS: 257 TFLOPs

Step Distillation Model

A step-distilled small model (D) can replace the original small model (S). As with the original small model, LDL reproduces the results of LLL, whereas DDD does not. Please zoom in to view the figures in detail.

Prompt: … people walking on a frozen lake, pulling a boat …

LLL: 3496 TFLOPs
LDL: 1774 TFLOPs
DDD: 51 TFLOPs

Prompt: … a person sprinkling flour on a ball of dough …

LLL: 3496 TFLOPs
LDL: 1774 TFLOPs
DDD: 51 TFLOPs

Supplementary

Early-stage boundary selection ablation

Small-model introduction point (%). We vary the point at which the small model is introduced. When the small model is introduced at 60% or 40%, LSS remains visually identical to LLL, but at 20% the motion begins to diverge. This indicates that 40% (OURS) is the earliest point at which the small model can take over while still preserving the large-only result.

Prompt: A goat and two baby goats standing on a ledge. The goat is wearing a black blanket, and the baby goats are wearing a red and green blanket…

LLL: 100%
LSS: 60%
LSS (OURS): 40%
LSS: 20%

Late-stage boundary selection

Large-model reintroduction point (% from the end). We vary the point, measured from the end of sampling, at which the large model is reintroduced. LSL (OURS, 20%) is generally stable, while LSS (0%) shows clear artifacts, suggesting that late denoising benefits from higher capacity.

Prompt: A black and white dog sitting on a sofa with a green ball in front of it. The dog is looking at the camera and moving its head around. The dog then picks up the ball and starts chewing on it...

LLL
LSL: 40%
LSL (OURS): 20%
LSS: 0%