Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

Yonsei University

Abstract

Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity.

We present Syncphony, which generates 380×640 resolution, 24 fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions, and (2) Audio Sync Guidance, which guides the full model with a visually aligned off-sync model that lacks audio layers, better exploiting audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures whether the generated video carries enough motion cues to reconstruct the original audio.

Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
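As a rough illustration of the CycleSync idea, the sketch below regenerates audio from the generated video with an off-the-shelf video-to-audio model and scores how well the reconstructed audio's onsets line up with the original audio. The `video_to_audio` callable and the onset-correlation score are assumptions for illustration, not the metric's exact formulation.

```python
import numpy as np
import librosa

def onset_envelope(wav: np.ndarray, sr: int) -> np.ndarray:
    """Frame-level onset strength of a waveform, normalized to unit norm."""
    env = librosa.onset.onset_strength(y=wav, sr=sr)
    return env / (np.linalg.norm(env) + 1e-8)

def cyclesync_score(video_frames, original_wav: np.ndarray, sr: int, video_to_audio) -> float:
    """Higher is better: the generated video carries enough motion cues for the
    V2A model to reproduce the timing of the original audio."""
    reconstructed_wav = video_to_audio(video_frames, sr=sr)  # hypothetical V2A model call
    a = onset_envelope(original_wav, sr)
    b = onset_envelope(reconstructed_wav, sr)
    n = min(len(a), len(b))
    return float(np.dot(a[:n], b[:n]))  # cosine-style agreement of onset envelopes
```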

Framework Overview

Given an initial frame, a text prompt, and an audio waveform, the model autoregressively predicts each video latent through iterative denoising. At each timestep, it conditions on previously generated latents while receiving multimodal guidance: text features via joint self-attention and audio features via cross-attention. For brevity, latents are visualized as RGB frames, but they are spatiotemporal features extracted by a VAE.
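A minimal sketch of this loop is shown below, with hypothetical names (`denoiser`, `text_tokens`, `audio_tokens`) standing in for the actual Syncphony components; it illustrates the structure of the conditioning, not the released implementation.

```python
import torch

@torch.no_grad()
def generate(denoiser, first_latent, text_tokens, audio_tokens,
             num_latents: int, num_steps: int = 30):
    """Autoregressively predict one video latent at a time by iterative denoising,
    conditioning each prediction on all previously generated latents."""
    history = [first_latent]                    # latent of the given initial frame
    for _ in range(num_latents - 1):
        x = torch.randn_like(first_latent)      # start the next latent from noise
        for step in reversed(range(num_steps)):
            t = torch.full((x.shape[0],), step, device=x.device)
            # the denoiser sees: the noisy latent, the clean history, text tokens
            # (joint self-attention), audio tokens (cross-attention), and the timestep
            x = denoiser(x, context=torch.stack(history, dim=1),
                         text=text_tokens, audio=audio_tokens, t=t)
        history.append(x)
    return torch.stack(history, dim=1)          # (B, T, C, H, W) latent video
```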

Shifted Audio with the Same Image Input

Videos generated from the same image input but with temporally shifted audio show that the generated motion shifts accordingly, following the timing of the audio cues.

machine gun shooting
Audio 1
Audio 2
striking bowling
Audio 1
Audio 2
playing trombone
Audio 1
Audio 2

Comparison

Qualitative comparison of videos generated by Syncphony (Ours), AVSyncD, and fine-tuned Pyramid Flow, a variant of our model without the audio cross-attention layers. Our method generates motion that is temporally aligned with audio events and produces clearer motion dynamics and more stable appearances.

frog croaking
lions roaring
machine gun shooting
playing cello
playing violin fiddle
dog barking
cap gun shooting
chicken crowing
playing trombone
toilet flushing
baby babbling crying
hammering

Ablations

1. Motion-aware Loss

Incorporating the Motion-aware Loss improves both the magnitude and the temporal precision of motion, particularly at the onsets and offsets of dynamic actions.

lions roaring
frog croaking
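A minimal sketch of one way to implement such motion-aware weighting is given below: the diffusion loss is upweighted where consecutive clean latent frames differ most. The specific weighting formula and the `alpha` parameter are illustrative assumptions, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def motion_aware_loss(pred: torch.Tensor, target: torch.Tensor,
                      latents: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """pred/target: (B, T, C, H, W) model prediction and diffusion target.
    latents: (B, T, C, H, W) clean video latents used to estimate motion."""
    # per-location motion magnitude from temporal differences of the clean latents
    diff = latents[:, 1:] - latents[:, :-1]
    motion = diff.abs().mean(dim=2, keepdim=True)          # (B, T-1, 1, H, W)
    motion = F.pad(motion, (0, 0, 0, 0, 0, 0, 1, 0))        # zero motion for the first frame
    weight = 1.0 + alpha * motion / (motion.mean() + 1e-8)  # upweight high-motion regions
    return (weight * (pred - target) ** 2).mean()
```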

2. Audio Sync Guidance

Applying Audio Sync Guidance (ASG) captures subtle yet important sounds and generates motion precisely aligned with the audio cues (the full model with w=2 hits the exact target).

hitting with a stick
hitting with a stick
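A minimal sketch of how such guidance can be combined at inference is shown below, treating ASG as a classifier-free-guidance-style extrapolation between the full audio-conditioned model and the off-sync model without audio layers; the exact combination rule used by Syncphony may differ, and the module names are placeholders.

```python
import torch

def audio_sync_guidance(full_model, offsync_model, x_t, t,
                        text, audio, history, w: float = 2.0) -> torch.Tensor:
    """Combine the two denoising predictions with guidance scale w
    (w=1 recovers the full model's prediction)."""
    eps_full = full_model(x_t, t=t, text=text, audio=audio, context=history)
    eps_off = offsync_model(x_t, t=t, text=text, context=history)  # no audio layers
    # push the prediction along the audio-driven direction
    return eps_off + w * (eps_full - eps_off)
```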

3. Audio RoPE

Applying Audio RoPE to the audio features yields tighter temporal alignment between motion and sound events.

cap gun shooting
playing trombone
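For reference, the sketch below applies standard 1D rotary position embeddings to a sequence of audio tokens, which is one plausible reading of "Audio RoPE"; the frequency base and how audio positions are scaled to match video latent positions are assumptions.

```python
import torch

def apply_audio_rope(audio_tokens: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate audio token features by their temporal position.
    audio_tokens: (B, T_audio, D) with D even."""
    b, t, d = audio_tokens.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=audio_tokens.device) / half)        # (half,)
    angles = torch.arange(t, device=audio_tokens.device)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = audio_tokens[..., :half], audio_tokens[..., half:]
    # standard RoPE rotation applied along the audio time axis
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```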