Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity.
We present Syncphony, which generates 380×640 resolution, 24 fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning in high-motion regions, and (2) Audio Sync Guidance, which guides the full model with a visually aligned off-sync model that lacks audio layers, so that audio cues are better exploited at inference while visual quality is maintained. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures how many motion cues the generated video provides for reconstructing the original audio.
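To illustrate the idea behind CycleSync, the sketch below reconstructs audio from the generated video with an assumed off-the-shelf video-to-audio model and correlates onset envelopes against the original audio; `v2a_model`, `onset_envelope`, and the correlation-based score are illustrative assumptions, not the paper's exact protocol.

```python
def cyclesync_score(v2a_model, generated_video, original_audio, onset_envelope):
    """Hypothetical sketch of a CycleSync-style evaluation: reconstruct audio from the
    generated video, then compare onset envelopes with the original audio.
    `v2a_model` and `onset_envelope` are assumed helpers returning 1-D tensors."""
    reconstructed = v2a_model(generated_video)      # audio predicted from video motion cues
    env_rec = onset_envelope(reconstructed)         # (T,) onset strength over time
    env_ref = onset_envelope(original_audio)
    # Normalize both envelopes before correlating them.
    env_rec = (env_rec - env_rec.mean()) / (env_rec.std() + 1e-6)
    env_ref = (env_ref - env_ref.mean()) / (env_ref.std() + 1e-6)
    return (env_rec * env_ref).mean()               # higher = better temporal agreement
```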
Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
Given an initial frame, a text prompt, and an audio waveform, the model autoregressively predicts each video latent through iterative denoising. At each timestep, it conditions on previously generated latents while receiving multimodal guidance: text features via joint self-attention and audio features via cross-attention. For brevity, latents are visualized as RGB frames, but they are in fact spatiotemporal features extracted by a VAE.
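A minimal sketch of this autoregressive conditioning loop, assuming a simplified `denoiser` interface that fuses previous latents, text, and audio internally, and a diffusers-style `scheduler`; names, signatures, and the audio feature layout are hypothetical.

```python
import torch

@torch.no_grad()
def generate_video_latents(denoiser, scheduler, first_latent, text_feats, audio_feats,
                           num_latents, num_steps=30):
    """Hypothetical sketch: each latent is denoised while attending to previously
    generated latents, text (joint self-attention inside the denoiser), and the
    audio segment aligned with the current latent (cross-attention)."""
    generated = [first_latent]
    for i in range(1, num_latents):
        z = torch.randn_like(first_latent)           # start from Gaussian noise
        audio_i = audio_feats[:, i]                  # audio features aligned with latent i (assumed layout)
        for t in scheduler.timesteps[:num_steps]:
            prev = torch.stack(generated, dim=1)     # condition on all previously generated latents
            eps = denoiser(z, t, prev_latents=prev, text=text_feats, audio=audio_i)
            z = scheduler.step(eps, t, z).prev_sample
        generated.append(z)
    return torch.stack(generated, dim=1)             # (B, num_latents, C, H, W)
```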
Videos generated from the same image input with temporally shifted audio show that the motion varies with the alignment of the audio cues.
Qualitative comparison of videos generated by Syncphony (Ours), AVSyncD, and Pyramid Flow (fine-tuned), a variant of our model without audio cross-attention layers. Our method generates motion that is temporally aligned with audio events and produces clearer motion dynamics and more stable appearance.
Incorporating the Motion-aware Loss improves both the magnitude and temporal precision of motion, particularly at the onsets and offsets of dynamic actions.
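A minimal sketch of a motion-aware weighting, assuming motion is estimated from temporal differences of the clean latents and used to re-weight a standard denoising loss; the exact weighting used by Syncphony may differ.

```python
import torch

def motion_aware_loss(pred, target, clean_latents, alpha=1.0, eps=1e-6):
    """Hypothetical sketch: up-weight the denoising loss in high-motion regions.

    pred, target:  (B, T, C, H, W) model prediction and its regression target.
    clean_latents: (B, T, C, H, W) clean video latents used to estimate motion.
    """
    # Per-region motion magnitude from temporal differences of the clean latents.
    diff = (clean_latents[:, 1:] - clean_latents[:, :-1]).abs().mean(dim=2, keepdim=True)
    motion = torch.cat([diff[:, :1], diff], dim=1)   # reuse the first difference for frame 0
    # Normalize so the average weight stays near 1, then emphasize high-motion regions.
    weight = 1.0 + alpha * motion / (motion.mean(dim=(1, 2, 3, 4), keepdim=True) + eps)
    return (weight * (pred - target) ** 2).mean()
```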
Applying Audio Sync Guidance (ASG) captures subtle yet important sounds and generates motion precisely aligned with the audio cues (Full Model (w=2) hits the exact target).
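A minimal sketch of how such a guidance weight could combine the two predictions, analogous to classifier-free guidance; the exact formulation used by Syncphony may differ.

```python
def audio_sync_guidance(eps_full, eps_offsync, w=2.0):
    """Hypothetical sketch: steer sampling toward audio-driven cues by extrapolating
    from the off-sync model (no audio layers) toward the full, audio-conditioned model.
    w=1 recovers the full model; larger w amplifies the audio-specific direction."""
    return eps_offsync + w * (eps_full - eps_offsync)
```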
Applying Audio RoPE to the audio features yields tighter temporal alignment between motion and sound events.
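A minimal sketch of rotary position embedding applied to audio tokens, assuming the audio token positions are expressed on the same temporal axis as the video latents; the concrete layout in Syncphony may differ.

```python
import torch

def apply_audio_rope(x, positions, base=10000.0):
    """Hypothetical sketch of rotary position embedding (RoPE) on audio tokens.

    x:         (B, N, D) audio features with even D.
    positions: (N,) temporal positions of the audio tokens, e.g. expressed in units of
               video latent frames so that sound and motion share a time coordinate.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = positions[:, None].to(x.dtype) * freqs[None, :]   # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each feature pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```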