arxiv:2411.19509

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Published on Nov 29, 2024

Authors:

Abstract

Ditto is a diffusion-based framework that achieves real-time, high-quality, and controllable talking head synthesis by using an explicit motion space and an optimized inference strategy that includes audio feature extraction, motion generation, and video synthesis.

AI-generated summary

Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel at generating subtle expressions and natural head movements that are well aligned with the audio signal. However, these methods suffer from slow inference, insufficient fine-grained control over facial motion, and occasional visual artifacts, largely due to an implicit latent space derived from Variational Auto-Encoders (VAEs). These limitations prevent their adoption in realtime interactive applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, which replaces conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, all of which are crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.
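The three-stage streaming flow described in the abstract (audio feature extraction, then motion generation, then video synthesis) can be illustrated with a toy sketch. This is not Ditto's implementation: every function name, the `mouth_open` parameter, and the placeholder arithmetic are hypothetical stand-ins chosen for illustration. The sketch only shows how a generator-based pipeline can emit the first frame after a single audio chunk, and how an explicit motion representation leaves room for an edit step before rendering.

```python
from typing import Callable, Dict, Iterator, List, Optional

Motion = Dict[str, float]


def chunk_audio(samples: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks of audio samples, as if they arrived over time."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def extract_features(chunk: List[float]) -> float:
    """Placeholder feature (assumption): mean absolute amplitude of the chunk."""
    return sum(abs(s) for s in chunk) / len(chunk)


def generate_motion(feature: float) -> Motion:
    """Placeholder for the motion generator (assumption): maps an audio
    feature to one explicit, identity-agnostic motion parameter."""
    return {"mouth_open": min(1.0, feature)}


def render_frame(motion: Motion, identity: str) -> str:
    """Placeholder renderer (assumption): combines identity and motion."""
    return f"{identity}:mouth_open={motion['mouth_open']:.2f}"


def stream_talking_head(
    samples: List[float],
    identity: str,
    chunk_size: int = 4,
    edit_motion: Optional[Callable[[Motion], Motion]] = None,
) -> Iterator[str]:
    """Run the three stages chunk by chunk and yield each frame immediately."""
    for chunk in chunk_audio(samples, chunk_size):
        feature = extract_features(chunk)
        motion = generate_motion(feature)
        if edit_motion is not None:
            # An explicit motion space allows direct edits before rendering.
            motion = edit_motion(motion)
        yield render_frame(motion, identity)


if __name__ == "__main__":
    audio = [0.5] * 8
    for frame in stream_talking_head(audio, "face"):
        print(frame)
```

Because `stream_talking_head` is a generator, a caller can request the first frame with `next(...)` before the remaining audio has been processed, which mirrors the low first-frame delay property the abstract highlights. The optional `edit_motion` hook is one hypothetical way to expose control over the motion parameters.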
