Introducing Dual-System Avatars
From a single image and a voice track, OmniHuman-1.5 generates expressive character animations that are coherent with the speech's rhythm, prosody, and semantic content, with optional text prompts for further refinement. Inspired by the dual-system ("System 1 / System 2") theory of cognition, our architecture bridges a Multimodal Large Language Model and a Diffusion Transformer, simulating two distinct modes of thought: slow, deliberate planning and fast, intuitive reaction. This synergy enables the generation of videos over one minute long with highly dynamic motion, continuous camera movement, and complex multi-character interactions.
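To make the division of labor concrete, here is a minimal Python sketch of how such a two-stage pipeline could be wired together. It illustrates the idea only: the class names (MultimodalPlanner, DiffusionRenderer, MotionPlan) and their interfaces are hypothetical placeholders for exposition, not the actual OmniHuman-1.5 API.

```python
# Conceptual sketch of the dual-system design described above.
# All names here are hypothetical placeholders: a "System 2" planner (an MLLM)
# drafts a coarse, deliberate motion/emotion plan from the image, audio, and
# optional text, while a "System 1" renderer (a Diffusion Transformer) reacts
# moment to moment, synthesizing frames conditioned on that plan and the audio.

from dataclasses import dataclass


@dataclass
class MotionPlan:
    """Coarse, timed plan produced by the deliberate planner."""
    segments: list[dict]  # e.g. {"start_s": 0.0, "end_s": 4.2,
                          #       "emotion": "joyful",
                          #       "action": "spreads arms, lifts head"}


class MultimodalPlanner:
    """Stand-in for the MLLM ("System 2"): slow, semantic, runs once per clip."""

    def plan(self, image, audio, text_prompt: str | None = None) -> MotionPlan:
        # In practice this would prompt an MLLM with the reference image, an
        # audio transcript/prosody summary, and the optional text prompt, then
        # parse its structured output into timed motion segments.
        return MotionPlan(segments=[{"start_s": 0.0, "end_s": 5.0,
                                     "emotion": "neutral",
                                     "action": "talks with subtle gestures"}])


class DiffusionRenderer:
    """Stand-in for the DiT ("System 1"): fast, reactive, runs per video chunk."""

    def render(self, image, audio, plan: MotionPlan, fps: int = 25) -> list:
        frames = []
        for seg in plan.segments:
            n_frames = int((seg["end_s"] - seg["start_s"]) * fps)
            # A real renderer would denoise latent video tokens conditioned on
            # the reference image, audio features, and this segment's plan.
            frames.extend([f"frame({seg['action']})"] * n_frames)
        return frames


def animate(image, audio, text_prompt=None):
    plan = MultimodalPlanner().plan(image, audio, text_prompt)  # deliberate pass
    return DiffusionRenderer().render(image, audio, plan)       # reactive pass


if __name__ == "__main__":
    video = animate(image="portrait.png", audio="speech.wav",
                    text_prompt="The camera slowly zooms in.")
    print(f"Generated {len(video)} frames.")
```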
Note: The results shown here were generated by our original research model. Each video begins with the input image, and subtitles were added manually. Although the videos are compressed, they may still take a moment to load. Thank you for your patience.
Context-Aware Audio-Driven Animation
Our model goes beyond simple lip-sync and repetitive gestures by interpreting the audio's semantic context, enabling characters to exhibit genuine emotional shifts and to match their gestures to their words and intent, as if driven by a will of their own.
Text-Guided Multimodal Animation
Our framework accepts text prompts and follows them closely, enabling precise control over object generation, camera movement, and specific actions, all while maintaining accurate audio sync.
Prompt: The camera follows as the man turns to face it and walks forward, singing in ecstasy. At times he touches his collar with both hands; at others he spreads his arms and lifts his head, lost in rapture.
Prompt: The camera zooms in rapidly to a close-up of the woman’s shoes, then slowly pans up to her face. The beautiful girl sways her body charmingly.
Prompt: The man sang intoxicatedly. He first glanced out the window, then placed his left hand on his chest as if in rapture. Next, he stood up and walked forward along the train aisle, once again placing his left hand on his chest.
Prompt: Handheld camera. A woman looks into the distance. In the background, there are fireworks. The wind is blowing her hair and clothes. It has the feel of an arthouse film, a lonely atmosphere, and is shot on film.
Prompt: Circle the camera to the right. When the camera focuses on the man's face, hold it still for a low, somber mood.
Prompt: The character's face moves forward, they look at the camera, then reach out and poke the camera lens. After that, the camera moves backward, and the character crosses their arms and starts to talk.
Prompt: Man takes cigarette out, looks to camera, speaks.
Prompt: A penguin is dancing. A pair of hands puts a cool pair of sunglasses on it. A band is playing, and the audience is cheering.
Prompt: A chick wearing sunglasses, holding two guns, talking, with an evil vibe.
More Results on Diverse Inputs
Our model generalizes robustly, producing high-quality, synchronized video across a wide diversity of subjects, including real animals, anthropomorphic characters, and stylized cartoons.
Ethical Considerations
The images and audio clips in these demos come from public sources or were generated by models, and are used solely to demonstrate the capabilities of this research. In particular, they serve to analyze how the model generates corresponding expressions and movements in response to a diverse range of inputs, thereby showcasing the potential technical contributions and academic value of our proposed framework. If there are any concerns, please contact us (jianwen.alan@gmail.com) and we will remove the relevant material promptly. The template for this webpage is adapted from APT Series. Special thanks to the author for sharing not only this wonderful template but also insightful research works.
BibTeX
If you find this project useful for your research, please consider citing us and checking out other related works from our team.
@inproceedings{lin2025omnihuman1,
  title     = {OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models},
  author    = {Lin, Gaojie and Jiang, Jianwen and Yang, Jiaqi and Zheng, Zerong and Liang, Chao and others},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
  note      = {Highlight}
}

@inproceedings{jiang2025loopy,
  title     = {Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency},
  author    = {Jiang, Jianwen and Liang, Chao and Yang, Jiaqi and Lin, Gaojie and Zhong, Tianyun and Zheng, Yanbo},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  note      = {Oral presentation}
}

@inproceedings{lin2025cyberhost,
  title     = {CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention},
  author    = {Lin, Gaojie and Jiang, Jianwen and Liang, Chao and Zhong, Tianyun and Yang, Jiaqi and Zheng, Yanbo},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  note      = {Oral presentation}
}