Introducing Dual-System Avatars
From a single image and a voice track, OmniHuman-1.5 generates expressive character animations that are coherent with the speech's rhythm, prosody, and semantic content, with optional text prompts for further refinement. Inspired by the dual-system ("System 1 / System 2") theory of cognition, our architecture bridges a Multimodal Large Language Model and a Diffusion Transformer, simulating two distinct modes of thought: slow, deliberate planning and fast, intuitive reaction. This synergy enables the generation of videos over one minute long with highly dynamic motion, continuous camera movement, and complex multi-character interactions.
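To make the division of labor concrete, here is a minimal Python sketch of how such a two-stage pipeline could be wired together. It illustrates the idea only: the class names (MultimodalPlanner, DiffusionRenderer, MotionPlan) and their interfaces are hypothetical placeholders for exposition, not the actual OmniHuman-1.5 API.

```python
# Conceptual sketch of the dual-system design described above.
# All names here are hypothetical placeholders: a "System 2" planner (an MLLM)
# drafts a coarse, deliberate motion/emotion plan from the image, audio, and
# optional text, while a "System 1" renderer (a Diffusion Transformer) reacts
# moment to moment, synthesizing frames conditioned on that plan and the audio.

from dataclasses import dataclass


@dataclass
class MotionPlan:
    """Coarse, timed plan produced by the deliberate planner."""
    segments: list[dict]  # e.g. {"start_s": 0.0, "end_s": 4.2,
                          #       "emotion": "joyful",
                          #       "action": "spreads arms, lifts head"}


class MultimodalPlanner:
    """Stand-in for the MLLM ("System 2"): slow, semantic, runs once per clip."""

    def plan(self, image, audio, text_prompt: str | None = None) -> MotionPlan:
        # In practice this would prompt an MLLM with the reference image, an
        # audio transcript/prosody summary, and the optional text prompt, then
        # parse its structured output into timed motion segments.
        return MotionPlan(segments=[{"start_s": 0.0, "end_s": 5.0,
                                     "emotion": "neutral",
                                     "action": "talks with subtle gestures"}])


class DiffusionRenderer:
    """Stand-in for the DiT ("System 1"): fast, reactive, runs per video chunk."""

    def render(self, image, audio, plan: MotionPlan, fps: int = 25) -> list:
        frames = []
        for seg in plan.segments:
            n_frames = int((seg["end_s"] - seg["start_s"]) * fps)
            # A real renderer would denoise latent video tokens conditioned on
            # the reference image, audio features, and this segment's plan.
            frames.extend([f"frame({seg['action']})"] * n_frames)
        return frames


def animate(image, audio, text_prompt=None):
    plan = MultimodalPlanner().plan(image, audio, text_prompt)  # deliberate pass
    return DiffusionRenderer().render(image, audio, plan)       # reactive pass


if __name__ == "__main__":
    video = animate(image="portrait.png", audio="speech.wav",
                    text_prompt="The camera slowly zooms in.")
    print(f"Generated {len(video)} frames.")
```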
Note: The results shown here were generated by our original research model. Each video begins with the input image, and subtitles were added manually. Although the videos are compressed, they may still take a moment to load. Thank you for your patience.
Context-Aware Audio-Driven Animation
Our model goes beyond simple lip-sync and repetitive gestures by interpreting the audio's semantic context, enabling characters to exhibit genuine emotional shifts and to match their gestures to their words and intent, as if driven by a will of their own.
Text-Guided Multimodal Animation
Our framework accepts text prompts and follows them closely, enabling precise control over object generation, camera movement, and specific actions, all while maintaining accurate audio sync.
Prompt: The camera follows as the man turns to face it and walks forward, singing in ecstasy. At times he touches his collar with both hands; at others he spreads his arms and lifts his head, lost in rapture.
Prompt: The camera zooms in rapidly to a close-up of the woman’s shoes, then slowly pans up to her face. The beautiful girl sways her body charmingly.
Prompt: The man sang intoxicatedly. He first glanced out the window, then placed his left hand on his chest as if in rapture. Next, he stood up and walked forward along the train aisle, once again placing his left hand on his chest.
Prompt: Handheld camera. A woman looks into the distance. In the background, there are fireworks. The wind is blowing her hair and clothes. It has the feel of an arthouse film, a lonely atmosphere, and is shot on film.
Prompt: Circle the camera to the right. When the camera focuses on the man's face, hold it still for a low, somber mood.
Prompt: The character's face moves forward, they look at the camera, then reach out and poke the camera lens. After that, the camera moves backward, and the character crosses their arms and starts to talk.
Prompt: Man takes cigarette out, looks to camera, speaks.
Prompt: A penguin is dancing. A pair of hands puts a cool pair of sunglasses on it. A band is playing, and the audience is cheering.
Prompt: A chick wearing sunglasses, holding two guns, talking, with an evil vibe.
More Results on Diverse Inputs
Our model generalizes robustly, producing high-quality, synchronized video across a wide diversity of subjects, including real animals, anthropomorphic characters, and stylized cartoons.
Ethical Considerations
The images and audio clips in these demos come from public sources or were generated by models, and are used solely to demonstrate the capabilities of this research. In particular, they serve to analyze how the model generates corresponding expressions and movements in response to a diverse range of inputs, thereby showcasing the potential technical contributions and academic value of our proposed framework. If there are any concerns, please contact us (jianwen.alan@gmail.com) and we will remove the relevant material promptly. The template for this webpage is adapted from APT Series. Special thanks to the author for sharing not only this wonderful template but also insightful research works.
BibTeX
If you find this project useful for your research, please consider citing us and checking out other related works from our team.
@inproceedings{lin2025omnihuman1,
  title     = {OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models},
  author    = {Lin, Gaojie and Jiang, Jianwen and Yang, Jiaqi and Zheng, Zerong and Liang, Chao and others},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
  note      = {Highlight}
}

@inproceedings{jiang2025loopy,
  title     = {Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency},
  author    = {Jiang, Jianwen and Liang, Chao and Yang, Jiaqi and Lin, Gaojie and Zhong, Tianyun and Zheng, Yanbo},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  note      = {Oral presentation}
}

@inproceedings{lin2025cyberhost,
  title     = {CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention},
  author    = {Lin, Gaojie and Jiang, Jianwen and Liang, Chao and Zhong, Tianyun and Yang, Jiaqi and Zheng, Yanbo},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  note      = {Oral presentation}
}