
Multi-human Interactive Talking Dataset

Show Lab, National University of Singapore

Abstract

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To build it, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, with each clip featuring two to four speakers, and provides fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research.
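
To illustrate how aggregating individual pose embeddings can yield a representation whose size does not depend on the number of speakers, here is a minimal sketch. It is not the CovOG implementation: the function name `aggregate_pose_embeddings`, the plain-list embeddings, and the use of mean pooling as the aggregation operator are all assumptions made for illustration.

```python
from typing import List, Sequence


def aggregate_pose_embeddings(per_speaker: Sequence[Sequence[float]]) -> List[float]:
    """Average per-speaker pose embeddings into one fixed-size vector.

    Assumption: mean pooling stands in for whatever aggregation the
    Multi-Human Pose Encoder actually uses; the source does not specify it.
    """
    if not per_speaker:
        raise ValueError("expected at least one speaker embedding")
    dim = len(per_speaker[0])
    if any(len(emb) != dim for emb in per_speaker):
        raise ValueError("all speaker embeddings must share one dimension")
    count = len(per_speaker)
    # Element-wise mean over speakers: the output length equals the
    # embedding dimension, regardless of how many speakers are present.
    return [sum(emb[i] for emb in per_speaker) / count for i in range(dim)]


# Hypothetical toy embeddings for a two-speaker and a four-speaker clip.
two = aggregate_pose_embeddings([[1.0, 0.0], [3.0, 2.0]])
four = aggregate_pose_embeddings(
    [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 2.0]]
)
```

Both calls return a vector of the same length, so a downstream generator could consume clips with two, three, or four speakers through a single interface. A learned aggregation (for example, attention over speakers) would fit the same signature.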

Multi-human Interactive Talking (MIT) Dataset


We present a high-quality dataset for multi-human interactive talking video generation, comprising over 12 hours of high-resolution conversational clips with diverse interaction patterns and approximately 200 distinct identities. The dataset was constructed through a fully automated pipeline, facilitating future scale-up with minimal manual intervention.

The following videos are generated by CovOG and post-processed using RIFE to enhance motion smoothness. They demonstrate the effectiveness of multi-human interactive talking video generation and highlight the utility of the MIT dataset in supporting complex conversational scenarios.

Two-Human Conversation Results


Multi-Human Conversation Results


Cross-modal Result (Left: AnimateAnyone; Right: CovOG)