README
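<font size=5>MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising</font> </br> Zhiqiang Xia <sup>*</sup>, Zhaokang Chen<sup>*</sup>, Bin Wu<sup>†</sup>, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, Wenjiang Zhou (<sup>*</sup>Equal contribution, <sup>†</sup>Corresponding author, benbinwu@tencent.com) </br> Lyra Lab, Tencent Music Entertainment
github huggingface HuggingfaceSpace project Technical report (coming soon)
We set out our world simulator vision in March 2023, believing that diffusion models can simulate the world. MuseV is a milestone we reached around July 2023. Amazed by the progress of Sora, we decided to open-source MuseV in the hope that it benefits the community. Next, we will move on to the more promising diffusion+Transformer scheme.
We will soon release MuseTalk, a real-time, high-quality lip-sync model that can be used together with MuseV as a complete virtual human generation solution. Please stay tuned!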
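Overview
MuseV is a diffusion-based virtual human video generation framework with the following features:
- Supports infinite-length generation through a novel visual conditioned parallel denoising scheme.
- Provides checkpoints for virtual human video generation, trained on a human dataset.
- Supports Image2Video, Text2Image2Video, and Video2Video.
- Compatible with the Stable Diffusion ecosystem, including base_model, lora, controlnet, etc.
- Supports multi-reference-image techniques, including IPAdapter, ReferenceOnly, ReferenceNet, and IPAdapterFaceID.
- Training code (coming soon).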
News
- [2024/03/27] Released the MuseV project and the trained models musev, muse_referencenet, and muse_referencenet_pose.
Model
Overview of the model structure

Parallel denoising
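This document does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "visual conditioned parallel denoising": a long frame sequence is cut into fixed-length windows, and every window is paired with the same visual condition so that windows could be denoised independently. The function names, the `window_size` parameter, and the `"visual_condition"` label are hypothetical and are not taken from the MuseV code.

```python
from typing import Dict, List


def split_into_windows(total_frames: int, window_size: int) -> List[List[int]]:
    """Split frame indices 0..total_frames-1 into consecutive windows."""
    if total_frames <= 0 or window_size <= 0:
        raise ValueError("total_frames and window_size must be positive")
    return [
        list(range(start, min(start + window_size, total_frames)))
        for start in range(0, total_frames, window_size)
    ]


def plan_parallel_denoising(
    total_frames: int,
    window_size: int,
    condition_id: str = "visual_condition",  # hypothetical label
) -> List[Dict[str, object]]:
    """Pair every window with the same visual condition.

    Assumption: because each window depends only on the shared condition
    and not on the previous window's output, windows can be processed
    independently of one another.
    """
    return [
        {"condition": condition_id, "frames": window}
        for window in split_into_windows(total_frames, window_size)
    ]


plan = plan_parallel_denoising(total_frames=30, window_size=12)
for item in plan:
    frames = item["frames"]
    print(item["condition"], frames[0], frames[-1])
```

In this reading, the video length is bounded only by the number of windows, which is one way the "infinite-length" claim could be interpreted.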

Cases
All frames were generated directly by the text2video model, without any post-processing.
The examples below can be found in configs/tasks/example.yaml
Text/Image2Video
Human
<!-- 3 columns: one image, one video, one prompt -->
<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td>image</td> <td>video</td> <td>prompt</td> </tr>
<tr> <td> <img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/cTQX49v7GT7GA-NEHj5vK.jpeg width="200"> </td> <td > <video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/U4sHVv_fYVbFHveS7Sw7h.mp4"></video> </td> <td>(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) </td> </tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/SPSgJpptVM4Qm11nqD07C.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/XMya1FCmRs6USzKp9qrAy.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/l6chBPhUKeOLbnXnX-ewG.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/pPnU6QXgWuWxw5SWZdl8N.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), peaceful beautiful sea scene
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/VQMEeGc1wTuiATtQLJjer.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/_9ZQEebUlmSNtXMJKiGPu.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), peaceful beautiful sea scene
</td>
</tr>
<!-- guitar -->
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/Fk_eec7vqq4NfAYVPNLI-.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/p0gWRwZTDOrbPf8mZOphG.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), playing guitar
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/Zcc5xj1-lA_EPS7gvJu99.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/4mT4KL3q4FzyQQKJfgXVG.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), playing guitar
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/iT5OsCpRNnntuS0TH1cG5.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/6RWBw73-oE4rJH808FzIK.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), playing guitar
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/Ym582ZF-MbYkRW1sAE5r3.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/lYqpeRcIiK7WEqRe4d8dZ.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1), playing guitar
</td>
</tr>
<!-- famous people -->
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/JsAjbl4AeYz089kWHjjUJ.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/KCjF9HutBo7el15gm3YV3.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1man, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/v-X3Wrkm14YwLGGloNlMK.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/P_Y5jUO1EJ6n3Z4qd1xh1.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/cNe41NV1OfLF5AmMKD6mi.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/L6mA8uuckRJhzhAJHayJT.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1man, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/6iHIaa15eBgop7BsE0Nps.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/foPX3iRk2TzjRl_V52T21.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
<tr>
<td>
<img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/R7rp7t4DPkws0dXRxi0bf.jpeg width="200">
</td>
<td>
<video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/tGTSNe9i08pvTMNe6SOBg.mp4"></video>
</td>
<td>
(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face,
soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)
</td>
</tr>
</table >
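Several prompts in the table above contain the placeholder `{eye_blinks_factor}`. A minimal sketch of how such a placeholder could be filled before the prompt is passed to a model, assuming plain Python string formatting; the `render_prompt` helper and the value `1.8` are illustrative assumptions, not values documented here, and how MuseV itself substitutes the placeholder is not stated in this document.

```python
# Prompt copied from the table above; the placeholder stays literal until rendered.
PROMPT_TEMPLATE = (
    "(masterpiece, best quality, highres:1),(1girl, solo:1),"
    "(beautiful face, soft skin, costume:1),"
    "(eye blinks:{eye_blinks_factor}),(head wave:1.3)"
)


def render_prompt(template: str, eye_blinks_factor: float) -> str:
    """Fill the {eye_blinks_factor} placeholder (hypothetical helper)."""
    return template.format(eye_blinks_factor=eye_blinks_factor)


# 1.8 is an arbitrary illustrative weight, not a recommended setting.
prompt = render_prompt(PROMPT_TEMPLATE, 1.8)
print(prompt)
```

Keeping the weight as a named parameter makes it easy to sweep different blink strengths over the same base prompt.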
Scene
<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td>image</td> <td>video</td> <td>prompt</td> </tr>
<tr> <td> <img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/EIembLBwySZTBjFZStFr_.jpeg width="200"> </td> <td> <video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/jfHV6a186BzAu-Tz0ET1o.mp4"></video> </td> <td> (masterpiece, best quality, highres:1), peaceful beautiful waterfall, an endless waterfall </td> </tr>
<tr> <td> <img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/u2_mzl5m-Z0nwSYFcTLxs.jpeg width="200"> </td> <td> <video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/eXrRVejCZs3QaA4-JK6Le.mp4"></video> </td> <td>(masterpiece, best quality, highres:1), peaceful beautiful river </td> </tr>
<tr> <td> <img src=https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/NIHfIi7onyJ5ELetE2f_Z.jpeg width="200"> </td> <td> <video width="400" controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/eaZyhukcfaoY7dIGp2cs6.mp4"></video> </td> <td>(masterpiece, best quality, highres:1), peaceful beautiful sea scene </td> </tr> </table >
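VideoMiddle2Video
Pose2video
In the duffy case, the pose of the visual condition frame is not aligned with the pose of the first frame of the control video. The posealign module can resolve this.
<table class="center"> <tr style="font-weight: bolder;text-align:center;"> <td>image</td> <td>video</td> <td>prompt</td> </tr>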
<tr> <td> <img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/fX1ND0YqDp1LV0LEh2eFN.png" width="200"> <img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/pe2aQt5FU66tplNZCOZaB.png" width="200"> </td> <td> <video width="900" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/IMPIDjR7-w5A_xc6ZHIzT.mp4" controls preload></video> </td> <td> (masterpiece, best quality, highres:1) </td> </tr>
<tr> <td> <img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/FlLWP8IqM_X2K4hXAOPHO.png" width="200"> </td> <td> <video width="900" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/OT22TR7e7Lcoxci9aoDBA.mp4" controls preload></video> </td> <td> (masterpiece, best quality, highres:1) </td> </tr> </table >
MuseTalk
<table class="center"> <tr style="font-weight: bolder;"> <td>name</td> <td>video</td> </tr> <tr> <td> talk </td> <td> <video width="350" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/wUhNS7j5UQ28eXu4JVQfF.mp4" controls preload></video> </td> </tr>
<tr> <td> talk </td> <td> <video width="350" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/bH-j15douHJcZXIEvMIDa.mp4" controls preload></video> </td> </tr>
<tr> <td> sing </td> <td> <video width="350" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/l5ZTbUQ11gK6FUtoaRz4S.mp4" controls preload></video> </td> </tr> </table >
Quickstart
Please refer to MuseV.
Acknowledgements
- MuseV draws heavily on TuneAVideo, diffusers, Moore-AnimateAnyone, animatediff, IP-Adapter, AnimateAnyone, VideoFusion, and insightface.
- MuseV is built on the ucf101 and webvid datasets.
Thanks for open-sourcing!
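Limitations
Many limitations remain, including:
- Limited generalization. Some visual condition images work well and others work poorly. Some pretrained t2i models work well and others work poorly.
- Limited types of video generation and limited motion range, partly because the types of training data are limited. The released MuseV was trained on about 60K human text-video pairs at a resolution of 512*320. At lower resolution, MuseV has a larger motion range but lower video quality. MuseV tends to produce a smaller motion range while keeping video quality high. Training on a larger, higher-resolution, higher-quality text-video dataset may improve MuseV.
- Watermarks may appear because of webvid. A cleaner dataset without watermarks may solve this issue.
- Limited types of long video generation. Visual conditioned parallel denoising can solve the accumulated-error problem of video generation, but the current method is only suitable for scenes with a relatively fixed camera.
- Undertrained referencenet and IP-Adapter, because of limited time and resources.
- Insufficiently structured code. MuseV supports rich and dynamic features, but the code is complex and has not been refactored. It takes time to become familiar with it.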
<!-- # Contribution: no need to organize open-source co-development for now -->
Citation
@article{musev,
title={MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising},
author={Xia, Zhiqiang and Chen, Zhaokang and Wu, Bin and Li, Chao and Hung, Kwok-Wai and Zhan, Chao and He, Yingjie and Zhou, Wenjiang},
journal={arxiv},
year={2024}
}
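Disclaimer/License
- Code: The code of MuseV is released under the MIT License. There is no restriction on academic or commercial use.
- Model: The trained models are available for non-commercial research purposes only.
- Other open-source models: Other open-source models used must comply with their own licenses, such as insightface, IP-Adapter, ft-mse-vae, etc.
- The test data was collected from the internet and is available for non-commercial research purposes only.
- AIGC: This project strives to have a positive impact on the field of AI-driven video generation. Users are free to create videos with this tool, but they are expected to comply with local laws and use it responsibly. The developers assume no responsibility for potential misuse by users.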
TMElyralab/MuseV
Author: TMElyralab
Created: 2024-03-19 07:18:42+00:00
Updated: 2024-04-01 15:48:25+00:00