README

<h1 align='center'>Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks</h1>

<div align='center'> <a href='https://github.com/cuijh26' target='_blank'>Jiahao Cui</a><sup>1</sup> <a href='https://github.com/crystallee-ai' target='_blank'>Hui Li</a><sup>1</sup> <a href='https://github.com/subazinga' target='_blank'>Yun Zhan</a><sup>1</sup> <a href='https://github.com/NinoNeumann' target='_blank'>Hanlin Shang</a><sup>1</sup> <a href='https://github.com/Kaihui-Cheng' target='_blank'>Kaihui Cheng</a><sup>1</sup> <a href='https://github.com/mayuqi7777' target='_blank'>Yuqi Ma</a><sup>1</sup> <a href='https://github.com/AricGamma' target='_blank'>Shan Mu</a><sup>1</sup> </div> <div align='center'> <a href='https://hangz-nju-cuhk.github.io/' target='_blank'>Hang Zhou</a><sup>2</sup> <a href='https://jingdongwang2017.github.io/' target='_blank'>Jingdong Wang</a><sup>2</sup> <a href='https://sites.google.com/site/zhusiyucs/home' target='_blank'>Siyu Zhu</a><sup>1✉️</sup> </div>

📸 Showcase

Visit our project page to view more cases.

⚙️ Installation

System requirements: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
Tested GPUs: H100

Download the code:

  git clone https://github.com/fudan-generative-vision/hallo3
  cd hallo3

Create a conda environment:

  conda create -n hallo python=3.10
  conda activate hallo

Install the required packages with pip:

  pip install -r requirements.txt

In addition, ffmpeg is required:

  apt-get install ffmpeg

📥 Download Pretrained Models

You can get all the pretrained models required for inference from our HuggingFace repo.

Use huggingface-cli to download the models:

cd $ProjectRootDir
pip install -U "huggingface_hub[cli]"
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models

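Alternatively, you can download them separately from their source repositories:

hallo3: Our checkpoints.
CogVideoX: CogVideoX-5b-i2v pretrained model, consisting of the transformer and the 3D VAE.
t5-v1_1-xxl: Text encoder; you can download it from text_encoder and tokenizer.
audio_separator: Kim Vocal_2 MDX-Net vocal separation model.
wav2vec: WAV audio-to-vector model from Facebook.
insightface: 2D and 3D face analysis models, placed under pretrained_models/face_analysis/models/. (Thanks to deepinsight)
face landmarker: Face detection and mesh model from mediapipe, placed under pretrained_models/face_analysis/models.

Finally, these pretrained models should be organized as follows: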

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
|   |-- transformer/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- vae/
|       `-- 3d-vae.pt
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- hallo3/
|   |-- 1/
|   |   `-- mp_rank_00_model_states.pt
|   `-- latest
|-- t5-v1_1-xxl/
|   |-- added_tokens.json
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
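
The layout above can be checked before running inference. The following is a minimal sketch, not part of the Hallo3 code base: the function name `find_missing` and the choice of which files to check are assumptions made for illustration, while the paths themselves are copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
# Which files to check is an illustrative choice, not an official list.
EXPECTED_FILES = [
    "audio_separator/Kim_Vocal_2.onnx",
    "cogvideox-5b-i2v-sat/transformer/1/mp_rank_00_model_states.pt",
    "cogvideox-5b-i2v-sat/transformer/latest",
    "cogvideox-5b-i2v-sat/vae/3d-vae.pt",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "face_analysis/models/scrfd_10g_bnkps.onnx",
    "hallo3/1/mp_rank_00_model_states.pt",
    "hallo3/latest",
    "t5-v1_1-xxl/config.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]


def find_missing(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing("./pretrained_models")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```

Run it from the project root after the download finishes; any path it prints still needs to be fetched or moved into place.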

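🛠️ Prepare Inference Data

Hallo3 has a few simple requirements for the inference input data:

The reference image must have a 1:1 or 3:2 aspect ratio.
The driving audio must be in WAV format.
The audio must be in English, since our training datasets contain only English.
Ensure the vocals in the audio are clear; background music is acceptable.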
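
The first two requirements above can be checked mechanically. The sketch below is an illustration only and uses just the Python standard library. The function names, the tolerance value, and the reading of the ratio as width:height are assumptions, not something the Hallo3 scripts define.

```python
import wave


def has_supported_aspect_ratio(width, height, tolerance=0.01):
    """True if width:height is close to 1:1 or 3:2.

    Assumptions: the ratio is read as width:height, and the
    tolerance is an illustrative value.
    """
    ratio = width / height
    return any(abs(ratio - target) <= tolerance for target in (1.0, 1.5))


def is_wav_file(path):
    """True if the standard-library wave module can parse the file.

    Note: the wave module handles PCM WAV files; other WAV encodings
    may be rejected here even though they are valid WAV files.
    """
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() >= 0
    except (wave.Error, EOFError, OSError):
        return False
```

The image size itself has to come from an image library or tool of your choice; `has_supported_aspect_ratio` only takes the resulting width and height.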

🎮 Run Inference

Simply run scripts/inference_long_batch.sh:

bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output

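Animation results will be saved in ./output. You can find more inference examples in the examples folder.

Training

Prepare Data for Training

Organize your raw videos into the following directory structure: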

dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   `-- 0003.mp4
`-- caption/
    |-- 0001.txt
    |-- 0002.txt
    `-- 0003.txt

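You can use any dataset_name, but make sure the videos directory and the caption directory are named as shown above.

Next, process the videos with the following command: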
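
Before preprocessing, it can help to confirm that the two directories line up. The sketch below assumes, based on the example tree, that each video in videos/ is paired with a caption file of the same base name in caption/. The function name `unpaired_items` is hypothetical and not part of the repository.

```python
from pathlib import Path


def unpaired_items(dataset_dir):
    """Return (videos_without_caption, captions_without_video).

    Assumption: 0001.mp4 in videos/ pairs with 0001.txt in caption/,
    as the example tree suggests. Both results are sorted lists of
    base names without extensions.
    """
    root = Path(dataset_dir)
    video_stems = {p.stem for p in (root / "videos").glob("*.mp4")}
    caption_stems = {p.stem for p in (root / "caption").glob("*.txt")}
    return (
        sorted(video_stems - caption_stems),
        sorted(caption_stems - video_stems),
    )
```

Two empty lists mean every video has a caption and every caption has a video.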

bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
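
The {parallelism} and {rank} arguments suggest the work is split into several shards, with one invocation per shard. The sketch below only builds the command lines. It assumes that ranks are zero-based and run from 0 to parallelism - 1, which this document does not state, and the helper name `preprocess_commands` is hypothetical.

```python
import shlex


def preprocess_commands(dataset_name, parallelism, output_name):
    """Build one data_preprocess.sh command line per rank.

    Assumption: ranks are zero-based, 0 .. parallelism - 1.
    """
    return [
        shlex.join([
            "bash", "scripts/data_preprocess.sh",
            dataset_name, str(parallelism), str(rank), output_name,
        ])
        for rank in range(parallelism)
    ]


# Print the commands for a two-way split; run each one in its own shell.
for command in preprocess_commands("dataset_name", 2, "output_name"):
    print(command)
```

Building the strings separately from running them keeps the sketch free of side effects; launch the printed commands however suits your machine.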

Training

Update the data meta path settings in the configuration YAML files, configs/sft_s1.yaml and configs/sft_s2.yaml:

#sft_s1.yaml
train_data: [
    "./data/output_name.json"
]

#sft_s2.yaml
train_data: [
    "./data/output_name.json"
]

Start training with the following commands:

# Stage 1
bash scripts/finetune_multi_gpus_s1.sh

# Stage 2
bash scripts/finetune_multi_gpus_s2.sh

📝 Citation

If you find our work useful for your research, please consider citing the paper:

@misc{cui2024hallo3,
	title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks}, 
	author={Jiahao Cui and Hui Li and Yun Zhang and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang Zhou and Jingdong Wang and Siyu Zhu},
	year={2024},
	eprint={2412.00733},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
}

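⚠️ Social Risks and Mitigations

The development of audio-driven portrait image animation technology poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Using individuals' images and voices also raises privacy and consent concerns. Addressing these requires transparent data usage policies, informed consent, and the safeguarding of privacy rights. By addressing these risks and implementing mitigations, this research aims to ensure the responsible and ethical development of this technology.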

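🤗 Acknowledgements

This model is a fine-tuned derivative of the CogVideo-5B I2V model. CogVideo-5B is an open-source text-to-video generation model developed by the CogVideoX team. Its original code and model parameters are governed by the CogVideo-5B license.

As a derivative work of CogVideo-5B, the use, distribution, and modification of this model must comply with the license terms of CogVideo-5B.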

👏 Community Contributors

Thank you to all the contributors who have helped improve this project!

fudan-generative-ai/hallo3

Author: fudan-generative-ai

image-to-video

Created: 2024-11-27 06:47:55+00:00

Updated: 2025-01-20 12:14:16+00:00
