👉🏻 CosyVoice 👈🏻
Fun-CosyVoice 3.0: Demo; Paper; Modelscope; Huggingface; CV3-Eval
CosyVoice 2.0: Demo; Paper; Modelscope; HuggingFace
CosyVoice 1.0: Demo; Paper; Modelscope; HuggingFace
Highlights 🔥
Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system built on a large language model (LLM). It surpasses its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness, and is designed for zero-shot multilingual speech synthesis in real-world scenarios.
Key features
- Language coverage: 9 widely used languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Cantonese, Hokkien, Sichuan, Northeastern, Shaanxi, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), plus multilingual/cross-lingual zero-shot voice cloning.
- Content consistency and naturalness: industry-leading content consistency, speaker similarity, and prosody naturalness.
- Pronunciation inpainting: supports Chinese pinyin and English CMU phoneme inpainting for stronger controllability, suitable for production use.
- Text normalization: reads numbers, special symbols, and various text formats aloud without a traditional frontend module.
- Bidirectional streaming: supports both streaming text input and streaming audio output, achieving latency as low as 150 ms while preserving high-quality audio.
- Instruction support: control over language, dialect, emotion, speaking rate, volume, and more.
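In streaming mode, inference calls yield audio chunks as they are generated, so playback can begin after the first chunk instead of waiting for the whole utterance. The sketch below shows the consumption pattern and how first-chunk latency would be measured; `FakeModel` is an illustrative stand-in for the real model object, whose output dictionaries use the `tts_speech` key as in the usage examples in this README.

```python
import time

def consume_stream(model, text, prompt_text, prompt_wav):
    """Iterate over streamed synthesis output and record when the
    first audio chunk arrives (the perceived latency)."""
    chunks, first_chunk_at = [], None
    start = time.monotonic()
    for out in model.inference_zero_shot(text, prompt_text, prompt_wav, stream=True):
        if first_chunk_at is None:
            first_chunk_at = time.monotonic() - start
        chunks.append(out['tts_speech'])  # each chunk is a short audio segment
    return chunks, first_chunk_at

# Stand-in model for illustration; a real run would use the AutoModel
# object and audio tensors instead of plain lists.
class FakeModel:
    def inference_zero_shot(self, text, prompt_text, prompt_wav, stream=False):
        for segment in ([0.0, 0.1], [0.2, 0.3]):
            yield {'tts_speech': segment}

chunks, latency = consume_stream(FakeModel(), 'hello', 'prompt', 'prompt.wav')
```

With the real model, each `tts_speech` chunk can be played or saved immediately, which is what keeps end-to-end latency near the first-chunk time rather than the full synthesis time.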
Roadmap
- [x] 2025/12
  - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, RL model, and their training/inference scripts
  - [x] Release the Fun-CosyVoice3-0.5B Modelscope Gradio Space
- [x] 2025/08
  - [x] Thanks to NVIDIA's Yuekai Zhang for contributing triton trtllm runtime support and cosyvoice2 GRPO training support
- [x] 2025/07
  - [x] Release the Fun-CosyVoice 3.0 evaluation set
- [x] 2025/05
  - [x] Add CosyVoice2-0.5B vllm support
- [x] 2024/12
  - [x] Release 25 Hz CosyVoice2-0.5B
- [x] 2024/09
  - [x] 25 Hz CosyVoice-300M base model
  - [x] 25 Hz CosyVoice-300M voice conversion
- [x] 2024/08
  - [x] Repetition-aware sampling (RAS) inference for improved LLM stability
  - [x] Streaming inference mode support, including kv cache and sdpa for RTF optimization
- [x] 2024/07
  - [x] Flow matching training support
  - [x] WeTextProcessing support when ttsfrd is unavailable
  - [x] FastAPI server and client
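The repetition-aware sampling (RAS) item above follows the idea described in the CosyVoice papers: the decoder keeps its proposed token unless that token has repeated too often in a recent window, in which case it falls back to random sampling to break degenerate loops. A minimal sketch of that fallback logic (the window size, threshold, and sampler interface here are illustrative, not the repository's actual implementation):

```python
import random

def ras_pick(candidate, history, sample_fn, window=10, max_repeats=3):
    """Repetition-aware fallback: keep `candidate` unless it already
    occurs `max_repeats` or more times in the last `window` tokens,
    in which case draw a fresh token from `sample_fn` instead."""
    recent = history[-window:]
    if recent.count(candidate) >= max_repeats:
        return sample_fn()  # resample to break the repetition loop
    return candidate

# Illustrative use: token 7 saturates the recent window, so it is replaced.
history = [7, 7, 7, 1, 7]
picked = ras_pick(7, history, sample_fn=lambda: random.choice([1, 2, 3]))
assert picked in (1, 2, 3)
```

The threshold trades off stability against diversity: a low `max_repeats` suppresses stutters aggressively, while a high one only intervenes on long loops.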
Evaluation
| Model | Open-source | Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
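The CER/WER columns above are edit-distance error rates: the number of character (or word) substitutions, insertions, and deletions between the transcript of the synthesized audio and the input text, divided by the reference length. A minimal character-level sketch of the metric (real evaluations also normalize text before scoring):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance between `ref` and
    `hyp`, divided by the reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m if m else 0.0  # empty reference: define as 0.0

assert cer("abcd", "abcd") == 0.0
assert cer("abcd", "abed") == 0.25  # one substitution out of four characters
```

Speaker similarity, by contrast, is typically the cosine similarity between speaker embeddings of the prompt and the synthesized audio, so higher is better.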
Installation
Clone and install
- Clone the repo:
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
# If the submodule clone fails due to network issues, run the following command until it succeeds
git submodule update --init --recursive
- Install Conda: see https://docs.conda.io/en/latest/miniconda.html
- Create a Conda environment:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you encounter sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
Model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text-normalization performance.
Note that this step is not required; if the ttsfrd package is not installed, wetext is used by default.
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
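Since ttsfrd is optional, a quick way to confirm which text-normalization backend will actually be picked up is to probe the imports, mirroring the fallback described above. This probe is an illustrative check, not the repository's own detection code:

```python
import importlib.util

def tn_backend():
    """Return the text-normalization backend that would be used:
    ttsfrd when it is installed, otherwise the wetext fallback."""
    if importlib.util.find_spec('ttsfrd') is not None:
        return 'ttsfrd'
    return 'wetext'

print(tn_backend())
```

Running this inside the activated conda environment tells you whether the optional install above took effect.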
Basic usage
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio
""" CosyVoice3 usage; see https://funaudiollm.github.io/cosyvoice3/ for more details
"""
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# English zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Chinese zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Fine-grained control; see cosyvoice/tokenizer/tokenizer.py#L280 for supported control tags
for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Instruct usage; see cosyvoice/utils/common.py#L28 for supported instructions
for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Hotfix usage (pinyin-based pronunciation correction)
for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
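Each of the loops above repeats the same enumerate-and-save pattern, which can be folded into one helper. This wrapper is a convenience for illustration, not part of the CosyVoice API; `save_fn` stands in for `torchaudio.save`:

```python
def save_outputs(generator, prefix, sample_rate, save_fn):
    """Save every chunk yielded by an inference call and return the paths.
    `save_fn(path, waveform, sample_rate)` matches torchaudio.save's signature."""
    paths = []
    for i, out in enumerate(generator):
        path = f"{prefix}_{i}.wav"
        save_fn(path, out['tts_speech'], sample_rate)
        paths.append(path)
    return paths

# e.g. save_outputs(cosyvoice.inference_zero_shot(...), 'zero_shot',
#                   cosyvoice.sample_rate, torchaudio.save)
```

Because all inference methods yield dictionaries with a `tts_speech` entry, the same helper covers zero-shot, fine-grained control, instruct, and hotfix usage.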
Discussion & communication
You can discuss directly on GitHub Issues.
You can also scan the QR code to join our official DingTalk group.
<img src="./asset/dingding.png" width="250px">
Acknowledgements
- We borrowed a lot of code from FunASR.
- We borrowed a lot of code from FunCodec.
- We borrowed a lot of code from Matcha-TTS.
- We borrowed a lot of code from AcademiCodec.
- We borrowed a lot of code from WeNet.
Citation
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice2,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
@inproceedings{lyu2025build,
title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--2},
year={2025},
organization={IEEE}
}
Disclaimer
The content above is provided for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes your rights, please contact us to request its removal.
Pragmaticl/fun-cosyvoice3-0.5b-2512-model
By Pragmaticl
Created: 2026-01-15 03:57:46+00:00
Updated: 2026-01-15 03:58:36+00:00