👉🏻 CosyVoice 👈🏻
Fun-CosyVoice 3.0: Demo; Paper; Modelscope; Huggingface; CV3 Eval Set
CosyVoice 2.0: Demo; Paper; Modelscope; HuggingFace
CosyVoice 1.0: Demo; Paper; Modelscope; HuggingFace
Highlights 🔥
Fun-CosyVoice 3.0 is an advanced LLM-based text-to-speech (TTS) system that surpasses its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
Key Features
- Language coverage: 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Cantonese, Hokkien, Sichuan, Northeastern, Shanxi, Shaanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), plus multilingual/cross-lingual zero-shot voice cloning.
- Content consistency & naturalness: state-of-the-art content consistency, speaker similarity, and prosody naturalness.
- Pronunciation hotfix: supports pronunciation correction via Chinese pinyin and English CMU phonemes, providing stronger controllability for production use.
- Text normalization: reads numbers, special symbols, and various text formats without a traditional frontend module.
- Bidirectional streaming: supports both streaming text input and streaming audio output, achieving latency as low as 150 ms while preserving high-quality audio output.
- Instruction support: supports instructions such as language, dialect, emotion, speaking rate, and volume.
Roadmap
- [x] 2025/12
  - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, the RL model, and their training/inference scripts
  - [x] Release the Fun-CosyVoice3-0.5B ModelScope Gradio space
- [x] 2025/08
  - [x] Add Triton TRT-LLM runtime support and CosyVoice2 GRPO training support, thanks to NVIDIA Yuekai Zhang's contribution
- [x] 2025/07
  - [x] Release the Fun-CosyVoice 3.0 evaluation set
- [x] 2025/05
  - [x] Add CosyVoice2-0.5B vLLM support
- [x] 2024/12
  - [x] Release 25 Hz CosyVoice2-0.5B
- [x] 2024/09
  - [x] 25 Hz CosyVoice-300M base model
  - [x] 25 Hz CosyVoice-300M voice conversion
- [x] 2024/08
  - [x] Repetition-aware sampling (RAS) inference for better LLM stability
  - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization
- [x] 2024/07
  - [x] Flow matching training support
  - [x] WeTextProcessing support when ttsfrd is unavailable
  - [x] FastAPI server and client
Evaluation
| Model | Open-source | Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
Installation
Clone and install
- Clone the repo
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodules fails due to network issues, run the following command until it succeeds
cd CosyVoice
git submodule update --init --recursive
- Install Conda: see https://docs.conda.io/en/latest/miniconda.html
- Create a Conda environment:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you run into sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
Model Download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
Note that this step is not required. If you do not install the ttsfrd package, wetext is used by default.
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
Basic Usage
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio
""" CosyVoice3 用法,查看 https://funaudiollm.github.io/cosyvoice3/ 了解更多详情
"""
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# English zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Chinese zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# fine-grained control; see cosyvoice/tokenizer/tokenizer.py#L280 for supported control tokens
for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# instruct usage; see cosyvoice/utils/common.py#L28 for supported instructions
for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# pronunciation hotfix usage
for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
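The calls above all pass stream=False. With stream=True, the same generators yield audio in chunks, which you stitch together yourself (or play as they arrive to benefit from the low first-packet latency). A minimal sketch of that consumption pattern, using a hypothetical stub generator in place of the real model; real chunks are torch tensors and would be joined with torch.cat along dim=1, plain Python lists stand in here:

```python
# Sketch of consuming a streaming TTS generator. `fake_stream` is a stand-in
# for cosyvoice.inference_zero_shot(..., stream=True); each yielded dict
# carries one 'tts_speech' chunk (plain lists here instead of torch tensors).
def fake_stream():
    for chunk in ([0.1, 0.2], [0.3], [0.4, 0.5]):
        yield {'tts_speech': chunk}

def collect_speech(stream):
    """Concatenate streamed speech chunks into one waveform.
    With real torch tensors, use torch.cat(chunks, dim=1) instead of extend."""
    waveform = []
    for item in stream:
        waveform.extend(item['tts_speech'])
    return waveform

print(collect_speech(fake_stream()))
```

With the real model you would either save the concatenated tensor once at the end with torchaudio.save, or feed each chunk to an audio sink as it arrives.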
Discussion & Communication
You can discuss directly on GitHub Issues.
You can also scan the QR code to join our official DingTalk group.
<img src="./asset/dingding.png" width="250px">
Acknowledgements
- We borrowed a lot of code from FunASR.
- We borrowed a lot of code from FunCodec.
- We borrowed a lot of code from Matcha-TTS.
- We borrowed a lot of code from AcademiCodec.
- We borrowed a lot of code from WeNet.
Citation
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice2,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
@inproceedings{lyu2025build,
title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--2},
year={2025},
organization={IEEE}
}
Disclaimer
The content above is provided for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes your rights, please contact us to request its removal.