👉🏻 CosyVoice 👈🏻
Fun-CosyVoice 3.0: Demo; Paper; Modelscope; Huggingface; CV3 Eval Set
CosyVoice 2.0: Demo; Paper; Modelscope; HuggingFace
CosyVoice 1.0: Demo; Paper; Modelscope; HuggingFace
Highlights 🔥
Fun-CosyVoice 3.0 is an advanced LLM-based text-to-speech (TTS) system that surpasses its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
Key Features
- Language coverage: 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Cantonese, Hokkien, Sichuan, Northeastern, Shanxi, Shaanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), plus multilingual/cross-lingual zero-shot voice cloning.
- Content consistency & naturalness: state-of-the-art content consistency, speaker similarity, and prosody naturalness.
- Pronunciation hotfix: supports pronunciation correction via Chinese pinyin and English CMU phonemes, providing stronger controllability for production use.
- Text normalization: reads numbers, special symbols, and various text formats without a traditional frontend module.
- Bidirectional streaming: supports both streaming text input and streaming audio output, achieving latency as low as 150 ms while preserving high-quality audio output.
- Instruction support: supports instructions such as language, dialect, emotion, speaking rate, and volume.
Roadmap
- [x] 2025/12
  - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, the RL model, and their training/inference scripts
  - [x] Release the Fun-CosyVoice3-0.5B ModelScope Gradio space
- [x] 2025/08
  - [x] Add Triton TRT-LLM runtime support and CosyVoice2 GRPO training support, thanks to NVIDIA Yuekai Zhang's contribution
- [x] 2025/07
  - [x] Release the Fun-CosyVoice 3.0 evaluation set
- [x] 2025/05
  - [x] Add CosyVoice2-0.5B vLLM support
- [x] 2024/12
  - [x] Release 25 Hz CosyVoice2-0.5B
- [x] 2024/09
  - [x] 25 Hz CosyVoice-300M base model
  - [x] 25 Hz CosyVoice-300M voice conversion
- [x] 2024/08
  - [x] Repetition-aware sampling (RAS) inference for better LLM stability
  - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization
- [x] 2024/07
  - [x] Flow matching training support
  - [x] WeTextProcessing support when ttsfrd is unavailable
  - [x] FastAPI server and client
Evaluation
| Model | Open-source | Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
Installation
Clone and install
- Clone the repo
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodules fails due to network issues, run the following command until it succeeds
cd CosyVoice
git submodule update --init --recursive
- Install Conda: see https://docs.conda.io/en/latest/miniconda.html
- Create a Conda environment:
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# If you run into sox compatibility issues
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
Model Download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
Note that this step is not required. If you do not install the ttsfrd package, wetext is used by default.
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
Basic Usage
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio
""" CosyVoice3 用法,查看 https://funaudiollm.github.io/cosyvoice3/ 了解更多详情
"""
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# English zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# Chinese zero-shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# fine-grained control; see cosyvoice/tokenizer/tokenizer.py#L280 for supported control tokens
for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# instruct usage; see cosyvoice/utils/common.py#L28 for supported instructions
for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# pronunciation hotfix usage
for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
'./asset/zero_shot_prompt.wav', stream=False)):
torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
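The calls above all pass stream=False. With stream=True, the same generators yield audio in chunks, which you stitch together yourself (or play as they arrive to benefit from the low first-packet latency). A minimal sketch of that consumption pattern, using a hypothetical stub generator in place of the real model; real chunks are torch tensors and would be joined with torch.cat along dim=1, plain Python lists stand in here:

```python
# Sketch of consuming a streaming TTS generator. `fake_stream` is a stand-in
# for cosyvoice.inference_zero_shot(..., stream=True); each yielded dict
# carries one 'tts_speech' chunk (plain lists here instead of torch tensors).
def fake_stream():
    for chunk in ([0.1, 0.2], [0.3], [0.4, 0.5]):
        yield {'tts_speech': chunk}

def collect_speech(stream):
    """Concatenate streamed speech chunks into one waveform.
    With real torch tensors, use torch.cat(chunks, dim=1) instead of extend."""
    waveform = []
    for item in stream:
        waveform.extend(item['tts_speech'])
    return waveform

print(collect_speech(fake_stream()))
```

With the real model you would either save the concatenated tensor once at the end with torchaudio.save, or feed each chunk to an audio sink as it arrives.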
Discussion & Communication
You can discuss directly on GitHub Issues.
You can also scan the QR code to join our official DingTalk group.
<img src="./asset/dingding.png" width="250px">
Acknowledgements
- We borrowed a lot of code from FunASR.
- We borrowed a lot of code from FunCodec.
- We borrowed a lot of code from Matcha-TTS.
- We borrowed a lot of code from AcademiCodec.
- We borrowed a lot of code from WeNet.
Citation
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice2,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
@inproceedings{lyu2025build,
title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--2},
year={2025},
organization={IEEE}
}
Disclaimer
The content above is provided for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes your rights, please contact us to request its removal.