<div> <p style="margin-bottom: 0; margin-top: 0;"> <strong>See <a href="https://huggingface.co/collections/unsloth/text-to-speech-tts-models-68007ab12522e96be1e02155">our collection</a> for all of our TTS model uploads.</strong> </p> <p style="margin-bottom: 0;"> <em>Learn to fine-tune TTS models - <a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning">read our guide</a>.</em> </p> <p style="margin-top: 0;margin-bottom: 0;"> <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy and outperforms other leading quantization methods.</em> </p> <div style="display: flex; gap: 5px; align-items: center; "> <a href="https://github.com/unslothai/unsloth/"> <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133"> </a> <a href="https://discord.gg/unsloth"> <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173"> </a> <a href="https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning"> <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143"> </a> </div> <h1 style="margin-top: 0rem;">✨ Run & fine-tune TTS models with Unsloth!</h1> </div>
- Fine-tune TTS models for free with our Google Colab notebooks!
- Read our blog about TTS support: unsloth.ai/blog/tts
| Unsloth supports | Free notebooks | Performance | Memory use |
|---|---|---|---|
| Llasa-3B | ▶️ Start on Colab | 1.5x faster | 58% less |
| Whisper Large V3 | ▶️ Start on Colab | 1.5x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 70% less |
| Llama 3.2 Vision (11B) | ▶️ Start on Colab | 1.8x faster | 50% less |
Update (2025-05-10): I sometimes find that top_p = 0.95 and temperature = 0.9 produce more stable results.
Update (2025-02-13): Added Llasa fine-tuning instructions.
Update (2025-02-07): Our paper has been released!
LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis
Model Information
Our model, Llasa, is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset comprising 250,000 hours of Chinese-English speech data. The model can generate speech either solely from input text or by utilizing a given speech prompt.
The method is seamlessly compatible with the Llama framework, making TTS training similar to LLM training (convert the audio into single-codebook tokens and simply treat them as a special language). This opens the possibility of applying existing compression, acceleration, and fine-tuning methods for LLMs to this system.
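As a minimal sketch of this "speech as a special language" framing (helper names here are illustrative, not part of the model's API), each codec id can be rendered as a vocabulary token of the form `<|s_N|>`, so an audio clip becomes an ordinary token sequence that a language model can predict:

```python
# Sketch: codec ids become special vocabulary tokens, so audio is
# modeled exactly like text. Helper names are illustrative.
def ids_to_tokens(ids):
    """Render each codebook id N as the token string <|s_N|>."""
    return [f"<|s_{i}|>" for i in ids]

def tokens_to_ids(tokens):
    """Invert the mapping: <|s_23456|> -> 23456."""
    return [int(t[4:-2]) for t in tokens
            if t.startswith("<|s_") and t.endswith("|>")]

codes = [23456, 7, 65535]        # ids from a 65,536-entry codebook
tokens = ids_to_tokens(codes)    # a plain token sequence
assert tokens_to_ids(tokens) == codes
```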
How to Use
Install XCodec2.
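Assuming a CUDA-capable Python environment, the codec is distributed on PyPI:

```shell
pip install xcodec2
```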
1. Speech synthesis solely from input text
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'

def ids_to_speech_tokens(speech_ids):
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS starts here!
with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,          # Adjusts the diversity of the generated content
        temperature=0.8,  # Controls the randomness of the output
    )
    # Extract the speech tokens
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to integer 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
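The slice `outputs[0][input_ids.shape[1]:-1]` drops the prompt tokens at the front and the trailing `<|SPEECH_GENERATION_END|>` token, keeping only the newly generated speech tokens. The same indexing can be illustrated with plain Python lists standing in for tensors (the values are made up):

```python
# Plain-list stand-in for the tensor slicing used after generate().
prompt_len = 3                      # plays the role of input_ids.shape[1]
outputs = ["p1", "p2", "p3",        # prompt tokens (text + control tokens)
           "<|s_10|>", "<|s_20|>",  # newly generated speech tokens
           "<|SPEECH_GENERATION_END|>"]

# Skip the prompt at the front, drop the end-of-speech token at the back:
generated = outputs[prompt_len:-1]
assert generated == ["<|s_10|>", "<|s_20|>"]
```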
2. Speech synthesis utilizing a given speech prompt
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

# Only 16kHz speech is supported!
prompt_wav, sr = sf.read("太乙真人.wav")  # you can find this wav in the repo files
# prompt_wav, sr = sf.read("Anna.wav")  # English prompt

prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
# prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."

target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
# target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."

input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS starts here!
with torch.no_grad():
    # Encode the prompt wav
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt Vq Code Shape:", vq_code_prompt.shape)

    vq_code_prompt = vq_code_prompt[0, 0, :]
    # Convert integer 12345 to token <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )
    # Extract the speech tokens
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to integer 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

    # If you only need the generated part:
    # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
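In the prompted variant the slice starts at `input_ids.shape[1] - len(speech_ids_prefix)`, so the prompt's own speech tokens are kept alongside the continuation; the decoded waveform then contains the prompt audio followed by the generated audio, which is why the commented-out trim exists. A list-based sketch of that start index (illustrative values, lists standing in for tensors):

```python
# Why the slice is shifted back by the speech-prefix length.
text_len = 4                          # text + control tokens in the prompt
prefix = ["<|s_1|>", "<|s_2|>"]       # speech tokens encoded from the prompt wav
prompt_len = text_len + len(prefix)   # plays the role of input_ids.shape[1]

outputs = (["t"] * text_len + prefix
           + ["<|s_3|>", "<|SPEECH_GENERATION_END|>"])

# Start len(prefix) tokens before the end of the prompt, drop the EOS token:
kept = outputs[prompt_len - len(prefix):-1]
assert kept == ["<|s_1|>", "<|s_2|>", "<|s_3|>"]  # prefix + continuation
```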
Disclaimer
This model is licensed under CC BY-NC 4.0, which prohibits free commercial use because of ethics and privacy concerns; detected violations will result in legal liability.
This codebase is strictly prohibited from being used for any illegal purposes in any country or region. Please refer to your local laws about the DMCA and other related laws.
Prince-1/Llasa-3B
Author: Prince-1
Created: 2025-07-04 08:05:05+00:00
Updated: 2025-07-04 08:13:32+00:00
View on Hugging Face