ONNX Model Collection

Documentation

Matcha-TTS CommonVoice EN001

You can test the variant models | GitHub demo

Source audio

https://commonvoice.mozilla.org/en/datasets Common Voice Corpus 1

I named the audio files 42da7f26 (head audio ID)_290 (file count) EN001 (I do not plan to include the audio in this repository).

What are the advantages?

LJSpeech has better quality, but it is a female voice; this one is male.

VCTK has 109 voices of similar quality, but it is under the ODC-By license.

This audio is MIT-licensed, which makes it easier to continue training or do other things with it.

That said, I recommend you use VCTK; the ODC-By license is not much of a problem. I will use this corpus to create new voices in the future.

How to train

Train with IPA text (this branch): https://github.com/akjava/Matcha-TTS-Japanese

See the config files in this repository. There is no audio-copying tool yet; I will make one later.

File information

checkpoints

Matcha-TTS checkpoints - the epoch numbers look large, but training used only 290 audio files.

Unfortunately I lost the checkpoints between 3599 and 4499. Sorry about that.

Judging from the training metrics, 6399 seems overfitted, though my English listening skills are not good enough to evaluate it.

ONNX

GitHub code - see the source code; GitHub Pages - try the ONNX demo

After ONNX simplification, the model loads about 1.5x faster.

# pip install onnx onnxsim
from onnxsim import simplify
import onnx

# load the exported model, simplify the graph, and save the result
model = onnx.load("en001_6399_T2.onnx")
model_simp, check = simplify(model)
assert check, "simplified model could not be validated"

onnx.save(model_simp, "en001_6399_T2_simplify.onnx")

timesteps is the default value (5); smaller timesteps make inference slightly faster but lower the quality.

If you need the original ONNX, export it the official way:

python -m matcha.onnx.export checkpoint_epoch=5699.ckpt en001_5699t2.onnx  --vocoder-name hifigan_T2_v1 --n-timesteps 5 --vocoder-checkpoint generator_v1
python -m matcha.onnx.export checkpoint_epoch=5699.ckpt en001_5699.onnx  --vocoder-name hifigan_univ_v1 --n-timesteps 5 --vocoder-checkpoint g_02500000
  • T2 means the vocoder is hifigan_T2_v1
  • univ means the vocoder is hifigan_univ_v1

You can quantize this ONNX, but the file becomes about 3x smaller while inference gets 4-5x slower, which is why I did not include it.

from onnxruntime.quantization import quantize_dynamic, QuantType
quantized_model = quantize_dynamic(src_model_path, dst_model_path, weight_type=QuantType.QUInt8)

Using the ONNX model requires a few extra pieces; below is an older example.

const _pad = "_";
const _punctuation = ";:,.!?¡¿—…\"«»\"\" ";
const _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ";

// the code below uses the spread syntax
const Symbols = [_pad, ..._punctuation, ..._letters, ..._letters_ipa];

const SpaceId = Symbols.indexOf(' ');

const symbolToId = {};
const idToSymbol = {};

// initialize symbolToId and idToSymbol
for (let i = 0; i < Symbols.length; i++) {
    symbolToId[Symbols[i]] = i;
    idToSymbol[i] = Symbols[i];
}

class MatchaOnnx {
    constructor() {
    }
    async load_model(model_path,options={}){
        this.session = await ort.InferenceSession.create(model_path,options);
    }

    get_output_names_html(){
        if (typeof this.session=='undefined'){
            return null
        }
        let outputNamesString = '[outputs]<br>';
        const outputNames = this.session.outputNames;
        for (let outputName of outputNames) {
            console.log(outputName)
            outputNamesString+=outputName+"<br>"
        }
        return outputNamesString.trim()
    }

    get_input_names_html(){
        if (typeof this.session=='undefined'){
            return null
        }
        
        let inputNamesString = '[Inputs]<br>';
        const inputNames = this.session.inputNames;

        for (let inputName of inputNames) {
            console.log(inputName)
            inputNamesString+=inputName+"<br>"
        }
        return inputNamesString.trim()
    }


    processText(text) {
        const x = this.intersperse(this.textToSequence(text));
        const x_phones = this.sequenceToText(x);
        const textList = [];
        for (let i = 1; i < x_phones.length; i += 2) {
            textList.push(x_phones[i]);
        }

        return {
            x: x,
            x_length: x.length,
            x_phones: x_phones,
            x_phones_label: textList.join(""),
        };
    }

    basicCleaners2(text, lowercase = false) {
        if (lowercase) {
            text = text.toLowerCase();
        }
        text = text.replace(/\s+/g, " ");
        return text;
    }

    textToSequence(text) {
        const sequenceList = [];
        const clean_text = this.basicCleaners2(text);
        for (let i = 0; i < clean_text.length; i++) {
            const symbol = clean_text[i];
            sequenceList.push(symbolToId[symbol]);
        }
        return sequenceList;
    }

    intersperse(sequence, item = 0) {
        const sequenceList = [item];
        for (let i = 0; i < sequence.length; i++) {
            sequenceList.push(sequence[i]);
            sequenceList.push(item);
        }
        return sequenceList;
    }

    sequenceToText(sequence) {
        const textList = [];
        for (let i = 0; i < sequence.length; i++) {
            const symbol = idToSymbol[sequence[i]];
            textList.push(symbol);
        }
        return textList.join("");
    }

    async infer(text, temperature, speed) {
        console.log(this.session)
        const dic = this.processText(text);
        console.log(`x:${dic.x.join(", ")}`);
        console.log(`x_length:${dic.x_length}`);
        console.log(`x_phones_label:${dic.x_phones_label}`);

        // prepare the input tensors; the model expects int64 ids,
        // so convert each number to a BigInt
        const dataX = new BigInt64Array(dic.x.length)
        for (let i = 0; i < dic.x.length; i++) {
            dataX[i] = BigInt(dic.x[i]);
        }
        const data_x_length = new BigInt64Array(1)
        data_x_length[0] = BigInt(dic.x_length)

        const tensorX = new ort.Tensor('int64', dataX, [1, dic.x.length]);
        const tensor_x_length = new ort.Tensor('int64', data_x_length, [1]);
        const data_scale = Float32Array.from([temperature, speed])
        const tensor_scale = new ort.Tensor('float32', data_scale, [2]);

        // run inference
        const output = await this.session.run({
            x: tensorX,
            x_lengths: tensor_x_length,
            scales: tensor_scale,
        });
        console.log(output)

        // extract the outputs
        const wav_array = output.wav.data;
        console.log(wav_array[0]);
        console.log(wav_array.length);

        const x_lengths_array = output.wav_lengths.data;
        console.log(x_lengths_array.join(", "));

        return wav_array;
    }


}
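The text-processing part of the class above (clean, map to ids, intersperse with the blank id 0) is pure logic, so it can be sketched in Python. This is a hedged mirror of `processText`; the function names and the tiny stand-in vocabulary are my own, not part of the repository:

```python
import re

# Tiny stand-in vocabulary for the sketch; the real table is the full
# symbol list defined earlier in this README.
symbols = ["_", " ", "a", "b", "c"]
symbol_to_id = {s: i for i, s in enumerate(symbols)}
id_to_symbol = {i: s for i, s in enumerate(symbols)}

def basic_cleaners(text, lowercase=False):
    if lowercase:
        text = text.lower()
    return re.sub(r"\s+", " ", text)  # collapse whitespace runs

def text_to_sequence(text):
    return [symbol_to_id[ch] for ch in basic_cleaners(text)]

def intersperse(sequence, item=0):
    # [a, b] -> [item, a, item, b, item]; Matcha pads with the blank id 0
    result = [item]
    for value in sequence:
        result.extend([value, item])
    return result

def process_text(text):
    x = intersperse(text_to_sequence(text))
    x_phones = "".join(id_to_symbol[i] for i in x)
    return {"x": x, "x_length": len(x), "x_phones": x_phones,
            "x_phones_label": x_phones[1::2]}  # every other char: the real symbols

d = process_text("ab c")
```

Note that `x_phones_label` simply recovers the input symbols by skipping the interleaved pads, which is what the JavaScript loop over odd indices does.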

Convert to wav



function webWavPlay(f32array){
    const blob = float32ArrayToWav(f32array)
    const url = createObjectUrlFromBlob(blob)
    console.log(url)
    playAudioFromUrl(url)
}

function createObjectUrlFromBlob(blob) {
    const url = URL.createObjectURL(blob);
    return url;
}

function playAudioFromUrl(url) {
    const audio = new Audio(url);
    audio.play().catch(error => console.error('Failed to play audio:', error));
}

    
// copied from
// https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-de/blob/main/app-tts.js
// this function was copied/modified from
// https://gist.github.com/meziantou/edb7217fddfbb70e899e
function float32ArrayToWav(floatSamples, sampleRate=22050) {
        let samples = new Int16Array(floatSamples.length);
        for (let i = 0; i < samples.length; ++i) {
          let s = floatSamples[i];
          if (s >= 1)
            s = 1;
          else if (s <= -1)
            s = -1;
      
          samples[i] = s * 32767;
        }
      
        let buf = new ArrayBuffer(44 + samples.length * 2);
        var view = new DataView(buf);
      
        // http://soundfile.sapp.org/doc/WaveFormat/
        //                   F F I R
        view.setUint32(0, 0x46464952, true);               // chunkID
        view.setUint32(4, 36 + samples.length * 2, true);  // chunkSize
        //                   E V A W
        view.setUint32(8, 0x45564157, true);  // format
                                              //
        //                      t m f
        view.setUint32(12, 0x20746d66, true);          // subchunk1ID
        view.setUint32(16, 16, true);                  // subchunk1Size, 16 for PCM
        view.setUint32(20, 1, true);                   // audioFormat, 1 for PCM
        view.setUint16(22, 1, true);                   // numChannels: 1 channel
        view.setUint32(24, sampleRate, true);          // sampleRate
        view.setUint32(28, sampleRate * 2, true);      // byteRate
        view.setUint16(32, 2, true);                   // blockAlign
        view.setUint16(34, 16, true);                  // bitsPerSample
        view.setUint32(36, 0x61746164, true);          // Subchunk2ID
        view.setUint32(40, samples.length * 2, true);  // subchunk2Size
      
        let offset = 44;
        for (let i = 0; i < samples.length; ++i) {
          view.setInt16(offset, samples[i], true);
          offset += 2;
        }
      
        return new Blob([view], {type: 'audio/wav'});
      }
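The same float-to-WAV conversion can be done in Python with the stdlib `struct` module. This is a sketch under the same assumptions as the JavaScript above (mono 16-bit PCM, 22050 Hz, 44-byte RIFF header); the function name is my own:

```python
import struct

def float32_to_wav_bytes(samples, sample_rate=22050):
    """Clip float samples to [-1, 1], scale to int16, and prepend a 44-byte PCM header."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    data = struct.pack("<%dh" % len(ints), *ints)
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(data), b"WAVE",  # RIFF chunk
        b"fmt ", 16, 1, 1,                 # fmt chunk: PCM, mono
        sample_rate, sample_rate * 2,      # sample rate, byte rate (rate * block align)
        2, 16,                             # block align, bits per sample
        b"data", len(data),                # data chunk
    )
    return header + data

wav = float32_to_wav_bytes([0.0, 0.5, -2.0])
```

As in the JavaScript version, out-of-range samples are clipped before scaling, so the `-2.0` above becomes the minimum int16 value rather than wrapping around.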

Audio

I cut the audio with a VAD tool and denoised it with resemble-enhance.

Akjava/matcha_tts_common_voice_01_en_001

Author: Akjava

text-to-speech

Created: 2024-08-25 10:56:19+00:00

Updated: 2024-09-19 11:17:39+00:00

View on Hugging Face

Files (41)

.gitattributes
README.md
checkpoint_epoch=6399.ckpt
checkpoints/checkpoint_epoch=2599.ckpt
checkpoints/checkpoint_epoch=2699.ckpt
checkpoints/checkpoint_epoch=2799.ckpt
checkpoints/checkpoint_epoch=2899.ckpt
checkpoints/checkpoint_epoch=2999.ckpt
checkpoints/checkpoint_epoch=3099.ckpt
checkpoints/checkpoint_epoch=3199.ckpt
checkpoints/checkpoint_epoch=3299.ckpt
checkpoints/checkpoint_epoch=3399.ckpt
checkpoints/checkpoint_epoch=3499.ckpt
checkpoints/checkpoint_epoch=3599.ckpt
checkpoints/checkpoint_epoch=4499.ckpt
checkpoints/checkpoint_epoch=4599.ckpt
checkpoints/checkpoint_epoch=4699.ckpt
checkpoints/checkpoint_epoch=4799.ckpt
checkpoints/checkpoint_epoch=4899.ckpt
checkpoints/checkpoint_epoch=4999.ckpt
checkpoints/checkpoint_epoch=5099.ckpt
checkpoints/checkpoint_epoch=5199.ckpt
checkpoints/checkpoint_epoch=5299.ckpt
checkpoints/checkpoint_epoch=5399.ckpt
checkpoints/checkpoint_epoch=5499.ckpt
checkpoints/checkpoint_epoch=5599.ckpt
checkpoints/checkpoint_epoch=5699.ckpt
checkpoints/checkpoint_epoch=5799.ckpt
checkpoints/checkpoint_epoch=5899.ckpt
checkpoints/checkpoint_epoch=5999.ckpt
checkpoints/checkpoint_epoch=6099.ckpt
checkpoints/checkpoint_epoch=6199.ckpt
checkpoints/checkpoint_epoch=6299.ckpt
checkpoints/checkpoint_epoch=6399.ckpt
checkpoints/last.ckpt
config/data/en001.yaml
config/experiment/en001.yaml
en001_ep6399_T2_simplify.onnx ONNX
en001_ep6399_univ_simplify.onnx ONNX
tensorboard/version_0/events.out.tfevents.1724550248.n9s76o0u98.39361.0
tensorboard/version_0/events.out.tfevents.1724558876.n01l8nvxtt.1397.0