
IndicBERTv2-MLM-only (ONNX)

This is the ONNX version of ai4bharat/IndicBERTv2-MLM-only. It was automatically converted and uploaded via this Hugging Face Space.

Usage (Transformers.js)

See the pipeline documentation for fill-mask: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FillMaskPipeline
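For reference, a minimal sketch of loading this checkpoint through the Transformers.js `pipeline` API (the example sentence and the `[MASK]` token are illustrative assumptions; check tokenizer_config.json for the actual mask token):

```javascript
import { pipeline } from '@huggingface/transformers';

// Load a fill-mask pipeline backed by this ONNX checkpoint.
const unmasker = await pipeline('fill-mask', 'onnx-community/IndicBERTv2-MLM-only-ONNX');

// Predict candidates for the masked position.
const output = await unmasker('The capital of India is [MASK].');
console.log(output); // array of { score, token, token_str, sequence } candidates
```

The first call downloads the model files; subsequent calls use the local cache.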


IndicBERT

A multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278M parameters and supports 23 Indic languages as well as English. The models were trained with various objectives and datasets. The list of models is as follows:

  • IndicBERT-MLM [Model] - a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
    • +Samanantar [Model] - TLM as an additional objective, using the Samanantar parallel corpus [Paper] | [Dataset]
    • +Back-Translation [Model] - TLM as an additional objective, with the Indic-language portion of IndicCorp v2 translated into English using the IndicTrans model [Model]
  • IndicBERT-SS [Model] - to encourage better lexical sharing among languages, we convert the scripts of the Indic languages to Devanagari and train a BERT-style model with the MLM objective
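As a reminder of what the MLM objective above does, here is a minimal sketch of BERT-style token corruption. The token IDs, mask ID, and the 15% / 80-10-10 split are the standard BERT recipe, not values taken from this repository; the real pipeline uses the model's tokenizer:

```javascript
// Sketch of BERT-style MLM corruption (hypothetical token IDs).
// Select ~15% of positions; of those: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
function maskTokens(ids, maskId, vocabSize, rng = Math.random) {
  const labels = ids.map(() => -100); // -100 = position ignored by the loss
  const corrupted = ids.slice();
  for (let i = 0; i < ids.length; i++) {
    if (rng() < 0.15) {
      labels[i] = ids[i]; // the model must predict the original token here
      const r = rng();
      if (r < 0.8) corrupted[i] = maskId;
      else if (r < 0.9) corrupted[i] = Math.floor(rng() * vocabSize);
      // else: keep the original token (but it still contributes to the loss)
    }
  }
  return { corrupted, labels };
}
```

The model is then trained to recover `labels` from `corrupted`, which is what "MLM objective" refers to throughout the list above.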

Running fine-tuning

The fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:

conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt

All tasks follow the same structure; please check the individual files for detailed hyperparameter choices. The following command runs fine-tuning for a task:

python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train

Arguments:

  • MODEL_NAME: the name of the model to fine-tune; can be a local path or a model from the HuggingFace Model Hub
  • TASK_NAME: one of [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]
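For example, a concrete invocation for the sentiment task might look like this (the checkpoint name is illustrative; any Hub model or local path works):

```shell
# Assumes you are in the root of the cloned IndicBERT repository.
export MODEL_NAME=ai4bharat/IndicBERTv2-MLM-only  # or a local path
export TASK_NAME=sentiment                        # one of: ner, paraphrase, qa, sentiment, xcopa, xnli, flores

python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```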

For the MASSIVE task, please use the instructions provided in the official repository.

Citation

@inproceedings{doddapaneni-etal-2023-towards,
    title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
    author = "Doddapaneni, Sumanth  and
      Aralikatte, Rahul  and
      Ramesh, Gowtham  and
      Goyal, Shreya  and
      Khapra, Mitesh M.  and
      Kunchukuttan, Anoop  and
      Kumar, Pratyush",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.693",
    doi = "10.18653/v1/2023.acl-long.693",
    pages = "12402--12426",
    abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}

onnx-community/IndicBERTv2-MLM-only-ONNX

Author: onnx-community

Tags: fill-mask, transformers.js

Created: 2025-11-23 14:42:03+00:00

Updated: 2025-11-23 14:42:34+00:00


Files (15)

.gitattributes
README.md
config.json
onnx/model.onnx
onnx/model_bnb4.onnx
onnx/model_fp16.onnx
onnx/model_int8.onnx
onnx/model_q4.onnx
onnx/model_q4f16.onnx
onnx/model_quantized.onnx
onnx/model_uint8.onnx
quantize_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json