README

Multilingual mDeBERTa-v3-base-mnli-xnli

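Model description

This multilingual model can perform natural language inference (NLI) on 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying model was pre-trained by Microsoft on the CC100 multilingual dataset. It was then fine-tuned on the XNLI dataset, which contains hypothesis-premise pairs from 15 languages, and on the English MNLI dataset. As of December 2021, mDeBERTa-base is the best-performing multilingual base-sized transformer model; it was introduced by Microsoft in the DeBERTa-V3 paper.

If you are looking for a smaller, faster (but less performant) model, you can try multilingual-MiniLMv2-L6-mnli-xnli.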

How to use the model

Simple zero-shot classification pipeline

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
print(output)
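
The zero-shot pipeline treats classification as NLI: the input text is the premise, and each candidate label is turned into a hypothesis. The sketch below illustrates that pairing, and how per-label entailment logits become scores when multi_label=False. The template string is the Transformers default ("This example is {}."). The helper names and the logit values are hypothetical and serve only as illustration; no model is loaded.

```python
import math

# Default template of the Transformers zero-shot pipeline; it can be
# overridden by passing hypothesis_template=... to the pipeline call.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(sequence, candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(sequence, template.format(label)) for label in candidate_labels]


def single_label_scores(entailment_logits):
    """Softmax over the per-label entailment logits (multi_label=False case)."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


sequence = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
labels = ["politics", "economy", "entertainment", "environment"]

pairs = build_nli_pairs(sequence, labels)
for premise, hypothesis in pairs:
    print(hypothesis)

# Made-up entailment logits, one per label, purely to show the normalization.
scores = single_label_scores([3.0, 0.5, -1.0, -0.5])
print(dict(zip(labels, (round(s, 3) for s in scores))))
```

Because the default template is English, it may be worth experimenting with a template written in the language of the input text.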

NLI use case

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # the model must live on the same device as its inputs

premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
hypothesis = "Emmanuel Macron is the President of France"

# Tokenize the pair and move all tensors (input IDs and attention mask) to the device
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
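
The final lines of the snippet above turn raw logits into percentages per label. A dependency-free sketch of that post-processing step follows. The logit values are invented for illustration, and the label order is the one used in the snippet above.

```python
import math

LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def logits_to_percentages(logits, label_names=LABEL_NAMES):
    """Apply a numerically stable softmax and map each probability to its label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: round(e / total * 100, 1) for name, e in zip(label_names, exps)}


# Hypothetical logits for an unrelated premise/hypothesis pair.
example = logits_to_percentages([-2.0, 3.0, 0.5])
print(example)

# The predicted class is the label with the highest percentage.
top_label = max(example, key=example.get)
print(top_label)
```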

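Training data

This model was trained on the XNLI development dataset and the MNLI training dataset. The XNLI development set consists of 2490 texts professionally translated from English into 14 other languages, giving 37350 texts in total when the English originals are counted (see the XNLI paper). Note that XNLI also contains a training set, which consists of machine-translated versions of the MNLI dataset for its 15 languages. Because of quality issues with these machine translations, this model was trained only on the professional translations from the XNLI development set and the original English MNLI training set (392702 texts). Not using machine-translated texts avoids overfitting the model to the 15 languages, avoids catastrophic forgetting of the other 85 languages mDeBERTa was pre-trained on, and significantly reduces training costs.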
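
The dataset sizes quoted above can be checked with a few lines of arithmetic. The combined total and the non-English share are not stated in the text; they are derived here from the quoted numbers.

```python
XNLI_DEV_TEXTS_PER_LANGUAGE = 2490
XNLI_LANGUAGES = 15          # English plus 14 professional translations
MNLI_TRAIN_TEXTS = 392702    # original English MNLI training set

# Professionally translated development texts across all 15 languages.
xnli_dev_total = XNLI_DEV_TEXTS_PER_LANGUAGE * XNLI_LANGUAGES

# Derived: everything the model was fine-tuned on.
combined_total = xnli_dev_total + MNLI_TRAIN_TEXTS

# Derived: fraction of fine-tuning texts that are not English.
non_english_texts = XNLI_DEV_TEXTS_PER_LANGUAGE * (XNLI_LANGUAGES - 1)
non_english_share = non_english_texts / combined_total

print(xnli_dev_total)    # 37350
print(combined_total)    # 430052
print(f"{non_english_share:.1%}")
```

Only a small fraction of the fine-tuning data is non-English, so performance on the other languages relies largely on cross-lingual transfer.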

Training procedure

mDeBERTa-v3-base-mnli-xnli was trained using the Hugging Face Trainer with the following hyperparameters.

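from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=2,              # total number of training epochs
    learning_rate=2e-05,
    per_device_train_batch_size=16,   # batch size per device during training
    per_device_eval_batch_size=16,    # batch size per device during evaluation
    warmup_ratio=0.1,                # fraction of total training steps used for learning-rate warmup
    weight_decay=0.06,               # strength of weight decay
)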
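
To give a feel for what these hyperparameters imply, the sketch below estimates the number of optimizer steps and warmup steps. It assumes a single device and no gradient accumulation, neither of which is stated in the text, so the results are rough estimates and not the author's actual training schedule.

```python
import math

TRAIN_TEXTS = 37350 + 392702   # XNLI dev (15 languages) + MNLI train, from the training data section
BATCH_SIZE = 16                # per_device_train_batch_size
EPOCHS = 2                     # num_train_epochs
WARMUP_RATIO = 0.1             # warmup_ratio
NUM_DEVICES = 1                # assumption: single device
GRAD_ACCUM_STEPS = 1           # assumption: no gradient accumulation

effective_batch = BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(TRAIN_TEXTS / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# With warmup_ratio, the learning rate ramps up over this many initial steps.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(steps_per_epoch, total_steps, warmup_steps)
```

With more devices or gradient accumulation, the effective batch grows and the step counts shrink proportionally.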

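Evaluation results

The model was evaluated on the XNLI test set in 15 languages (5010 texts per language, 75150 in total). Note that multilingual NLI models can classify NLI texts without having received NLI training data in the specific language (cross-lingual transfer). This means the model can also perform NLI on the other 85 languages mDeBERTa was pre-trained on, but performance is most likely lower than for the languages available in XNLI.

Also note that if other multilingual models on the model hub claim performance of around 90% on languages other than English, the authors have most likely made a mistake during testing, since none of the latest papers reports a multilingual average on XNLI more than a few points above 80% (see here or here).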

average	ar	bg	de	el	en	es	fr	hi	ru	sw	th	tr	ur	vi	zh
0.808	0.802	0.829	0.825	0.826	0.883	0.845	0.834	0.771	0.813	0.748	0.793	0.807	0.740	0.795	0.8116
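
As a quick consistency check, the per-language scores in the table can be averaged and compared with the reported average. The sketch below copies the values from the table; the variable names are illustrative.

```python
# Per-language XNLI test scores, copied from the table above.
xnli_scores = {
    "ar": 0.802, "bg": 0.829, "de": 0.825, "el": 0.826, "en": 0.883,
    "es": 0.845, "fr": 0.834, "hi": 0.771, "ru": 0.813, "sw": 0.748,
    "th": 0.793, "tr": 0.807, "ur": 0.740, "vi": 0.795, "zh": 0.8116,
}

# Unweighted mean over the 15 languages.
macro_average = sum(xnli_scores.values()) / len(xnli_scores)

best_language = max(xnli_scores, key=xnli_scores.get)
worst_language = min(xnli_scores, key=xnli_scores.get)

print(round(macro_average, 3))
print(best_language, worst_language)
```

The unweighted mean rounds to the reported 0.808. English scores highest and Urdu lowest, a gap of about 14 points.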

Limitations and bias

Please consult the original DeBERTa-V3 paper and the literature on different NLI datasets for potential biases.

Citation

If you use this model, please cite: Laurer, Moritz, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. 2022. 'Less Annotating, More Classifying – Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT - NLI'. Preprint, June. Open Science Framework. https://osf.io/74b8k.

Ideas for cooperation or questions?

If you have questions or ideas for cooperation, contact me at m{dot}laurer{at}vu{dot}nl or on LinkedIn.

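Debugging and issues

Note that DeBERTa-v3 was released in late 2021, and older versions of HF Transformers seem to have issues running the model (e.g. tokenizer problems). Using Transformers 4.13 or later might solve some issues. Note that mDeBERTa currently does not support FP16, see here: https://github.com/microsoft/DeBERTa/issues/77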
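
A small check along the following lines can flag an outdated installation before the model is loaded. This is a minimal sketch: the helper names are hypothetical, and the parser only handles ordinary version strings such as "4.13.0" or "4.36.0.dev0".

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM_TRANSFORMERS = (4, 13)


def parse_major_minor(version_string):
    """Extract (major, minor) from strings such as '4.13.0' or '4.36.0.dev0'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def meets_minimum(version_string, minimum=MINIMUM_TRANSFORMERS):
    """Compare as integer tuples, so that 4.9 correctly sorts below 4.13."""
    return parse_major_minor(version_string) >= minimum


try:
    installed = version("transformers")
    status = "OK" if meets_minimum(installed) else "older than 4.13, consider upgrading"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```

Given the FP16 note above, leaving half precision disabled (fp16=False is the default in TrainingArguments) is the cautious choice when fine-tuning.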

MoritzLaurer/mDeBERTa-v3-base-mnli-xnli

Author: MoritzLaurer


Created: 2022-03-02 23:29:04+00:00

Updated: 2024-01-08 12:37:16+00:00
