
language:
- multilingual
- en
- ar
- bg
- de
- el
- es
- fr
- hi
- ru
- sw
- th
- tr
- ur
- vi
- zh
license: mit
tags:
- zero-shot-classification
- text-classification
- nli
- pytorch
metrics:
- accuracy
datasets:
- multi_nli
- xnli
pipeline_tag: zero-shot-classification
widget:
- text: "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
  candidate_labels: "politics, economy, entertainment, environment"

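Multilingual MiniLMv2-L6-mnli-xnli

Model description

This multilingual model can perform natural language inference (NLI) on 100+ languages and is therefore also suitable for multilingual zero-shot classification. The underlying multilingual MiniLM-L6 model was created by Microsoft and was distilled from XLM-RoBERTa-large (see the original paper and the newer information in this repo for details). The model was then fine-tuned on the XNLI dataset, which contains hypothesis-premise pairs from 15 languages, as well as on the English MNLI dataset.

The main advantage of distilled models is that they are smaller than their teachers (faster inference, lower memory requirements). The disadvantage is that they lose some of the performance of their larger teachers.

For the highest inference speed, I recommend this 6-layer model. For higher performance, I recommend mDeBERTa-v3-base-mnli-xnli (as of 14.02.2023).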

How to use the model

Simple zero-shot classification pipeline

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli")

sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
print(output)
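
To make the link between NLI and zero-shot classification more concrete, here is a minimal sketch of how per-label NLI logits can be turned into label scores. It assumes the default hypothesis template of the transformers zero-shot pipeline ("This example is {}.") and the label order entailment, neutral, contradiction. The helper names are hypothetical and the logits are made-up numbers for illustration, not outputs of this model.

```python
import math

# Assumed label order of the NLI head: entailment, neutral, contradiction.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into one NLI hypothesis."""
    return [template.format(label) for label in labels]


def zero_shot_scores(nli_logits, multi_label=False):
    """Convert one row of NLI logits per label into label scores.

    multi_label=False: softmax over the entailment logits of all labels,
    so the scores sum to 1.
    multi_label=True: for each label independently, softmax over
    (contradiction, entailment) and keep the entailment probability.
    """
    if multi_label:
        return [softmax([row[CONTRADICTION], row[ENTAILMENT]])[1] for row in nli_logits]
    return softmax([row[ENTAILMENT] for row in nli_logits])


candidate_labels = ["politics", "economy", "entertainment", "environment"]
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical logits, one [entailment, neutral, contradiction] row per hypothesis.
example_logits = [
    [3.1, 0.2, -2.5],
    [-0.4, 0.9, 0.3],
    [-2.0, 0.5, 2.2],
    [-1.5, 0.7, 1.4],
]

single = dict(zip(candidate_labels, zero_shot_scores(example_logits)))
multi = dict(zip(candidate_labels, zero_shot_scores(example_logits, multi_label=True)))
print(hypotheses[0])
print(max(single, key=single.get))
```

With multi_label=False the labels compete with each other, which suits cases where exactly one label applies. With multi_label=True each label is scored on its own, so several labels can score high at once.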

NLI use-case

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
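# Move the model to the same device as the inputs and switch to inference mode
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device).eval()

premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
hypothesis = "Emmanuel Macron is the President of France"

# Tokenize the (premise, hypothesis) pair; .to(device) moves every tensor, including the attention mask
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # gradients are not needed for inference
    output = model(**inputs)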
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)

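Training data

This model was trained on the XNLI development dataset and the MNLI train dataset. The XNLI development set consists of 2490 texts professionally translated from English into 14 other languages (37350 texts in total) (see this paper). Note that XNLI also contains a training set of 15 machine-translated versions of the MNLI dataset for 15 languages, but because of quality issues with these machine translations, this model was trained only on the professional translations from the XNLI development set and the original English MNLI training set (392702 texts). Not using machine-translated texts can avoid overfitting the model to the 15 languages, avoids catastrophic forgetting of the other languages it was pre-trained on, and significantly reduces training costs.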
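
The dataset sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the two sources are simply concatenated, with no filtering or deduplication, which the source does not state.

```python
# Figures taken from the paragraph above.
xnli_dev_texts_per_language = 2490
xnli_languages = 15  # English plus 14 professionally translated languages
mnli_train_texts = 392702

xnli_dev_total = xnli_dev_texts_per_language * xnli_languages

# Assumption: plain concatenation of both sources, no filtering or deduplication.
combined_total = xnli_dev_total + mnli_train_texts
xnli_share = xnli_dev_total / combined_total

print(xnli_dev_total)        # 37350
print(combined_total)        # 430052
print(f"{xnli_share:.1%}")   # share of multilingual XNLI texts in the combined set
```

Under that assumption, the professionally translated XNLI texts make up a small fraction of the training data, and the bulk is English MNLI.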

Training procedure

The model was trained using the Hugging Face trainer with the following hyperparameters. The exact underlying model is mMiniLMv2-L6-H384-distilled-from-XLMR-Large.

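from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=3,               # total number of training epochs
    learning_rate=4e-05,              # peak learning rate
    per_device_train_batch_size=64,   # batch size per device during training
    per_device_eval_batch_size=120,   # batch size per device during evaluation
    warmup_ratio=0.06,                # fraction of training steps used for learning-rate warmup
    weight_decay=0.01,                # strength of weight decay
)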
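
To get a feel for what these hyperparameters imply, the sketch below estimates the number of optimizer steps and warmup steps. It assumes a single device, no gradient accumulation, and a training set made of the 392702 MNLI texts plus the 37350 XNLI development texts. None of these three assumptions is confirmed by the source, and warmup is rounded up here as a simplification.

```python
import math

# Hyperparameters from the TrainingArguments above.
num_train_epochs = 3
per_device_train_batch_size = 64
warmup_ratio = 0.06

# Assumptions (not stated in the source): one device, no gradient accumulation,
# and a training set of 392702 MNLI texts plus 37350 XNLI dev texts.
num_devices = 1
gradient_accumulation_steps = 1
train_examples = 392702 + 37350

effective_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_examples / effective_batch_size)  # last partial batch is kept
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounded up in this sketch

print(steps_per_epoch, total_steps, warmup_steps)  # 6720 20160 1210
```

With more devices or gradient accumulation, the effective batch size grows and the step counts shrink accordingly.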

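Evaluation results

The model was evaluated on the XNLI test set in 15 languages (5010 texts per language, 75150 in total). Note that multilingual NLI models can classify NLI texts without having received NLI training data in the specific language (cross-lingual transfer). This means the model can also perform NLI in the other languages it was pre-trained on, but performance is most likely lower than for the languages available in XNLI.

The average XNLI performance of multilingual-MiniLM-L6 reported in the paper is 0.68 (see table 11). This reimplementation has an average performance of 0.713. The increase is probably due to the addition of MNLI to the training data and to this model being distilled from XLM-RoBERTa-large instead of -base.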

Datasets	avg_xnli	ar	bg	de	el	en	es	fr	hi	ru	sw	th	tr	ur	vi	zh
Accuracy	0.713	0.687	0.742	0.719	0.723	0.789	0.748	0.741	0.691	0.714	0.642	0.699	0.696	0.664	0.723	0.721
Speed (texts/sec, A100 GPU, eval_batch=120)	6093.0	6210.0	6003.0	6053.0	5409.0	6531.0	6205.0	5615.0	5734.0	5970.0	6219.0	6289.0	6533.0	5851.0	5970.0	6798.0

Datasets	mnli_m	mnli_mm
Accuracy	0.782	0.8
Speed (texts/sec, A100 GPU, eval_batch=120)	4430.0	4395.0
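
As a quick consistency check on the first table above, the sketch below recomputes the XNLI average from the per-language accuracies and estimates how long one pass over the full test set would take at the reported average speed. The numbers are copied from the table; the timing is a rough estimate that ignores model loading and tokenization overhead.

```python
from statistics import mean

# Per-language XNLI test accuracy, copied from the first table above.
xnli_accuracy = {
    "ar": 0.687, "bg": 0.742, "de": 0.719, "el": 0.723, "en": 0.789,
    "es": 0.748, "fr": 0.741, "hi": 0.691, "ru": 0.714, "sw": 0.642,
    "th": 0.699, "tr": 0.696, "ur": 0.664, "vi": 0.723, "zh": 0.721,
}

avg_xnli = round(mean(xnli_accuracy.values()), 3)
best = max(xnli_accuracy, key=xnli_accuracy.get)
worst = min(xnli_accuracy, key=xnli_accuracy.get)

# Rough time to score the full XNLI test set at the reported average speed.
test_texts = 5010 * 15
seconds = test_texts / 6093.0

print(avg_xnli, best, worst)  # 0.713 en sw
print(round(seconds, 1))      # 12.3
```

The recomputed mean matches the reported avg_xnli of 0.713. English scores highest and Swahili lowest among the 15 evaluated languages.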

Limitations and bias

Please consult the original paper and the literature on the different NLI datasets for potential biases.

Citation

Please cite the following if you use this model: Laurer, Moritz, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. 2022. 'Less Annotating, More Classifying – Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI'. Preprint, June. Open Science Framework. https://osf.io/74b8k.

Ideas for cooperation or questions?

If you have questions or ideas for cooperation, contact me at m{dot}laurer{at}vu{dot}nl or on LinkedIn.

MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli

Author: MoritzLaurer

zero-shot-classification transformers


Created: 2023-02-11 13:10:37+00:00

Updated: 2024-04-22 11:03:52+00:00
