README

XLM-RoBERTa (large-sized model)

XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al. and first released in this repository.

Disclaimer: The team releasing XLM-RoBERTa did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages.

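RoBERTa is a pretrained model based on the Transformer architecture and trained in a self-supervised fashion. This means it was pretrained on raw text only, with no humans labelling the text in any way (which is why it can use large amounts of publicly available data), and an automatic process generates the inputs and labels from that text.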

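More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the procedure randomly masks 15% of the words in the input, then runs the entire masked sentence through the model, which has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.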
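The masking step can be illustrated with a small sketch. This is a simplification under stated assumptions, not the actual pretraining code: it works on whitespace-separated words rather than subword tokens, masks a fixed share of positions, and always substitutes the mask token (BERT-style pretraining typically also leaves some selected tokens unchanged or swaps in random ones). The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # the mask token used in the fill-mask example of this card


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM training example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it has to fill in.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Simplification: mask a fixed share of positions (at least one),
    # chosen uniformly at random without replacement.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))

    masked_tokens, labels = [], []
    for i, token in enumerate(tokens):
        if i in masked_positions:
            masked_tokens.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked_tokens.append(token)
            labels.append(None)
    return masked_tokens, labels


words = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()
masked, labels = mask_tokens(words, seed=0)
print(" ".join(masked))
```

The model sees the whole masked sequence at once, which is what lets it use context on both sides of each masked position.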

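This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: for instance, if you have a dataset of labeled sentences, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.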
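A classifier needs one fixed-size vector per sentence, while the model produces one vector per token. One common way to bridge the two is masked mean pooling; this is an assumption for illustration, since the card does not prescribe a pooling method. The sketch below uses plain Python lists and toy numbers as stand-ins for real model outputs, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    if not token_vectors:
        raise ValueError("token_vectors is empty")
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Toy stand-in for model output: 3 token positions, hidden size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]  # the last position is padding and is ignored
sentence_features = mean_pool(hidden_states, attention_mask)
print(sentence_features)  # [2.0, 3.0]
```

The resulting vector, one per sentence, is what would be passed to a standard classifier together with the sentence's label.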

Intended uses & limitations

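You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.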

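Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.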

Usage

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='xlm-roberta-large')
>>> unmasker("Hello I'm a <mask> model.")

[{'score': 0.10563907772302628,
  'sequence': "Hello I'm a fashion model.",
  'token': 54543,
  'token_str': 'fashion'},
 {'score': 0.08015287667512894,
  'sequence': "Hello I'm a new model.",
  'token': 3525,
  'token_str': 'new'},
 {'score': 0.033413201570510864,
  'sequence': "Hello I'm a model model.",
  'token': 3299,
  'token_str': 'model'},
 {'score': 0.030217764899134636,
  'sequence': "Hello I'm a French model.",
  'token': 92265,
  'token_str': 'French'},
 {'score': 0.026436051353812218,
  'sequence': "Hello I'm a sexy model.",
  'token': 17473,
  'token_str': 'sexy'}]
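The scores in a fill-mask result are probabilities. A pipeline of this kind typically obtains them by applying a softmax to the model's logits at the masked position and keeping the highest-scoring tokens. The sketch below shows that step on made-up numbers; the vocabulary, logits, and the helper name `top_k_softmax` are hypothetical and not taken from the model.

```python
import math


def top_k_softmax(logits, k=5):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_vocab = ["fashion", "new", "model", "French", "car", "tree"]  # hypothetical vocabulary
toy_logits = [2.0, 1.7, 0.8, 0.7, -1.0, -3.0]  # made-up scores for the masked position
for index, prob in top_k_softmax(toy_logits, k=3):
    print(f"{toy_vocab[index]}: {prob:.3f}")
```

Because the softmax runs over the whole vocabulary, the top few scores need not sum to 1, which matches the small values seen in the output above.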

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
model = AutoModelForMaskedLM.from_pretrained('xlm-roberta-large')

# Prepare the input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# Forward pass; output.logits holds the masked-LM prediction scores.
# Pass output_hidden_states=True to also get per-layer hidden states as features.
output = model(**encoded_input)

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

FacebookAI/xlm-roberta-large

Author: FacebookAI

fill-mask transformers

Created: 2022-03-02 23:29:04+00:00

Updated: 2024-02-19 12:48:30+00:00
