README

DistilCamemBERT-Sentiment

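We present DistilCamemBERT-Sentiment, a DistilCamemBERT model fine-tuned for sentiment analysis in French. The model is built on two datasets, Amazon Reviews and Allociné.fr, to minimize bias: Amazon reviews are short and similar to one another, whereas Allociné critiques are long, information-rich texts. This modeling is close to tblard/tf-allocine, which is based on the CamemBERT model.

The problem with CamemBERT-based models shows up at scale, for example in production, where inference cost can become a technical issue. To counteract this, we propose this model, which relies on DistilCamemBERT and halves inference time for the same power consumption.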

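Dataset

The dataset comprises 204,993 training reviews and 4,999 test reviews from Amazon, and 235,516 and 4,729 critiques, respectively, from the Allociné website. It is divided into five categories:

1 star: terrible appreciation,
2 stars: bad appreciation,
3 stars: neutral appreciation,
4 stars: good appreciation,
5 stars: excellent appreciation.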

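Evaluation results

In addition to accuracy (called exact accuracy here), and to be robust to estimation errors of ±1 star, we use the following definition as a performance measure: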

$$\mathrm{top\!-\!2\; acc}=\frac{1}{|\mathcal{O}|}\sum_{i\in\mathcal{O}}\sum_{0\leq l < 2}\mathbb{1}(\hat{f}_{i,l}=y_i)$$

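where $\hat{f}_{i,l}$ is the label with the l-th highest predicted score for observation $i$ (l = 0 being the top prediction), $y_i$ is the true label, $\mathcal{O}$ is the test set of observations, and $\mathbb{1}$ is the indicator function.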
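
The definition above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' evaluation script; the function name `top_k_accuracy` and the toy scores are assumptions made for the example.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 2
) -> float:
    """Fraction of observations whose true label is among the k best-scored classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one observation is required")
    hits = 0
    for row, y in zip(scores, labels):
        # Class indices sorted by decreasing score; keep ranks l = 0 .. k-1.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += int(y in ranked[:k])
    return hits / len(labels)


# Toy data (hypothetical): 3 observations, 5 classes,
# index 0 = "1 star", ..., index 4 = "5 stars".
toy_scores = [
    [0.05, 0.14, 0.36, 0.32, 0.13],  # best two classes: 2, then 3
    [0.70, 0.20, 0.05, 0.03, 0.02],  # best two classes: 0, then 1
    [0.10, 0.10, 0.10, 0.30, 0.40],  # best two classes: 4, then 3
]
toy_labels = [3, 0, 0]

exact_acc = top_k_accuracy(toy_scores, toy_labels, k=1)  # exact accuracy
top2_acc = top_k_accuracy(toy_scores, toy_labels, k=2)   # top-2 accuracy
print(exact_acc, top2_acc)
```

With k=1 the function reduces to the exact accuracy; with k=2 an observation also counts as correct when the true label is the second-best prediction.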

Class	Exact accuracy (%)	Top-2 accuracy (%)	Support
Global	61.01	88.80	9,698
1 star	87.21	77.17	1,905
2 stars	79.19	84.75	1,935
3 stars	77.85	78.98	1,974
4 stars	78.61	90.22	1,952
5 stars	85.96	82.92	1,932

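Benchmark

The model is compared with three reference models (see below). Because the models do not share exactly the same target definition, we detail the performance measure used for each. Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores.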
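
The text does not say how the mean inference time was obtained (number of runs, batch size, warm-up). As a rough sketch under those unknowns, a per-text mean latency could be measured as below; `mean_latency_ms` and `dummy_predict` are hypothetical names, and the dummy predictor only stands in for a real model call.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def mean_latency_ms(
    predict: Callable[[str], object], texts: Sequence[str], warmup: int = 1
) -> float:
    """Mean wall-clock time per call, in milliseconds."""
    # Warm-up calls are not timed (assumption: one-off start-up costs are excluded).
    for text in texts[:warmup]:
        predict(text)
    durations = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        durations.append((time.perf_counter() - start) * 1000.0)
    return mean(durations)


def dummy_predict(text: str) -> int:
    # Stand-in for a model call; replace with the real predictor.
    return len(text.split())


latency = mean_latency_ms(dummy_predict, ["Très bon film.", "Livraison trop lente."])
print(f"{latency:.4f} ms")
```

Passing a real predictor in place of `dummy_predict` gives a figure comparable in spirit to the times reported in this section, though not necessarily measured the same way.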

bert-base-multilingual-uncased-sentiment

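nlptown/bert-base-multilingual-uncased-sentiment is based on the multilingual uncased version of BERT. This sentiment analyzer is trained on Amazon reviews, like our model, so the targets and their definitions are the same.

Model	Time (ms)	Exact accuracy (%)	Top-2 accuracy (%)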
cmarkea/distilcamembert-base-sentiment	95.56	61.01	88.80
nlptown/bert-base-multilingual-uncased-sentiment	187.70	54.41	82.82

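tf-allociné and barthez-sentiment-classification

tblard/tf-allocine, based on CamemBERT, and moussaKam/barthez-sentiment-classification, based on BARThez, share the same binary-classification definition. To reduce our task to a two-class problem, we treat only the "1 star" and "2 stars" labels as negative sentiment and only "4 stars" and "5 stars" as positive. We exclude "3 stars", which can be read as a neutral class. In this setting the ±1 star estimation error problem disappears, so we use only the classic accuracy definition.

Model	Time (ms)	Exact accuracy (%)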
cmarkea/distilcamembert-base-sentiment	95.56	97.52
tblard/tf-allocine	329.74	95.69
moussaKam/barthez-sentiment-classification	197.95	94.29
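
The reduction from five classes to two described in this section can be sketched as follows. Observations whose true label is "3 stars" are dropped, as stated in the text. How a predicted "3 stars" is scored is not specified; this sketch counts it as an error, which is an assumption. The function names and the toy labels are hypothetical.

```python
from typing import Optional, Sequence

NEGATIVE = {"1 star", "2 stars"}
POSITIVE = {"4 stars", "5 stars"}


def to_binary(label: str) -> Optional[str]:
    """Map a star label to 'negative' / 'positive'; None for the neutral class."""
    if label in NEGATIVE:
        return "negative"
    if label in POSITIVE:
        return "positive"
    return None  # "3 stars" is treated as neutral


def binary_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Accuracy over observations whose true label is not neutral."""
    # Drop observations whose true label is "3 stars".
    kept = [(to_binary(p), to_binary(t)) for p, t in zip(predicted, truth)
            if to_binary(t) is not None]
    if not kept:
        raise ValueError("no non-neutral observation to evaluate")
    # Assumption: a neutral prediction (None) never matches, so it counts as an error.
    return sum(p == t for p, t in kept) / len(kept)


# Toy labels (hypothetical), using the label strings returned by the model.
toy_pred = ["5 stars", "2 stars", "3 stars", "1 star", "4 stars"]
toy_true = ["4 stars", "1 star", "3 stars", "5 stars", "4 stars"]

binary_acc = binary_accuracy(toy_pred, toy_true)
print(binary_acc)
```

In the toy data, one observation is dropped as neutral and three of the four remaining ones agree in polarity.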

How to use DistilCamemBERT-Sentiment

from transformers import pipeline

analyzer = pipeline(
    task='text-classification',
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)

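# return_all_scores=True returns the score of every class; recent transformers
# releases deprecate this flag in favor of top_k=None.
result = analyzer(
    "J'aime me promener en forêt même si ça me donne mal aux pieds.",
    return_all_scores=True
)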

result
[{'label': '1 star', 'score': 0.047529436647892},
 {'label': '2 stars', 'score': 0.14150355756282806},
 {'label': '3 stars', 'score': 0.3586442470550537},
 {'label': '4 stars', 'score': 0.3181498646736145},
 {'label': '5 stars', 'score': 0.13417290151119232}]
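
A list of per-class scores like the one shown above can be reduced to a single prediction, or to the two best candidates used by the top-2 metric. The sketch below copies the scores from the example output so that it runs without downloading the model; the variable names are illustrative.

```python
# Scores copied from the example output above.
result = [
    {"label": "1 star", "score": 0.047529436647892},
    {"label": "2 stars", "score": 0.14150355756282806},
    {"label": "3 stars", "score": 0.3586442470550537},
    {"label": "4 stars", "score": 0.3181498646736145},
    {"label": "5 stars", "score": 0.13417290151119232},
]

# Sort classes by decreasing score.
ranked = sorted(result, key=lambda item: item["score"], reverse=True)

best_label = ranked[0]["label"]                        # single prediction
top2_labels = [item["label"] for item in ranked[:2]]   # candidates for top-2

print(best_label)   # 3 stars
print(top2_labels)  # ['3 stars', '4 stars']
```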

Optimum + ONNX

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)

# Text-classification pipeline backed by the ONNX Runtime model
onnx_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Quantized onnx model
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL,
    file_name="model_quantized.onnx"
)

Citation

@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}

cmarkea/distilcamembert-base-sentiment

Author: cmarkea

text-classification transformers


Created: 2022-03-02 23:29:05+00:00

Updated: 2024-10-26 07:15:09+00:00
