E5-small
News (May 2023): please switch to e5-small-v2, which has better performance and the same usage.

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022

This model has 12 layers and the embedding size is 384.
Usage

Below is an example to encode queries and passages from the MS-MARCO passage ranking dataset.
```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small')
model = AutoModel.from_pretrained('intfloat/e5-small')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings so that dot products are cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
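Because the embeddings are L2-normalized, the matrix product above already yields cosine similarities (scaled by 100). A minimal numpy sketch of that equivalence, using toy 2-D vectors rather than real model output:

```python
import numpy as np

# Toy unit vectors standing in for normalized embeddings (hypothetical values).
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])

# For unit-length vectors the dot product equals the cosine similarity,
# so no further normalization is needed when scoring.
dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(dot, cosine)
```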
Training Details

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.

Benchmark Evaluation

Check out unilm/e5 to reproduce evaluation results on the BEIR and MTEB benchmarks.

Support for Sentence Transformers

Below is an example for usage with sentence_transformers.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-small')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
```
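With `normalize_embeddings=True` the rows of `embeddings` are unit vectors, so a plain matrix product gives the query-passage similarity matrix. A sketch on mock data (random stand-ins for the real 384-dimensional outputs of `model.encode`, assuming the same 2-query / 2-passage layout as above):

```python
import numpy as np

# Mock 4 x 384 embeddings; in practice these come from model.encode(...).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 384)).astype(np.float32)
# This is what normalize_embeddings=True does: L2-normalize each row.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Rows 0-1 are queries, rows 2-3 are passages, matching input_texts above.
scores = embeddings[:2] @ embeddings[2:].T
print(scores.shape)  # one similarity per (query, passage) pair
```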
Package requirements

```shell
pip install sentence_transformers~=2.2.2
```
Contributors: michaelfeil

FAQ
1. Do I need to add the prefix "query: " and "passage: " to input texts?

Yes, this is how the model is trained, otherwise you will see a performance degradation.

Here are some rules of thumb:
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, such as linear probing classification and clustering.
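The rules above can be sketched as a tiny helper (`add_e5_prefix` is a hypothetical name, not part of the model or any library):

```python
def add_e5_prefix(texts, role="query"):
    """Prepend the prefix E5 expects.

    role="passage" is only for the document side of asymmetric retrieval;
    symmetric tasks and feature extraction use "query: " for every text.
    """
    prefix = "passage: " if role == "passage" else "query: "
    return [prefix + t for t in texts]

print(add_e5_prefix(["summit define"]))
print(add_e5_prefix(["Definition of summit ..."], role="passage"))
```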
2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and pytorch could cause negligible but non-zero performance differences.
3. Why does the cosine similarity score distribution fall between 0.7 and 1.0?

This is a known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of their absolute values, so this should not be an issue.
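To illustrate why only relative order matters: ranking candidates is unaffected by the scores sitting in a narrow band. A small sketch (the score values below are made up):

```python
import numpy as np

# Hypothetical cosine similarities, all squeezed into [0.7, 1.0].
scores = np.array([0.83, 0.91, 0.78, 0.88])

# Ranking from best to worst match uses only the relative order,
# so the narrow absolute range is irrelevant.
ranking = np.argsort(-scores)
print(ranking.tolist())
```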
Citation

If you find our paper or models helpful, please consider citing as follows:
```
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
```
Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.
intfloat/e5-small
Author: intfloat
Tags: sentence-similarity, sentence-transformers
Created: 2022-12-07 06:48:03+00:00
Updated: 2023-08-07 04:58:08+00:00