
E5-small-v2

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022

This model has 12 layers and the embedding size is 384.
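
If you want to double-check those numbers, here is a minimal sketch using transformers' AutoConfig; the attribute names assume the BERT-style config this model ships with:

from transformers import AutoConfig

# Only config.json is fetched here; no model weights are downloaded.
config = AutoConfig.from_pretrained('intfloat/e5-small-v2')
print(config.num_hidden_layers)  # 12
print(config.hidden_size)        # 384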

Usage

Below is an example to encode queries and passages from the MS-MARCO passage ranking dataset.

import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


# Average the token embeddings, masking out padding positions via the attention mask.
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
model = AutoModel.from_pretrained('intfloat/e5-small-v2')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

Training Details

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.

Benchmark Evaluation

Check out unilm/e5 to reproduce evaluation results on the BEIR and MTEB benchmarks.
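
For a quick sanity check outside that repository, a minimal sketch using the community mteb package is shown below; the API and the task name here are assumptions, and this simple setup does not add the "query: "/"passage: " prefixes, so scores will not exactly match the reported numbers.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-small-v2')
evaluation = MTEB(tasks=['Banking77Classification'])  # any small MTEB task works for a smoke test
evaluation.run(model, output_folder='results')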

Support for Sentence Transformers

Below is an example for usage with sentence_transformers.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-small-v2')
input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
embeddings = model.encode(input_texts, normalize_embeddings=True)
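
Since normalize_embeddings=True returns L2-normalized vectors, ranking scores can be computed with a plain dot product, mirroring the transformers example above (this continues from the snippet just shown):

# Dot products of normalized embeddings are cosine similarities (scaled as in the example above).
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())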

Package requirements

pip install sentence_transformers~=2.2.2

Contributors: michaelfeil

FAQ

1. Do I need to add the prefix "query: " and "passage: " to input texts?

Yes, this is how the model is trained; otherwise you will see a performance degradation.

Here are some rules of thumb (see the sketch after this list):

  • Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.

  • Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.

  • Use the "query: " prefix if you want to use embeddings as features, such as linear probing classification or clustering.
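
A minimal sketch of how these rules translate into code; the helper name and arguments below are made up for illustration only:

def add_e5_prefix(texts, task='retrieval', side='query'):
    # Asymmetric tasks: queries get "query: ", documents get "passage: ".
    # Symmetric tasks and feature extraction: everything gets "query: ".
    prefix = 'passage: ' if (task == 'retrieval' and side == 'passage') else 'query: '
    return [prefix + t for t in texts]

queries = add_e5_prefix(['how much protein should a female eat'], task='retrieval', side='query')
passages = add_e5_prefix(['Protein requirements vary by age ...'], task='retrieval', side='passage')
sts_inputs = add_e5_prefix(['A man is playing a guitar.'], task='sts')  # symmetric task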

2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and pytorch could cause negligible but non-zero performance differences.

3. Why do the cosine similarity scores fall in the range 0.7 to 1.0?

This is known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
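
For intuition, a minimal sketch of an in-batch-negatives InfoNCE loss with temperature 0.01; this only illustrates the effect of the low temperature and is not the authors' training code:

import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.01):
    # query_emb, passage_emb: (batch, dim), L2-normalized; positives are aligned by index.
    logits = query_emb @ passage_emb.T / temperature  # cosine similarities divided by a low temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    # Dividing by 0.01 sharpens the softmax, which pushes cosine similarities into a narrow high range.
    return F.cross_entropy(logits, labels)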

Citation

If you find our paper or models helpful, please consider citing as follows:

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}

Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.
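
As a quick illustration of the truncation behavior (using the same tokenizer settings as the usage example above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
long_text = 'passage: ' + 'word ' * 5000  # far longer than the model can handle
encoded = tokenizer(long_text, max_length=512, truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape)  # torch.Size([1, 512]): everything beyond 512 tokens is dropped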

intfloat/e5-small-v2

Author: intfloat

sentence-similarity sentence-transformers

Created: 2023-05-19 06:45:35+00:00

Updated: 2025-02-17 03:24:44+00:00


Files (20)

.gitattributes
1_Pooling/config.json
README.md
config.json
model.onnx ONNX
model.safetensors
modules.json
onnx/model_O4.onnx ONNX
onnx/model_qint8_avx512_vnni.onnx ONNX
openvino/openvino_model.bin
openvino/openvino_model.xml
openvino/openvino_model_qint8_quantized.bin
openvino/openvino_model_qint8_quantized.xml
pytorch_model.bin
sentence_bert_config.json
special_tokens_map.json
tf_model.h5
tokenizer.json
tokenizer_config.json
vocab.txt