# gte-small
General Text Embeddings (GTE) model, from the paper *Towards General Text Embeddings with Multi-stage Contrastive Learning*.

The GTE models are trained by Alibaba DAMO Academy. They are built on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. This enables them to be applied to a variety of downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.
## Metrics

We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.
| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-large | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| gte-base | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| gte-small | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |
## Usage

Code example:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Similarity of the first text (the query) to each of the remaining texts.
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
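Because the embeddings are L2-normalized, the final matrix product is simply cosine similarity scaled by 100: the first text acts as the query and the remaining three as candidates, so the matching passage ("Beijing") should receive the highest score.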
Use with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('thenlper/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
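The repository also ships ONNX and OpenVINO exports of the weights. As a sketch, assuming a recent sentence-transformers release (3.2 or later, installed with its onnxruntime extra), the `backend` argument can load the bundled ONNX export directly:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Assumes sentence-transformers >= 3.2 with onnxruntime installed;
# backend="onnx" loads the repository's ONNX export instead of the
# PyTorch weights.
model = SentenceTransformer('thenlper/gte-small', backend="onnx")

sentences = ['That is a happy person', 'That is a very happy person']
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```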
## Limitation

This model works exclusively for English texts, and any long texts will be truncated to at most 512 tokens.
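If a document longer than 512 tokens must be embedded anyway, one common workaround is to split it into overlapping 512-token windows, embed each window, and average the window embeddings. The sketch below is an illustration rather than an officially recommended recipe, and assumes the `tokenizer`, `model`, `average_pool`, and `F` objects from the code example above:

```python
import torch

long_text = "..."  # any English text longer than 512 tokens

# Ask the (fast) tokenizer for overlapping 512-token windows instead of
# letting it silently truncate the input.
batch_dict = tokenizer(
    long_text,
    max_length=512,
    stride=64,                        # tokens shared by consecutive windows
    truncation=True,
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)
# The model's forward pass does not accept this bookkeeping tensor.
batch_dict.pop("overflow_to_sample_mapping", None)

with torch.no_grad():
    outputs = model(**batch_dict)

# One embedding per window; average them into a single document vector.
window_embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
doc_embedding = F.normalize(window_embeddings.mean(dim=0, keepdim=True), p=2, dim=1)
```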
## Citation

If you find our paper or models helpful, please consider citing them as follows:
```
@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```