
gte-large-en-v1.5

We introduce the gte-v1.5 series: upgraded gte embedding models that support a context length of up to 8192 while further improving model performance. The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU).
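
The GLU part of that recipe replaces BERT's plain feed-forward block with a gated one. Below is a minimal sketch of a GLU-family FFN for orientation; the choice of SiLU as the gate activation is an assumption for illustration, not a claim about gte's exact variant.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)   # gating branch
        self.up = nn.Linear(dim, hidden)     # value branch
        self.down = nn.Linear(hidden, dim)   # projection back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU: elementwise product of an activated gate with a linear value
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = GatedFFN(dim=1024, hidden=4096)
print(ffn(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 1024])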

Within the same model-size tier, the gte-v1.5 series achieves the highest scores on the MTEB benchmark, and it also performs strongly on the LoCo long-context retrieval tests (see Evaluation).

We have also released gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranks 2nd on MTEB and 1st on C-MTEB.

Model List

| Model | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
| --- | --- | --- | --- | --- | --- | --- |
| gte-Qwen1.5-7B-instruct | Multilingual | 7720 | 32768 | 4096 | 67.34 | 87.57 |
| gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
| gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |

Get Started with the Model

Use the code below to get started with the model.

# Requires transformers>=4.36.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
# CLS pooling: the embedding is the last hidden state of the first ([CLS]) token
embeddings = outputs.last_hidden_state[:, 0]
 
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

It is recommended to install xformers and enable unpadding for acceleration; see enable-unpadding-and-xformers.
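
A sketch of what enabling this might look like, assuming the model's remote code accepts unpad_inputs and use_memory_efficient_attention as config overrides; verify the exact flag names against the enable-unpadding-and-xformers doc.

# pip install xformers -- the two flags below are assumptions; check the linked doc
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    'Alibaba-NLP/gte-large-en-v1.5',
    trust_remote_code=True,
    unpad_inputs=True,                    # assumed flag: skip compute on padding tokens
    use_memory_efficient_attention=True,  # assumed flag: route attention through xformers
    torch_dtype=torch.float16,            # fp16, where the xformers kernels pay off
).to('cuda')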

Use with sentence-transformers:

# Requires sentence_transformers>=2.7.0

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

Use with transformers.js:

// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-large-en-v1.5', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings ] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979677899]

Training Details

Training Data

  • Masked Language Modeling (MLM): c4-en
  • Weak-supervised contrastive pre-training (CPT): GTE pre-training data
  • Supervised contrastive fine-tuning: GTE fine-tuning data

Training Procedure

To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model is first pre-trained with MLM on shorter sequences. The data is then resampled, reducing the proportion of short texts, and MLM pre-training continues.

The entire training process is as follows (a masking sketch follows the list):

  • MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
  • MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
  • MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
  • CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
  • Fine-tuning: TODO
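
For orientation only, masking at mlm_probability 0.3 with the standard Hugging Face collator looks like this sketch; it is not the authors' actual training code.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-large-en-v1.5')
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,  # matches the MLM stages listed above
)
batch = collator([tokenizer('a short example document', truncation=True, max_length=512)])
# labels are -100 everywhere except the ~30% of positions selected for masking
print(batch['input_ids'][0])
print(batch['labels'][0])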

Evaluation

MTEB

Results for the other models are retrieved from the MTEB leaderboard.

Evaluation setting for gte: mteb==1.2.0, fp16 automatic mixed precision, max_length=8192, with the ntk scaling factor set to 2 (equivalent to rope_base * 2).
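
NTK scaling here simply multiplies the RoPE base, which uniformly lowers the rotary frequencies and so stretches the usable context. A minimal illustration of that relationship, not the model's internal code:

import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    # Inverse frequencies of rotary position embeddings, one per channel pair
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

base = 160000.0   # rope_base from the MLM-8192 stage above
ntk_factor = 2    # the evaluation setting: equivalent to rope_base * 2
print(rope_inv_freq(8, base))
print(rope_inv_freq(8, base * ntk_factor))  # uniformly lower frequencies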

| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
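
As a reference point, a single-task run with the pinned mteb==1.2.0 API looks roughly like the sketch below; the task choice is illustrative, while the leaderboard average covers 56 English tasks.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
evaluation = MTEB(tasks=['STSBenchmark'])  # one task for illustration
results = evaluation.run(model, output_folder='results/gte-large-en-v1.5')
print(results)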

LoCo

| Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |

Citation

If you find our paper or models helpful, please consider citing:

@article{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  journal={arXiv preprint arXiv:2407.19669},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}

Alibaba-NLP/gte-large-en-v1.5

Author: Alibaba-NLP

Tags: sentence-similarity, transformers

Created: 2024-04-20 02:54:30+00:00

Updated: 2024-08-09 03:32:05+00:00

Files (18)

.gitattributes
1_Pooling/config.json
README.md
config.json
model.safetensors
modules.json
onnx/model.onnx
onnx/model_bnb4.onnx
onnx/model_fp16.onnx
onnx/model_int8.onnx
onnx/model_q4.onnx
onnx/model_quantized.onnx
onnx/model_uint8.onnx
sentence_bert_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt
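
Since the repository ships the ONNX exports listed above, a plain onnxruntime path looks roughly like the sketch below. It assumes the graph's input names match the tokenizer's output keys and that the first output is the last hidden state; check sess.get_inputs() and sess.get_outputs() to confirm.

# pip install onnxruntime transformers huggingface_hub
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo = 'Alibaba-NLP/gte-large-en-v1.5'
onnx_path = hf_hub_download(repo, 'onnx/model.onnx')
tokenizer = AutoTokenizer.from_pretrained(repo)

sess = ort.InferenceSession(onnx_path)
enc = tokenizer(['That is a happy person'], return_tensors='np')
# Feed only the inputs the graph actually declares (assumed to match tokenizer keys)
feeds = {i.name: enc[i.name] for i in sess.get_inputs() if i.name in enc}
last_hidden = sess.run(None, feeds)[0]  # assumed: first output is last_hidden_state
embedding = last_hidden[:, 0]           # CLS pooling, matching the usage above
embedding /= np.linalg.norm(embedding, axis=1, keepdims=True)
print(embedding.shape)                  # (1, 1024)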