# gte-large-en-v1.5

We introduce the gte-v1.5 series, upgraded gte embedding models that support a context length of up to 8192 while further improving model performance. The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU).

The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within the same model size category and performs competitively on the LoCo long-context retrieval tests (see Evaluation).

We also present gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranks 2nd on MTEB and 1st on C-MTEB.
- Developed by: Institute for Intelligent Computing, Alibaba Group
- Model type: Text Embeddings
- Paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
## Model list

| Models | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
|---|---|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Multilingual | 7720 | 32768 | 4096 | 67.34 | 87.57 |
| gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
| gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |
## Get started with the model

Use the code below to get started with the model.
```python
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]  # CLS pooling

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
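Because the embeddings are L2-normalized above, the scaled dot product is simply cosine similarity multiplied by 100. A minimal NumPy sketch with illustrative vectors (not real model outputs) makes the equivalence concrete:

```python
import numpy as np

# Two illustrative embedding vectors (not real model outputs)
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# L2-normalize, as done with F.normalize in the snippet above
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# The dot product of unit vectors equals cosine similarity
dot = float(a_n @ b_n)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(dot, cos)  # both 0.6
```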
It is recommended to install xformers and enable unpadding for acceleration; refer to enable-unpadding-and-xformers.

Use it with sentence-transformers:
```python
# Requires sentence_transformers>=2.7.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
Use it with transformers.js:
```js
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-large-en-v1.5', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];

const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979677899]
```
## Training Details

### Training Data

### Training Procedure
To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter lengths. Then, we resample the data to reduce the proportion of short texts and continue MLM pre-training.

The entire training process is as follows:
- MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
- MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
- CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
- Fine-tuning: TODO
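As a rough illustration of the MLM objective used in the stages above, here is a minimal sketch of random token masking with probability 0.3. The token IDs and `MASK_ID` are hypothetical, and real implementations (e.g. the standard BERT recipe) also mix in random-token and keep-original replacements, which this sketch omits:

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
MLM_PROBABILITY = 0.3  # as in the MLM-512 / MLM-2048 / MLM-8192 stages

def mask_tokens(token_ids, p=MLM_PROBABILITY, seed=0):
    """Replace each token with MASK_ID with probability p;
    return the masked sequence and per-position labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < p:
            masked.append(MASK_ID)
            labels.append(tid)    # the model must predict the original token
        else:
            masked.append(tid)
            labels.append(-100)   # ignored by the cross-entropy loss
    return masked, labels

masked, labels = mask_tokens(list(range(10)))
print(masked)
print(labels)
```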
## Evaluation

### MTEB

The results of other models are retrieved from the MTEB leaderboard.

Evaluation setting for the gte models: mteb==1.2.0, fp16 auto mixed precision, max_length=8192, with the ntk scaling factor set to 2 (equivalent to rope_base * 2).
| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
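The ntk scaling factor of 2 used in the evaluation setting above effectively enlarges the RoPE base, which lowers the rotary frequencies and stretches their wavelengths so positions beyond the training length remain distinguishable. A minimal sketch of how the inverse frequencies change (the head dimension of 64 is illustrative):

```python
def rope_inv_freq(dim, base):
    """Standard RoPE inverse frequencies: base^(-2i/dim) for each dimension pair."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

base = 160000    # rope_base of the MLM-8192 stage
ntk_factor = 2   # evaluation setting: equivalent to rope_base * 2

orig = rope_inv_freq(64, base)
scaled = rope_inv_freq(64, base * ntk_factor)

# Enlarging the base lowers every non-constant frequency,
# i.e. lengthens the rotary wavelengths.
print(orig[1], scaled[1])
```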
### LoCo
| Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|---|---|---|---|---|---|---|---|---|
| gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
## Citation

If you find our paper or models helpful, please consider citing:
```
@article{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  journal={arXiv preprint arXiv:2407.19669},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```
Alibaba-NLP/gte-large-en-v1.5
by Alibaba-NLP
Created: 2024-04-20 02:54:30+00:00
Updated: 2024-08-09 03:32:05+00:00