# ModernBERT Embed
ModernBERT Embed is an embedding model trained from ModernBERT-base, bringing the latest advances of ModernBERT to embeddings!

Trained on the Nomic Embed weakly-supervised and supervised datasets, the model also supports Matryoshka Representation Learning at a dimension of 256, reducing memory use by 3x with minimal performance loss.
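To make the memory claim concrete, here is a minimal sketch (ours, not from the model card) of Matryoshka-style truncation using numpy: keep the leading 256 of 768 dimensions and re-normalize. The corpus size is purely illustrative.

```python
import numpy as np

num_docs = 10_000  # illustrative corpus size, not from the model card

# Full 768-dim float32 embeddings vs. the leading 256 Matryoshka dimensions
full = np.random.randn(num_docs, 768).astype(np.float32)
truncated = full[:, :256].copy()
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)  # re-normalize after slicing

print(full.nbytes / truncated.nbytes)  # 3.0, i.e. a 3x memory reduction
```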
## Performance
| Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|---|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | 85.2 | 55.7 | 52.8 | 82.1 | 30.1 |
| nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
| modernbert-embed-base | 768 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
| nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | 30.05 |
| modernbert-embed-base | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
## Usage

You can use these models directly with the latest transformers release; transformers>=4.48.0 is required:
```bash
pip install "transformers>=4.48.0"
```
As a reminder, this model was trained similarly to Nomic Embed and requires prefixes to be added to the input. For more information, see the instructions in Nomic Embed.

For most use cases, prepending `search_query: ` to queries and `search_document: ` to documents is sufficient.
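As a minimal illustration of this convention (the `with_prefix` helper is hypothetical, not part of any library):

```python
# Hypothetical helper illustrating the prefix convention; not part of any library.
def with_prefix(texts, kind="query"):
    prefix = "search_query: " if kind == "query" else "search_document: "
    return [prefix + t for t in texts]

print(with_prefix(["What is TSNE?"]))
# ['search_query: What is TSNE?']
```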
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 768) (1, 768)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
```
<details><summary>Click to see Sentence Transformers usage with Matryoshka Truncation</summary>

In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the `truncate_dim` parameter when loading the SentenceTransformer model.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed-base", truncate_dim=256)

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 256) (1, 256)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7759],
#         [0.3419]])
```
Note the small differences compared to the full 768-dimensional similarities.
</details>
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding tokens via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# Mean-pool and L2-normalize so the dot product equals cosine similarity.
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 768]) torch.Size([1, 768])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7214],
#         [0.3260]])
```
<details><summary>Click to see Transformers usage with Matryoshka Truncation</summary>

In transformers, you can truncate embeddings to a smaller dimension by slicing the mean-pooled embeddings before normalization.
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding tokens via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van der Maaten"]

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")
truncate_dim = 256

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# Slice to the first `truncate_dim` Matryoshka dimensions before normalizing.
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = doc_embeddings[:, :truncate_dim]
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 256]) torch.Size([1, 256])

similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.7759],
#         [0.3419]])
```
Note the small differences compared to the full 768-dimensional similarities.
</details>
### Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM:

```bash
npm i @huggingface/transformers
```

Then, you can compute embeddings as follows:
```js
import { pipeline, matmul } from '@huggingface/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "nomic-ai/modernbert-embed-base",
  { dtype: "fp32" }, // Supported options: "fp32", "fp16", "q8", "q4", "q4f16"
);

// Embed queries and documents
const query_embeddings = await extractor([
  "search_query: What is TSNE?",
  "search_query: Who is Laurens van der Maaten?",
], { pooling: "mean", normalize: true });
const doc_embeddings = await extractor([
  "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van der Maaten",
], { pooling: "mean", normalize: true });

// Compute similarity scores
const similarities = await matmul(query_embeddings, doc_embeddings.transpose(1, 0));
console.log(similarities.tolist()); // [[0.721383273601532], [0.3259955644607544]]
```
## Training

Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!

We train our embedder using a multi-stage training pipeline. Starting from a long-context BERT model, the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial at this stage.
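For intuition, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) objective of the kind used in such pipelines; the temperature value and the random embeddings are illustrative, not the exact training setup.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss: each query's positive document sits at the
    same batch index; all other documents in the batch act as negatives."""
    query_emb = F.normalize(query_emb, dim=1)
    doc_emb = F.normalize(doc_emb, dim=1)
    logits = query_emb @ doc_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy example with random embeddings in place of model outputs
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```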
For more details, see the Nomic Embed Technical Report and the corresponding blog post.

The training data for these models is released in its entirety. For more details, see the contrastors repository.
## Join the Nomic Community
- Nomic: https://nomic.ai
- Discord: https://discord.gg/myY5YDR8z8
- Twitter: https://twitter.com/nomic_ai
## Citation

If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
