Documentation
Granite-Embedding-30m-Sparse
Model Summary: Granite-Embedding-30m-Sparse is a 30M-parameter sparse bi-encoder embedding model from the Granite Experimental family that can be used to generate high-quality text embeddings. The model produces a variable-length, bag-of-words-like dictionary containing an expansion of the sentence's tokens together with their corresponding weights. It is trained on a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses and datasets collected and generated by IBM. While maintaining competitive scores on academic benchmarks such as BEIR, the model also performs well in many enterprise use cases. It was developed using retrieval-oriented pretraining, contrastive finetuning, and knowledge distillation to improve performance.
- Developers: Granite Embedding Team, IBM
- GitHub Repository: ibm-granite/granite-embedding-models
- Paper: Technical report
- Release Date: February 26th, 2025
- License: Apache 2.0
Supported Languages: English.
Intended Use: The model is designed to produce, for a given text, a variable-length, bag-of-words-like dictionary containing an expansion of the sentence's tokens with their corresponding weights, which can be used for text similarity, retrieval, and search applications.
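To illustrate how such token-weight dictionaries are compared, the sketch below computes a sparse inner product between a hypothetical query expansion and two document expansions. All token strings and weights here are invented for illustration only; the real model produces expansions over its 50,265-entry vocabulary.

```python
# Hypothetical sketch: scoring sparse "bag-of-words" embeddings.
# Token strings and weights are invented; the real model assigns
# learned weights over its full vocabulary.

def sparse_dot(query: dict, doc: dict) -> float:
    """Inner product over the tokens the two sparse vectors share."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)

query = {"turing": 2.5, "born": 2.1, "birthplace": 0.8}   # expanded query
doc_a = {"turing": 2.9, "born": 2.4, "london": 1.5}       # relevant doc
doc_b = {"ai": 2.7, "intelligence": 2.3, "founded": 1.3}  # unrelated doc

print(round(sparse_dot(query, doc_a), 2))  # overlapping tokens -> high score
print(sparse_dot(query, doc_b))            # no overlap -> 0
```

Only tokens present in both vectors contribute to the score, which is why an inverted index (as used by Milvus below) can evaluate it efficiently.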
Usage with Milvus: The model is compatible with the Milvus vector database and is straightforward to use.
First, install the pymilvus library:

```shell
pip install pymilvus[model]
```

Then the model can be used to encode pairs of text and compute the similarity between their representations:
```python
from pymilvus import model
from pymilvus import MilvusClient, DataType

client = MilvusClient("./milvus_demo.db")
client.drop_collection(collection_name="my_sparse_collection")

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_fields=True,
)
schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=False, max_length=100)
schema.add_field(field_name="embeddings", datatype=DataType.SPARSE_FLOAT_VECTOR)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embeddings",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
    params={"drop_ratio_build": 0.2},
)
client.create_collection(
    collection_name="my_sparse_collection",
    schema=schema,
    index_params=index_params,
)

embeddings_model = model.sparse.SpladeEmbeddingFunction(
    model_name="ibm-granite/granite-embedding-30m-sparse",
    device="cpu",
    batch_size=2,
    k_tokens_query=50,
    k_tokens_document=192,
)

# Prepare the documents to be indexed
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# SpladeEmbeddingFunction.encode_documents returns a sparse matrix or a sparse
# array depending on the milvus-model version; reshape(1, -1) ensures the
# correct format for ingestion.
doc_vector = [
    {"embeddings": doc_emb.reshape(1, -1), "id": f"item_{i}"}
    for i, doc_emb in enumerate(embeddings_model.encode_documents(docs))
]
client.insert(
    collection_name="my_sparse_collection",
    data=doc_vector,
)

# Prepare search parameters
search_params = {
    "params": {"drop_ratio_search": 0.2},  # additional optional search parameters
}

# Prepare the query vectors; encode_queries applies the query-side
# sparsity budget (k_tokens_query)
queries = [
    "When was artificial intelligence founded",
    "Where was Turing born?",
]
query_vector = embeddings_model.encode_queries(queries)

res = client.search(
    collection_name="my_sparse_collection",
    data=query_vector,
    limit=1,  # top-k documents to return
    output_fields=["id"],
    search_params=search_params,
)
for r in res:
    print(r)
```
Usage with Sentence Transformers:
First, install the Sentence Transformers library:

```shell
pip install -U sentence-transformers
```

Then you can load the model and run inference:
```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("ibm-granite/granite-embedding-30m-sparse")

# Run inference
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = model.encode_document(docs, max_active_dims=192)
print(docs_embeddings.shape)
# [3, 50265]

queries = ["When was artificial intelligence founded", "Where was Turing born?"]
queries_embeddings = model.encode_query(queries, max_active_dims=50)
print(queries_embeddings.shape)
# [2, 50265]

# Compute the similarity scores between the embeddings
similarities = model.similarity(queries_embeddings, docs_embeddings)
print(similarities.shape)
# [2, 3]

for i, query in enumerate(queries):
    best_doc_index = similarities[i].argmax().item()
    print(f"Query: {query}")
    print(f"Best doc associate: Similarity: {similarities[i][best_doc_index]:.4f}, Doc: {docs[best_doc_index]}")
    intersection = model.intersection(queries_embeddings[i], docs_embeddings[best_doc_index])
    decoded_intersection = model.decode(intersection, top_k=10)
    print("Top 10 tokens influencing the similarity:")
    for token, score in decoded_intersection:
        print(f"Token: {token}, Score: {score:.4f}")

# Query: When was artificial intelligence founded
# Best doc associate: Similarity: 12.3641, Doc: Artificial intelligence was founded as an academic discipline in 1956.
# Top 10 tokens influencing the similarity:
# Token: ĠAI, Score: 2.7591
# Token: Ġintelligence, Score: 2.2971
# Token: Ġartificial, Score: 1.7654
# Token: Ġfounded, Score: 1.3254
# Token: Ġinvention, Score: 0.9808
# Token: Ġlearning, Score: 0.4847
# Token: Ġcomputer, Score: 0.4789
# Token: Ġrobot, Score: 0.3466
# Token: Ġestablishment, Score: 0.3371
# Token: Ġscientific, Score: 0.2804
# Query: Where was Turing born?
# Best doc associate: Similarity: 17.1359, Doc: Born in Maida Vale, London, Turing was raised in southern England.
# Top 10 tokens influencing the similarity:
# Token: uring, Score: 2.9761
# Token: ĠTuring, Score: 2.4544
# Token: Ġborn, Score: 2.4314
# Token: ing, Score: 1.7760
# Token: ure, Score: 1.7626
# Token: Ġcomput, Score: 1.3356
# Token: Ġraised, Score: 1.3285
# Token: able, Score: 1.1940
# Token: Ġphilosopher, Score: 0.4118
# Token: Ġmachine, Score: 0.3977
```
Evaluation:
Granite-Embedding-30m-Sparse is competitive with naver/splade-v3-distilbert despite having half as many parameters. We also compare the sparse model against its similarly sized dense counterpart, ibm-granite/granite-embedding-30m-english. The table below reports model performance on MTEB Retrieval (i.e., BEIR). For consistency with the results reported for naver/splade-v3-distilbert, CQADupstack and MS-MARCO are excluded from the table below.
| Model | Parameters (M) | Vocab Size | BEIR Retrieval (13) |
|---|---|---|---|
| naver/splade-v3-distilbert | 67 | 30522 | 50.0 |
| granite-embedding-30m-english | 30 | 50265 | 50.6 |
| granite-embedding-30m-sparse | 30 | 50265 | 50.8 |
Model Architecture: Granite-Embedding-30m-Sparse is based on an encoder-only, RoBERTa-like transformer architecture, trained internally at IBM Research.
| Model | granite-embedding-30m-sparse |
|---|---|
| Embedding size | 384 |
| Number of layers | 6 |
| Number of attention heads | 12 |
| Intermediate size | 1536 |
| Activation function | GeLU |
| Vocabulary size | 50265 |
| Max. sequence length | 512 |
| # Parameters | 30M |
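As a back-of-the-envelope check, the headline parameter count can be roughly reconstructed from the figures in the table. The sketch below ignores bias terms and LayerNorm parameters, and assumes learned position embeddings sized to the 512-token maximum sequence length, so it is an approximation rather than the exact checkpoint size.

```python
# Approximate parameter count from the architecture table.
# Bias and LayerNorm parameters are ignored (they contribute little).

d_model, n_layers, d_ffn = 384, 6, 1536
vocab, max_pos = 50265, 512

embeddings = vocab * d_model + max_pos * d_model  # token + position embeddings
attn_per_layer = 4 * d_model * d_model            # Q, K, V, and output projections
ffn_per_layer = 2 * d_model * d_ffn               # up- and down-projections
total = embeddings + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.1f}M")  # close to the 30M reported in the table
```

Most of the budget sits in the 50,265 x 384 token-embedding matrix, which is also the output space over which the sparse token weights are produced.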
Training Data: Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:
| Dataset | Num. Pairs |
|---|---|
| SPECTER citation triplets | 684,100 |
| Stack Exchange Duplicate questions (titles) | 304,525 |
| Stack Exchange Duplicate questions (bodies) | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | 250,460 |
| Natural Questions (NQ) | 100,231 |
| SQuAD2.0 | 87,599 |
| PAQ (Question, Answer) pairs | 64,371,441 |
| Stack Exchange (Title, Answer) pairs | 4,067,139 |
| Stack Exchange (Title, Body) pairs | 23,978,013 |
| Stack Exchange (Title+Body, Answer) pairs | 187,195 |
| S2ORC Citation pairs (Titles) | 52,603,982 |
| S2ORC (Title, Abstract) | 41,769,185 |
| S2ORC (Citations, abstracts) | 52,603,982 |
| WikiAnswers Duplicate question pairs | 77,427,422 |
| SearchQA | 582,261 |
| HotpotQA | 85,000 |
| Fever | 109,810 |
| Arxiv | 2,358,545 |
| Wikipedia | 20,745,403 |
| PubMed | 20,000,000 |
| Miracl En Pairs | 9,016 |
| DBPedia Title-Body Pairs | 4,635,922 |
| Synthetic: Query-Wikipedia Passage | 1,879,093 |
| Synthetic: Fact Verification | 9,888 |
| IBM Internal Triples | 40,290 |
| IBM Internal Title-Body Pairs | 1,524,586 |
Notably, we do not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license.
Infrastructure: We trained the Granite embedding models on IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training our models over multiple GPUs.
Ethical Considerations and Limitations: The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-Embedding-30m-Sparse is trained only for English texts, with a context length of 512 tokens (longer texts will be truncated to this length).
Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Citation
```
@misc{awasthy2025graniteembeddingmodels,
      title={Granite Embedding Models},
      author={Parul Awasthy and Aashka Trivedi and Yulong Li and Mihaela Bornea and David Cox and Abraham Daniels and Martin Franz and Gabe Goodhart and Bhavani Iyer and Vishwajeet Kumar and Luis Lastras and Scott McCarley and Rudra Murthy and Vignesh P and Sara Rosenthal and Salim Roukos and Jaydeep Sen and Sukriti Sharma and Avirup Sil and Kate Soule and Arafat Sultan and Radu Florian},
      year={2025},
      eprint={2502.20204},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2502.20204},
}
```
raul3820/granite-30m-sparse-onnx
By raul3820
Created: 2026-01-03 22:41:20+00:00
Updated: 2026-01-03 22:41:24+00:00