ONNX Model Library

Granite-Embedding-30m-Sparse

Model Summary: Granite-Embedding-30m-Sparse is a 30M-parameter sparse biencoder embedding model from the Granite Experimental family that can be used to generate high-quality text embeddings. The model produces a variable-length, bag-of-words-like dictionary containing expansions of sentence tokens and their corresponding weights. It is trained using a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses, and IBM-collected and IBM-generated datasets. While maintaining competitive scores on academic benchmarks such as BEIR, this model also performs well on many enterprise use cases. It was developed using retrieval-oriented pretraining, contrastive finetuning, and knowledge distillation to improve performance.

Supported Languages: English.

Intended Use: The model is designed to produce, for a given text, a variable-length bag-of-words-like dictionary containing expansions of sentence tokens and their corresponding weights. It can be used for text-similarity, retrieval, and search applications.
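As a toy illustration (not the model itself), such a sparse representation can be thought of as a token-to-weight mapping, with relevance scored by the inner product over shared tokens. The tokens and weights below are made up for illustration:

```python
def sparse_dot(a: dict, b: dict) -> float:
    # Inner product over the tokens the two dictionaries share.
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)

query = {"turing": 2.9, "born": 2.4, "where": 0.6}
doc = {"turing": 2.5, "born": 2.1, "london": 1.8, "raised": 1.3}
score = sparse_dot(query, doc)  # 2.9*2.5 + 2.4*2.1 = 12.29
```

Tokens absent from either side contribute nothing, which is what makes an inverted index over the non-zero entries efficient.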

Usage with Milvus: The model is compatible with the Milvus vector database and is straightforward to use:

First, install the pymilvus library:

pip install pymilvus[model]

The model can then be used to encode pairs of text and compute the similarity between their representations:


from pymilvus import model
from pymilvus import MilvusClient, DataType

client = MilvusClient("./milvus_demo.db")

client.drop_collection(collection_name="my_sparse_collection")

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=False, max_length=100)
schema.add_field(field_name="embeddings", datatype=DataType.SPARSE_FLOAT_VECTOR)

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="embeddings",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
    params={"drop_ratio_build": 0.2},
)
client.create_collection(
    collection_name="my_sparse_collection",
    schema=schema,
    index_params=index_params
)

embeddings_model = model.sparse.SpladeEmbeddingFunction(
    model_name="ibm-granite/granite-embedding-30m-sparse", 
    device="cpu",
    batch_size=2,
    k_tokens_query=50,
    k_tokens_document=192
)

# Prepare the documents to insert
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

# SpladeEmbeddingFunction.encode_documents returns a sparse matrix or a sparse
# array depending on the milvus-model version; reshape(1, -1) ensures the
# correct format for insertion
doc_vector = [{"embeddings": doc_emb.reshape(1, -1), "id": f"item_{i}"} for i, doc_emb in enumerate(embeddings_model.encode_documents(docs))]


client.insert(
    collection_name="my_sparse_collection",
    data=doc_vector
)

# Prepare the search parameters
search_params = {
    "params": {"drop_ratio_search": 0.2},  # additional optional search parameters
}

# Prepare the query vectors

queries = [
    "When was artificial intelligence founded",
    "Where was Turing born?"
]
query_vector = embeddings_model.encode_queries(queries)

res = client.search(
    collection_name="my_sparse_collection",
    data=query_vector,
    limit=1,  # top-k documents to return
    output_fields=["id"],
    search_params=search_params,
)

for r in res:
    print(r)
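The drop_ratio_build and drop_ratio_search parameters above tell Milvus to ignore a fraction of the smallest-magnitude weights when building the index and when searching, trading a little accuracy for speed. A minimal sketch of the idea (the exact pruning strategy inside Milvus may differ):

```python
import math

def drop_smallest(vec: dict, drop_ratio: float) -> dict:
    """Drop the lowest-weighted fraction of entries from a sparse vector."""
    n_drop = math.floor(len(vec) * drop_ratio)
    if n_drop == 0:
        return dict(vec)
    # Sort by weight, descending, and keep only the largest entries.
    kept = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(kept[: len(vec) - n_drop])

v = {"ai": 2.8, "intelligence": 2.3, "founded": 1.3, "robot": 0.3, "the": 0.1}
pruned = drop_smallest(v, 0.2)  # drops the single smallest entry ("the")
```

With drop_ratio 0.2 and five entries, only the weakest token is discarded, so the ranking signal carried by the dominant tokens is preserved.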

Usage with Sentence Transformers:

First, install the Sentence Transformers library:

pip install -U sentence-transformers

You can then load the model and run inference:


from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("ibm-granite/granite-embedding-30m-sparse")

# Run inference
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = model.encode_document(docs, max_active_dims=192)
print(docs_embeddings.shape)
# [3, 50265]

queries = ["When was artificial intelligence founded", "Where was Turing born?"]
queries_embeddings = model.encode_query(queries, max_active_dims=50)
print(queries_embeddings.shape)
# [2, 50265]

# Get the similarity scores for the embeddings
similarities = model.similarity(queries_embeddings, docs_embeddings)
print(similarities.shape)
# [2, 3]

for i, query in enumerate(queries):
    best_doc_index = similarities[i].argmax().item()

    print(f"Query: {query}")
    print(f"Best doc associate: Similarity: {similarities[i][best_doc_index]:.4f}, Doc: {docs[best_doc_index]}")
    intersection = model.intersection(queries_embeddings[i], docs_embeddings[best_doc_index])
    decoded_intersection = model.decode(intersection, top_k=10)
    print("Top 10 tokens influencing the similarity:")
    for token, score in decoded_intersection:
        print(f"Token: {token}, Score: {score:.4f}")

# Query: When was artificial intelligence founded
# Best doc associate: Similarity: 12.3641, Doc: Artificial intelligence was founded as an academic discipline in 1956.
# Top 10 tokens influencing the similarity:
# Token: ĠAI, Score: 2.7591
# Token: Ġintelligence, Score: 2.2971
# Token: Ġartificial, Score: 1.7654
# Token: Ġfounded, Score: 1.3254
# Token: Ġinvention, Score: 0.9808
# Token: Ġlearning, Score: 0.4847
# Token: Ġcomputer, Score: 0.4789
# Token: Ġrobot, Score: 0.3466
# Token: Ġestablishment, Score: 0.3371
# Token: Ġscientific, Score: 0.2804
# Query: Where was Turing born?
# Best doc associate: Similarity: 17.1359, Doc: Born in Maida Vale, London, Turing was raised in southern England.
# Top 10 tokens influencing the similarity:
# Token: uring, Score: 2.9761
# Token: ĠTuring, Score: 2.4544
# Token: Ġborn, Score: 2.4314
# Token: ing, Score: 1.7760
# Token: ure, Score: 1.7626
# Token: Ġcomput, Score: 1.3356
# Token: Ġraised, Score: 1.3285
# Token: able, Score: 1.1940
# Token: Ġphilosopher, Score: 0.4118
# Token: Ġmachine, Score: 0.3977

Evaluation:

Granite-Embedding-30m-Sparse is competitive with naver/splade-v3-distilbert despite having half as many parameters. We also compare the sparse model with its similarly sized dense counterpart, ibm-granite/granite-embedding-30m-english. The table below reports the models' performance on MTEB Retrieval (i.e., BEIR). For consistency with the results reported for naver/splade-v3-distilbert, CQADupstack and MS-MARCO are excluded from the table.

Model                          Parameters (M)   Vocab Size   BEIR Retrieval (13)
naver/splade-v3-distilbert     67               30522        50.0
granite-embedding-30m-english  30               50265        50.6
granite-embedding-30m-sparse   30               50265        50.8

Model Architecture: Granite-Embedding-30m-Sparse is based on an encoder-only, RoBERTa-like transformer architecture, trained internally at IBM Research.

Model                      granite-embedding-30m-sparse
Embedding size             384
Number of layers           6
Number of attention heads  12
Intermediate size          1536
Activation function        GeLU
Vocabulary size            50265
Max. sequence length       512
Parameters                 30M
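The parameter count in the last row can be roughly reproduced from the other rows. The arithmetic below is a sanity check under RoBERTa-style assumptions (e.g. 514 position embeddings), not the exact checkpoint layout:

```python
vocab, hidden, layers, ffn, max_pos = 50265, 384, 6, 1536, 514

embeddings = vocab * hidden + max_pos * hidden  # token + position embeddings
per_layer = (
    4 * (hidden * hidden + hidden)                    # Q, K, V, output projections
    + (hidden * ffn + ffn) + (ffn * hidden + hidden)  # feed-forward block
    + 2 * 2 * hidden                                  # two LayerNorms (weight + bias)
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # close to the 30M in the table
```

Note that roughly two-thirds of the parameters sit in the embedding matrix, a consequence of the large 50265-token vocabulary relative to the small hidden size.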

Training Data: Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:

Dataset  Num. Pairs
SPECTER citation triplets 684,100
Stack Exchange Duplicate questions (titles) 304,525
Stack Exchange Duplicate questions (bodies) 250,519
Stack Exchange Duplicate questions (titles+bodies) 250,460
Natural Questions (NQ) 100,231
SQuAD2.0 87,599
PAQ (Question, Answer) pairs 64,371,441
Stack Exchange (Title, Answer) pairs 4,067,139
Stack Exchange (Title, Body) pairs 23,978,013
Stack Exchange (Title+Body, Answer) pairs 187,195
S2ORC Citation pairs (Titles) 52,603,982
S2ORC (Title, Abstract) 41,769,185
S2ORC (Citations, abstracts) 52,603,982
WikiAnswers Duplicate question pairs 77,427,422
SearchQA 582,261
HotpotQA 85,000
Fever 109,810
Arxiv 2,358,545
Wikipedia 20,745,403
PubMed 20,000,000
Miracl En Pairs 9,016
DBPedia Title-Body Pairs 4,635,922
Synthetic: Query-Wikipedia Passage 1,879,093
Synthetic: Fact Verification 9,888
IBM Internal Triples 40,290
IBM Internal Title-Body Pairs 1,524,586

Notably, we did not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license.

Infrastructure: We trained the Granite embedding models using IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training the models over multiple GPUs.

Ethical Considerations and Limitations: The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-Embedding-30m-Sparse is trained only for English texts, and has a context length of 512 tokens (longer texts will be truncated to this size).
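Because inputs beyond 512 tokens are silently truncated, long documents are often split into overlapping chunks before encoding. A hedged sketch, using whitespace-separated words as a stand-in for real tokenizer tokens:

```python
def chunk_words(text: str, max_len: int = 512, overlap: int = 64) -> list:
    """Split text into overlapping word windows so no chunk exceeds max_len.
    Whitespace words only approximate tokens; real tokenizers count differently."""
    words = text.split()
    step = max_len - overlap
    return [
        " ".join(words[i : i + max_len])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

chunks = chunk_words("word " * 1000)  # 1000 words -> 3 overlapping chunks
```

Each chunk can then be embedded separately and the scores aggregated (e.g. max over chunks) at search time.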

Resources

  • ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
  • 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
  • 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

Citation

@misc{awasthy2025graniteembeddingmodels,
      title={Granite Embedding Models}, 
      author={Parul Awasthy and Aashka Trivedi and Yulong Li and Mihaela Bornea and David Cox and Abraham Daniels and Martin Franz and Gabe Goodhart and Bhavani Iyer and Vishwajeet Kumar and Luis Lastras and Scott McCarley and Rudra Murthy and Vignesh P and Sara Rosenthal and Salim Roukos and Jaydeep Sen and Sukriti Sharma and Avirup Sil and Kate Soule and Arafat Sultan and Radu Florian},
      year={2025},
      eprint={2502.20204},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2502.20204}, 
}

raul3820/granite-30m-sparse-onnx

Author: raul3820

Tags: feature-extraction, sentence-transformers

Created: 2026-01-03 22:41:20+00:00

Updated: 2026-01-03 22:41:24+00:00


Files (15)

.gitattributes
1_SpladePooling/config.json
README.md
config.json
config_sentence_transformers.json
merges.txt
model.safetensors
modules.json
onnx/model_fp16.onnx ONNX
onnx/model_quantized.onnx ONNX
sentence_bert_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json