ONNX Model Library

Documentation

opensearch-neural-sparse-encoding-doc-v3-distill-onnx

This model is an ONNX export of the original opensearch-neural-sparse-encoding-doc-v3-distill model, optimized for ONNX Runtime inference and compatible with transformers.js.

Selecting a model

When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS). We benchmarked model performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v3 series outperforms the v1 and v2 series in search relevance, efficiency, and inference speed. The exact advantages and disadvantages may vary across datasets.

| Model | Inference-free retrieval | Model parameters | AVG NDCG@10 | AVG FLOPS |
|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 | | 133M | 0.524 | 11.4 |
| opensearch-neural-sparse-encoding-v2-distill | | 67M | 0.528 | 8.3 |
| opensearch-neural-sparse-encoding-doc-v1 | ✔️ | 133M | 0.490 | 2.3 |
| opensearch-neural-sparse-encoding-doc-v2-distill | ✔️ | 67M | 0.504 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v2-mini | ✔️ | 23M | 0.497 | 1.7 |
| opensearch-neural-sparse-encoding-doc-v3-distill | ✔️ | 67M | 0.517 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v3-gte | ✔️ | 133M | 0.546 | 1.7 |

Overview

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. The index of each non-zero dimension corresponds to a token in the vocabulary, and the weight reflects that token's importance. The similarity score is the inner product of the query and document sparse vectors.
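The scoring scheme above can be sketched with plain Python dicts mapping tokens to weights. The tokens and weights below are made up for illustration, not real model output:

```python
def sparse_dot(query_vec, doc_vec):
    """Inner product over the tokens both sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Hypothetical sparse vectors: non-zero dimensions as {token: weight}
query_vec = {"weather": 4.6, "ny": 5.8, "now": 3.6}
doc_vec = {"weather": 0.97, "ny": 0.80, "rain": 1.2}

score = sparse_dot(query_vec, doc_vec)
print(round(score, 3))  # 9.102
```

Only dimensions that are non-zero in both vectors contribute to the score, which is what makes the representation compatible with an inverted index.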

The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, and scifact.

OpenSearch's neural sparse feature supports learned sparse retrieval on the Lucene inverted index: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
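For orientation, a `neural_sparse` query body in the OpenSearch query DSL (linked above) looks roughly like the sketch below. The field name `passage_embedding` and the model ID are placeholders for values from your own index and deployed model:

```python
import json

# Hedged sketch of an OpenSearch neural_sparse search body.
# "passage_embedding" (the rank_features field) and "model_id"
# are hypothetical placeholders, not values from this model card.
query_body = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "What's the weather in ny now?",
                "model_id": "<your-deployed-model-id>",
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```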

Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load the model and run inference:

from sentence_transformers.sparse_encoder import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

query_embed = model.encode_query(query)
document_embed = model.encode_document(document)

sim = model.similarity(query_embed, document_embed)
print(f"Similarity: {sim}")
# Similarity: tensor([[11.1105]])

decoded_query = model.decode(query_embed)
decoded_document = model.decode(document_embed)

for i in range(len(decoded_query)):
    query_token, query_score = decoded_query[i]
    doc_score = next((score for token, score in decoded_document if token == query_token), 0)
    if doc_score != 0:
        print(f"Token: {query_token}, Query score: {query_score:.4f}, Document score: {doc_score:.4f}")

# Token: ny, Query score: 5.7729, Document score: 0.8049
# Token: weather, Query score: 4.5684, Document score: 0.9710
# Token: now, Query score: 3.5895, Document score: 0.4720
# Token: ?, Query score: 3.3313, Document score: 0.0286
# Token: what, Query score: 2.7699, Document score: 0.0787
# Token: in, Query score: 0.4989, Document score: 0.0417

Usage (HuggingFace)

This model is designed to run inside an OpenSearch cluster, but you can also use it outside the cluster via the HuggingFace model API.

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get a sparse vector from the dense output of shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # note that we updated the activation function for the v3 models
    values = torch.log(1 + torch.log(1 + torch.relu(values)))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector into a {token: weight} dict
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from the model hub; idf provides the weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
idf = get_tokenizer_idf(tokenizer)

# set the special token ids and the id_to_token mapping for post-processing
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get the similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(11.1105, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# results:
# score in query: 5.7729, score in document: 0.8049, token: ny
# score in query: 4.5684, score in document: 0.9710, token: weather
# score in query: 3.5895, score in document: 0.4720, token: now
# score in query: 3.3313, score in document: 0.0286, token: ?
# score in query: 2.7699, score in document: 0.0787, token: what
# score in query: 0.4989, score in document: 0.0417, token: in

The code above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
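As a sanity check, the per-token products printed above should add up to the reported inner-product score, up to the rounding of the printed weights:

```python
# (query weight, document weight) pairs copied from the output above
pairs = [
    (5.7729, 0.8049),  # ny
    (4.5684, 0.9710),  # weather
    (3.5895, 0.4720),  # now
    (3.3313, 0.0286),  # ?
    (2.7699, 0.0787),  # what
    (0.4989, 0.0417),  # in
]

# inner product = sum of elementwise products over shared tokens
score = sum(q * d for q, d in pairs)
print(round(score, 4))  # 11.1108, matching tensor(11.1105) up to rounding
```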

Detailed search relevance

<div style="overflow-x: auto;">

| Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
| opensearch-neural-sparse-encoding-v2-distill | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v1 | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
| opensearch-neural-sparse-encoding-doc-v2-distill | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| opensearch-neural-sparse-encoding-doc-v2-mini | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| opensearch-neural-sparse-encoding-doc-v3-distill | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v3-gte | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |

</div>

License

This project is licensed under the Apache v2.0 License.

Copyright

Copyright OpenSearch Contributors. See NOTICE for details.

raul3820/opensearch-neural-sparse-encoding-doc-v3-distill-onnx

Author: raul3820

Tags: feature-extraction, sentence-transformers

Created: 2026-01-04 10:04:01+00:00

Updated: 2026-01-04 10:04:03+00:00


Files (20)

.gitattributes
README.md
config.json
config_sentence_transformers.json
document_0_MLMTransformer/config.json
document_0_MLMTransformer/onnx/model.onnx
document_0_MLMTransformer/sentence_bert_config.json
document_0_MLMTransformer/special_tokens_map.json
document_0_MLMTransformer/tokenizer.json
document_0_MLMTransformer/tokenizer_config.json
document_0_MLMTransformer/vocab.txt
document_1_SpladePooling/config.json
modules.json
query_0_SparseStaticEmbedding/config.json
query_0_SparseStaticEmbedding/model.safetensors
query_0_SparseStaticEmbedding/special_tokens_map.json
query_0_SparseStaticEmbedding/tokenizer.json
query_0_SparseStaticEmbedding/tokenizer_config.json
query_0_SparseStaticEmbedding/vocab.txt
router_config.json