# opensearch-neural-sparse-encoding-doc-v3-distill-onnx

This model is an ONNX export of the original opensearch-neural-sparse-encoding-doc-v3-distill model, optimized for ONNX Runtime inference and compatible with transformers.js.
## Select the model

When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS). We benchmarked model performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v3 series outperforms the v1 and v2 series in search relevance, efficiency, and inference speed. The specific advantages and disadvantages may vary across datasets.
| Model | Inference-free retrieval | Model parameters | AVG NDCG@10 | AVG FLOPS |
|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 |  | 133M | 0.524 | 11.4 |
| opensearch-neural-sparse-encoding-v2-distill |  | 67M | 0.528 | 8.3 |
| opensearch-neural-sparse-encoding-doc-v1 | ✔️ | 133M | 0.490 | 2.3 |
| opensearch-neural-sparse-encoding-doc-v2-distill | ✔️ | 67M | 0.504 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v2-mini | ✔️ | 23M | 0.497 | 1.7 |
| opensearch-neural-sparse-encoding-doc-v3-distill | ✔️ | 67M | 0.517 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v3-gte | ✔️ | 133M | 0.546 | 1.7 |
## Overview

- Paper: Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers
- Code: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. The non-zero dimension indices correspond to tokens in the vocabulary, and the weights indicate the importance of each token. The similarity score is the inner product of the query and document sparse vectors.
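Because both query and document are sparse vectors over the same vocabulary, the inner product reduces to a sum over the tokens that appear in both. A minimal sketch, using made-up token weights (not real model outputs):

```python
# Minimal sketch: inner product of two sparse vectors stored as
# {token: weight} dicts. The weights below are illustrative only.
def sparse_dot(query: dict, document: dict) -> float:
    # Only tokens present in both vectors contribute to the score.
    return sum(w * document[t] for t, w in query.items() if t in document)

query_vec = {"ny": 5.77, "weather": 4.57, "now": 3.59}
doc_vec = {"ny": 0.80, "weather": 0.97, "rainy": 1.20}

score = sparse_dot(query_vec, doc_vec)  # 5.77*0.80 + 4.57*0.97
```

Tokens that appear in only one of the two vectors ("now", "rainy") contribute nothing, which is what makes retrieval with an inverted index efficient.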
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, and scifact.
OpenSearch's neural sparse feature supports learned sparse retrieval with the Lucene inverted index: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
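For orientation, a `neural_sparse` search request body has roughly the shape below. The field name `passage_embedding` and the model id are hypothetical placeholders; consult the OpenSearch documentation linked above for the authoritative syntax and the inference-free query options.

```python
import json

# Sketch of an OpenSearch neural_sparse query body.
# "passage_embedding" (the sparse vector field) and the model id
# are placeholders, not values defined by this model card.
query_body = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "What's the weather in ny now?",
                "model_id": "<your-sparse-encoding-model-id>",
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```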
## Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers.sparse_encoder import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

query_embed = model.encode_query(query)
document_embed = model.encode_document(document)

sim = model.similarity(query_embed, document_embed)
print(f"Similarity: {sim}")
# Similarity: tensor([[11.1105]])

decoded_query = model.decode(query_embed)
decoded_document = model.decode(document_embed)

for i in range(len(decoded_query)):
    query_token, query_score = decoded_query[i]
    doc_score = next((score for token, score in decoded_document if token == query_token), 0)
    if doc_score != 0:
        print(f"Token: {query_token}, Query score: {query_score:.4f}, Document score: {doc_score:.4f}")

# Token: ny, Query score: 5.7729, Document score: 0.8049
# Token: weather, Query score: 4.5684, Document score: 0.9710
# Token: now, Query score: 3.5895, Document score: 0.4720
# Token: ?, Query score: 3.3313, Document score: 0.0286
# Token: what, Query score: 2.7699, Document score: 0.0787
# Token: in, Query score: 0.4989, Document score: 0.0417
```
## Usage (HuggingFace)

This model is designed to run inside an OpenSearch cluster, but you can also use it outside the cluster with the HuggingFace models API.
```python
import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    # note that we updated the activation function for the v3 models
    values = torch.log(1 + torch.log(1 + torch.relu(values)))
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# download the idf file from the model hub. idf is used to assign weights to query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and the id_to_token transform for post-processing
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token


query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# get the similarity score
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)   # tensor(11.1105, grad_fn=<DotBackward0>)

query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_query_token_weight[token], token))

# result:
# score in query: 5.7729, score in document: 0.8049, token: ny
# score in query: 4.5684, score in document: 0.9710, token: weather
# score in query: 3.5895, score in document: 0.4720, token: now
# score in query: 3.3313, score in document: 0.0286, token: ?
# score in query: 2.7699, score in document: 0.0787, token: what
# score in query: 0.4989, score in document: 0.0417, token: in
```
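The v3 activation applied in `get_sparse_vector`, log(1 + log(1 + relu(x))), can be checked in isolation without torch. The scalar re-implementation below is for illustration only:

```python
import math

def v3_activation(x: float) -> float:
    # relu, then two nested log(1 + .) squashes, mirroring the v3
    # activation in get_sparse_vector above (scalar version).
    relu_x = max(x, 0.0)
    return math.log(1.0 + math.log(1.0 + relu_x))

# Negative activations are zeroed out entirely.
print(v3_activation(-2.0))  # 0.0
```

The double logarithm compresses large activations much harder than a single log, which keeps document term weights in a narrow range and the vectors sparse.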
The code above demonstrates neural sparse search. Although the original query and document share no overlapping tokens, the model still achieves a good match.
## Detailed search relevance

<div style="overflow-x: auto;">

| Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
| opensearch-neural-sparse-encoding-v2-distill | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v1 | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
| opensearch-neural-sparse-encoding-doc-v2-distill | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| opensearch-neural-sparse-encoding-doc-v2-mini | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| opensearch-neural-sparse-encoding-doc-v3-distill | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v3-gte | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |
</div>
## License

This project is licensed under the Apache v2.0 License.

## Copyright

Copyright OpenSearch Contributors. See NOTICE for details.
raul3820/opensearch-neural-sparse-encoding-doc-v3-distill-onnx, by raul3820. Created: 2026-01-04 10:04:01+00:00. Updated: 2026-01-04 10:04:03+00:00.