# opensearch-neural-sparse-encoding-doc-v3-distill-onnx

This model is an ONNX export of the original opensearch-neural-sparse-encoding-doc-v3-distill model, optimized for ONNX Runtime inference and compatible with transformers.js.
## Select the model

When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS). We benchmarked model performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v3 series outperforms the v1 and v2 series in search relevance, efficiency, and inference speed. The specific advantages and disadvantages may vary across datasets.
| Model | Inference-free retrieval | Model parameters | AVG NDCG@10 | AVG FLOPS |
|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 |  | 133M | 0.524 | 11.4 |
| opensearch-neural-sparse-encoding-v2-distill |  | 67M | 0.528 | 8.3 |
| opensearch-neural-sparse-encoding-doc-v1 | ✔️ | 133M | 0.490 | 2.3 |
| opensearch-neural-sparse-encoding-doc-v2-distill | ✔️ | 67M | 0.504 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v2-mini | ✔️ | 23M | 0.497 | 1.7 |
| opensearch-neural-sparse-encoding-doc-v3-distill | ✔️ | 67M | 0.517 | 1.8 |
| opensearch-neural-sparse-encoding-doc-v3-gte | ✔️ | 133M | 0.546 | 1.7 |
## Overview

- Paper: Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers
- Code: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. The non-zero dimension indices correspond to tokens in the vocabulary, and the weights indicate the importance of each token. The similarity score is the inner product of the query and document sparse vectors.
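Because both query and document are sparse vectors over the same vocabulary, the inner product reduces to a sum over the tokens that appear in both. A minimal sketch, using made-up token weights (not real model outputs):

```python
# Minimal sketch: inner product of two sparse vectors stored as
# {token: weight} dicts. The weights below are illustrative only.
def sparse_dot(query: dict, document: dict) -> float:
    # Only tokens present in both vectors contribute to the score.
    return sum(w * document[t] for t, w in query.items() if t in document)

query_vec = {"ny": 5.77, "weather": 4.57, "now": 3.59}
doc_vec = {"ny": 0.80, "weather": 0.97, "rainy": 1.20}

score = sparse_dot(query_vec, doc_vec)  # 5.77*0.80 + 4.57*0.97
```

Tokens that appear in only one of the two vectors ("now", "rainy") contribute nothing, which is what makes retrieval with an inverted index efficient.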
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, and scifact.
OpenSearch's neural sparse feature supports learned sparse retrieval with the Lucene inverted index: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
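For orientation, a `neural_sparse` search request body has roughly the shape below. The field name `passage_embedding` and the model id are hypothetical placeholders; consult the OpenSearch documentation linked above for the authoritative syntax and the inference-free query options.

```python
import json

# Sketch of an OpenSearch neural_sparse query body.
# "passage_embedding" (the sparse vector field) and the model id
# are placeholders, not values defined by this model card.
query_body = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "What's the weather in ny now?",
                "model_id": "<your-sparse-encoding-model-id>",
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```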
## Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers.sparse_encoder import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

query_embed = model.encode_query(query)
document_embed = model.encode_document(document)

sim = model.similarity(query_embed, document_embed)
print(f"Similarity: {sim}")
# Similarity: tensor([[11.1105]])

decoded_query = model.decode(query_embed)
decoded_document = model.decode(document_embed)

for i in range(len(decoded_query)):
    query_token, query_score = decoded_query[i]
    doc_score = next((score for token, score in decoded_document if token == query_token), 0)
    if doc_score != 0:
        print(f"Token: {query_token}, Query score: {query_score:.4f}, Document score: {doc_score:.4f}")

# Token: ny, Query score: 5.7729, Document score: 0.8049
# Token: weather, Query score: 4.5684, Document score: 0.9710
# Token: now, Query score: 3.5895, Document score: 0.4720
# Token: ?, Query score: 3.3313, Document score: 0.0286
# Token: what, Query score: 2.7699, Document score: 0.0787
# Token: in, Query score: 0.4989, Document score: 0.0417
```
## Usage (HuggingFace)

This model is designed to run inside an OpenSearch cluster, but you can also use it outside the cluster with the HuggingFace models API.
```python
import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    # note that we updated the activation function for the v3 models
    values = torch.log(1 + torch.log(1 + torch.relu(values)))
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# download the idf file from the model hub. idf is used to assign weights to query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill-onnx")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and the id_to_token transform for post-processing
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token


query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# get the similarity score
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)   # tensor(11.1105, grad_fn=<DotBackward0>)

query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_query_token_weight[token], token))

# result:
# score in query: 5.7729, score in document: 0.8049, token: ny
# score in query: 4.5684, score in document: 0.9710, token: weather
# score in query: 3.5895, score in document: 0.4720, token: now
# score in query: 3.3313, score in document: 0.0286, token: ?
# score in query: 2.7699, score in document: 0.0787, token: what
# score in query: 0.4989, score in document: 0.0417, token: in
```
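The v3 activation applied in `get_sparse_vector`, log(1 + log(1 + relu(x))), can be checked in isolation without torch. The scalar re-implementation below is for illustration only:

```python
import math

def v3_activation(x: float) -> float:
    # relu, then two nested log(1 + .) squashes, mirroring the v3
    # activation in get_sparse_vector above (scalar version).
    relu_x = max(x, 0.0)
    return math.log(1.0 + math.log(1.0 + relu_x))

# Negative activations are zeroed out entirely.
print(v3_activation(-2.0))  # 0.0
```

The double logarithm compresses large activations much harder than a single log, which keeps document term weights in a narrow range and the vectors sparse.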
The code above demonstrates neural sparse search. Although the original query and document share no overlapping tokens, the model still achieves a good match.
## Detailed search relevance

<div style="overflow-x: auto;">

| Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opensearch-neural-sparse-encoding-v1 | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
| opensearch-neural-sparse-encoding-v2-distill | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v1 | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
| opensearch-neural-sparse-encoding-doc-v2-distill | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| opensearch-neural-sparse-encoding-doc-v2-mini | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
| opensearch-neural-sparse-encoding-doc-v3-distill | 0.517 | 0.724 | 0.345 | 0.544 | 0.694 | 0.356 | 0.520 | 0.294 | 0.424 | 0.163 | 0.845 | 0.239 | 0.708 | 0.863 |
| opensearch-neural-sparse-encoding-doc-v3-gte | 0.546 | 0.734 | 0.360 | 0.582 | 0.716 | 0.407 | 0.520 | 0.389 | 0.455 | 0.167 | 0.860 | 0.312 | 0.725 | 0.873 |
</div>
## License

This project is licensed under the Apache v2.0 License.

## Copyright

Copyright OpenSearch Contributors. See NOTICE for details.
raul3820/opensearch-neural-sparse-encoding-doc-v3-distill-onnx, by raul3820. Created: 2026-01-04 10:04:01+00:00. Updated: 2026-01-04 10:04:03+00:00.