# SPLADE CoCondenser EnsembleDistil

A SPLADE model for passage retrieval. For more details, see:
- Paper: https://arxiv.org/abs/2205.04733
- Code: https://github.com/naver/splade
| Model | MRR@10 (MS MARCO dev) | R@1000 (MS MARCO dev) |
|---|---|---|
| splade-cocondenser-ensembledistil | 38.3 | 98.3 |
## Model Details

This is a SPLADE sparse-encoder model. It maps sentences and passages to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

### Model Description
- Model type: SPLADE sparse encoder
- Base model: Luyu/co-condenser-marco
- Maximum sequence length: 512 tokens (256 when reproducing the evaluation)
- Output dimensionality: 30522 dimensions
- Similarity function: dot product
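Because the similarity function is a plain dot product over vocabulary-sized sparse vectors, relevance comes only from the dimensions (vocabulary terms) that the query and the document both activate. A minimal sketch with toy indices and weights (all values here are invented for illustration):

```python
import numpy as np

# Toy sparse vectors in the 30522-dimensional vocabulary space.
dim = 30522
query = np.zeros(dim)
doc = np.zeros(dim)
query[[101, 2054, 5754]] = [0.9, 1.4, 2.1]  # a few active vocabulary dims
doc[[2054, 5754, 7592]] = [1.1, 1.8, 0.5]

# Relevance is the dot product over the shared non-zero dimensions:
# 1.4 * 1.1 + 2.1 * 1.8 = 5.32
score = float(query @ doc)
print(round(score, 2))  # 5.32
```

Only dimensions 2054 and 5754 overlap, so the score depends on just those two terms; this is what makes inverted-index retrieval efficient for SPLADE vectors.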
### Full Model Architecture

```
SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False}) with MLMTransformer model: BertForMaskedLM
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
```
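The SpladePooling module turns per-token MLM logits into a single sparse vector: following the SPLADE paper, each vocabulary dimension takes the maximum, over sequence positions, of a log-saturated ReLU of the logits. A minimal NumPy sketch of that pooling step (random logits stand in for real model output; shapes match the architecture above):

```python
import numpy as np

# Hypothetical sequence length; the vocabulary size matches the model's
# 30522-dimensional output space.
seq_len, vocab_size = 8, 30522
rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, vocab_size))  # per-token MLM logits

# SPLADE pooling: log(1 + ReLU(logits)), then max over the sequence axis.
activated = np.log1p(np.maximum(logits, 0.0))
sparse_vec = activated.max(axis=0)

print(sparse_vec.shape)  # (30522,)
```

The ReLU and log saturation keep every weight non-negative and dampen very large logits, which encourages the sparsity the model relies on.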
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Run inference
queries = ["what causes aging fast"]
documents = [
    "UV-A light, specifically, is what mainly causes tanning, skin aging, and cataracts, UV-B causes sunburn, skin aging and skin cancer, and UV-C is the strongest, and therefore most effective at killing microorganisms. Again – single words and multiple bullets.",
    "Answers from Ronald Petersen, M.D. Yes, Alzheimer's disease usually worsens slowly. But its speed of progression varies, depending on a person's genetic makeup, environmental factors, age at diagnosis and other medical conditions. Still, anyone diagnosed with Alzheimer's whose symptoms seem to be progressing quickly — or who experiences a sudden decline — should see his or her doctor.",
    "Bell's palsy and Extreme tiredness and Extreme fatigue (2 causes) Bell's palsy and Extreme tiredness and Hepatitis (2 causes) Bell's palsy and Extreme tiredness and Liver pain (2 causes) Bell's palsy and Extreme tiredness and Lymph node swelling in children (2 causes)",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Compute similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 9.9933, 10.8691,  3.4265]])
```
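Because each dimension of the embedding corresponds to one vocabulary token, a sparse embedding can also be read directly as a bag of weighted expansion terms. A toy sketch of that inspection step (the vocabulary and weights below are invented; the real model uses the 30522-entry BERT vocabulary):

```python
# Toy stand-ins for the vocabulary and a SPLADE embedding.
vocab = ["[PAD]", "aging", "causes", "skin", "sun", "fast"]
embedding = [0.0, 2.3, 1.1, 0.0, 0.7, 1.6]

# Keep only the active (non-zero) dimensions, highest weight first.
top_terms = sorted(
    ((vocab[i], w) for i, w in enumerate(embedding) if w > 0),
    key=lambda term: -term[1],
)
print(top_terms)  # [('aging', 2.3), ('fast', 1.6), ('causes', 1.1), ('sun', 0.7)]
```

This kind of readout is useful for debugging retrieval behavior, since it shows which terms the model expanded a query or passage into.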
## Citation

If you use our checkpoints, please cite our work:
```bibtex
@misc{https://doi.org/10.48550/arxiv.2205.04733,
  doi = {10.48550/ARXIV.2205.04733},
  url = {https://arxiv.org/abs/2205.04733},
  author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
  keywords = {Information Retrieval (cs.IR), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
```
---

Repository: xuanan2001/splade-cocondenser-ensembledistil-onnx
Author: xuanan2001
Tags: feature-extraction, sentence-transformers
Created: 2025-11-16 05:02:00+00:00
Updated: 2025-11-16 05:10:25+00:00