README

license: apache-2.0
language:
- en
datasets:
- ms_marco
tags:
- splade++
- document-expansion
- sparse representation
- bag-of-words
- passage-retrieval
- knowledge-distillation
- document encoder
pretty_name: Independent Implementation of SPLADE++ Model with some efficiency tweaks for Industry setting.
library_name: transformers
pipeline_tag: fill-mask

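Independent Implementation of SPLADE++ Model (`a.k.a splade-cocondenser* and family`) for the Industry setting.

This work stands on the shoulders of two robust research efforts: Naver's paper "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" and Google's SparseEmbed. Thanks to both teams for their excellent work.

This is the second version in the series. Try V1 here: prithivida/Splade_PP_en_v1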

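1. What are sparse representations and why learn them?

Beginner? Expand this section. Already an expert in sparse and dense representations? Feel free to skip ahead to section 2.

1. Lexical search:

Lexical search with BOW (bag-of-words) sparse vectors is a strong baseline, but it famously suffers from the vocabulary mismatch problem because it can only do exact term matching. Pros and cons:

✅ Efficient and cheap.
✅ No need to fine-tune models.
✅️ Interpretable.
✅️ Exact term matches.
❌ Vocabulary mismatch (you need to remember the exact terms)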
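
To make the vocabulary mismatch concrete, here is a minimal sketch of exact-match BOW scoring. The function name and the toy document are ours, chosen for illustration; real lexical engines such as BM25 add term weighting on top, but the exact-match limitation is the same.

```python
def bow_overlap_score(query: str, document: str) -> int:
    """Count query terms that appear verbatim in the document (exact match only)."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)


# Hypothetical toy document, not taken from any dataset.
doc = "thermal stress can crack a glass pan"

# Both query terms occur verbatim in the document.
print(bow_overlap_score("glass crack", doc))

# Related meaning, but no shared surface forms, so nothing matches.
print(bow_overlap_score("windshield fracture", doc))
```

The second query is topically close to the document yet scores zero, which is the gap learned sparse expansion tries to close.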

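2. Semantic search:

Learned neural/dense retrievers (DPR, Sentence transformers*, BGE* models) combined with approximate nearest neighbour search have shown impressive results. Pros and cons:

✅ Search the way humans innately think.
✅ When fine-tuned, beats sparse methods by a long way.
✅ Works easily with multiple modalities.
❌ Suffers from token amnesia (misses term matches).
❌ Resource intensive (both indexing and retrieval).
❌ Famously hard to interpret.
❌ Needs fine-tuning for OOD (out-of-domain) data.

3. The big idea:

Getting the pros of both searches made sense, and that gave rise to interest in learning sparse representations for queries and documents with some interpretability. The sparse representations also double as implicit or explicit (latent, contextualized) expansion mechanisms for both queries and documents. If you are new to query expansion, you can learn more from the master himself, Daniel Tunkelang.

4. What does a sparse model learn?

The model learns to project its learned dense representations over an MLM head to give a vocabulary distribution, which means the model can do automatic token expansion. (Image courtesy of Pinecone)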
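
The pooling applied in the full code of section 6c can be written compactly as follows. The notation is ours, not the source's: $w_{ij}$ is the MLM logit for vocabulary term $j$ at input position $i$, $m_i \in \{0, 1\}$ is the attention mask, and $w_j$ is the final weight of term $j$ in the sparse vector.

```latex
w_j \;=\; \max_{i} \; m_i \cdot \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr)
```

Terms that never receive a positive logit at any position end up with $w_j = 0$, which is what keeps the representation sparse.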

</details>

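Skip to "How to use with popular vector databases and more" or continue for more details.

2. Motivation:

SPLADE models strike a fine balance between retrieval effectiveness (quality) and retrieval efficiency (latency and cost). With that in mind, we made very minor retrieval-efficiency tweaks to make the model more suitable for an industry setting. (Pure MLE folks should not conflate efficiency with model inference efficiency. Our main focus is retrieval efficiency. Hereafter, efficiency is shorthand for retrieval efficiency unless explicitly qualified otherwise. Not that inference efficiency is unimportant; we will address that subsequently.)

TL;DR of our attempt and results

FLOPS tuning: Separate sequence lengths and a severely restricted FLOPs schedule and token budget, doc(128) & query(24), NOT the 256 of the official SPLADE++. Inspired by SparseEmbed.
Init weights: Middle-trained bert-base-uncased with MLM loss. Some corpus awareness, like the official splade++ / ColBERT.
Yet achieves competitive effectiveness of MRR@10 37.8 on ID data (& OOD 49.4) and a retrieval latency of 48.81 ms (multi-threaded), all on consumer-grade GPUs with only 5 negatives per query.
For the industry setting: Effectiveness on custom domains needs more than just trading FLOPS for tiny gains, and the premise "SPLADE++ is not well suited to mono-CPU retrieval" does not hold.
Owing to query-time inference latency, we still need 2 models, one for queries and one for documents. This is a doc model; a query model will be released soon.

Note: The paper refers to the best-performing models as SPLADE++, hence for consistency we are reusing the same name.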

<br/>

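3. Why is FLOPS one of the key metrics for the industry setting?

While only an empirical analysis on a large sample makes sense, here is a spot check, a qualitative example to give you an idea. Our models achieve comparably competitive effectiveness with ~4% and ~48% fewer tokens than comparable SPLADE++ models, including the SoTA. (We will show quantitative results in the next section.)

So "how to beat the SoTA MRR?" was never our design goal; rather, it was "at what cost can we achieve an acceptable effectiveness, i.e. MRR@10?". Arbitrarily lowering the lambda values (λQ, λD, see the table above) will achieve a better MRR. But lower lambda values = higher FLOPS = more tokens = poorer efficiency. This is not desirable for an industry setting.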
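
As a rough illustration of what the lambda values scale, here is a minimal pure-Python sketch of the FLOPS regularizer as defined in the SPLADE papers: the sum over vocabulary terms of the squared mean activation across a batch. This README does not show its training code, so treat this as an assumption about the objective; the function name and toy batches are ours.

```python
def flops_loss(batch_weights: list[list[float]]) -> float:
    """Sum over vocabulary terms of the squared mean activation in the batch.

    batch_weights[i][j] is the non-negative weight of vocabulary term j
    in the sparse representation of example i.
    """
    n = len(batch_weights)
    vocab_size = len(batch_weights[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(row[j] for row in batch_weights) / n
        total += mean_activation ** 2
    return total


# Hypothetical toy batches over a 4-term vocabulary, same total mass per example.
dense_batch = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
sparse_batch = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]]

print(flops_loss(dense_batch))
print(flops_loss(sparse_batch))
```

In training, a term of this form would typically be added to the ranking loss once for queries (scaled by λQ) and once for documents (scaled by λD), so a larger lambda pushes harder toward fewer and more evenly spread active terms.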

Our model

number of actual dimensions:  121
SPLADE BOW rep:
 [('stress', 2.42), ('thermal', 2.31), ('glass', 2.27), ('pan', 1.78), ('heat', 1.66), ('glasses', 1.58), ('crack', 1.42), ('anxiety', 1.36), ('break', 1.31), ('window', 0.91), ('heating', 0.84), ('hot', 0.82), ('adjacent', 0.82), ('hotter', 0.82), ('if', 0.75), ('cause', 0.7), ('caused', 0.7), ('create', 0.7), ('factors', 0.69), ('created', 0.68), ('cracks', 0.67), ('breaks', 0.67), ('area', 0.66), ('##glass', 0.66), ('cracked', 0.63), ('areas', 0.6), ('cracking', 0.59), ('windows', 0.58), ('effect', 0.56), ('causes', 0.56), ('ruin', 0.54), ('severe', 0.54), ('too', 0.53), ('flame', 0.5), ('collapse', 0.49), ('stresses', 0.49), ('or', 0.48), ('physics', 0.47), ('temperature', 0.46), ('get', 0.46), ('heated', 0.45), ('problem', 0.45), ('energy', 0.44), ('hottest', 0.42), ('phenomenon', 0.42), ('sweating', 0.41), ('insulation', 0.39), ('level', 0.39), ('warm', 0.39), ('governed', 0.38), ('formation', 0.37), ('failure', 0.35), ('frank', 0.34), ('cooling', 0.32), ('fracture', 0.31), ('because', 0.31), ('crystal', 0.31), ('determined', 0.31), ('boiler', 0.31), ('mechanical', 0.3), ('shatter', 0.29), ('friction', 0.29), ('levels', 0.29), ('cold', 0.29), ('will', 0.29), ('ceramics', 0.29), ('factor', 0.28), ('crash', 0.28), ('reaction', 0.28), ('fatigue', 0.28), ('hazard', 0.27), ('##e', 0.26), ('anger', 0.26), ('bubble', 0.25), ('process', 0.24), ('cleaning', 0.23), ('surrounding', 0.22), ('theory', 0.22), ('sash', 0.22), ('distraction', 0.21), ('adjoining', 0.19), ('environmental', 0.19), ('ross', 0.18), ('formed', 0.17), ('broken', 0.16), ('affect', 0.16), ('##pan', 0.15), ('graphic', 0.14), ('damage', 0.14), ('bubbles', 0.13), ('windshield', 0.13), ('temporal', 0.13), ('roof', 0.12), ('strain', 0.12), ('clear', 0.09), ('ceramic', 0.08), ('stressed', 0.08), ('##uation', 0.08), ('cool', 0.08), ('expand', 0.07), ('storm', 0.07), ('shock', 0.07), ('psychological', 0.06), ('breaking', 0.06), ('##es', 0.06), ('melting', 0.05), ('burst', 0.05), ('sensing', 0.04), ('heats', 
0.04), ('error', 0.03), ('weather', 0.03), ('drink', 0.03), ('fire', 0.03), ('vibration', 0.02), ('induced', 0.02), ('warmer', 0.02), ('leak', 0.02), ('fog', 0.02), ('safety', 0.01), ('surface', 0.01), ('##thermal', 0.0)]

naver/splade-cocondenser-ensembledistil (SoTA, ~4% more tokens + FLOPS = 1.85)

number of actual dimensions:  126
SPLADE BOW rep:
 [('stress', 2.25), ('glass', 2.23), ('thermal', 2.18), ('glasses', 1.65), ('pan', 1.62), ('heat', 1.56), ('stressed', 1.42), ('crack', 1.31), ('break', 1.12), ('cracked', 1.1), ('hot', 0.93), ('created', 0.9), ('factors', 0.81), ('broken', 0.73), ('caused', 0.71), ('too', 0.71), ('hotter', 0.65), ('governed', 0.61), ('heating', 0.59), ('temperature', 0.59), ('adjacent', 0.59), ('cause', 0.58), ('effect', 0.57), ('fracture', 0.56), ('bradford', 0.55), ('strain', 0.53), ('hammer', 0.51), ('brian', 0.48), ('error', 0.47), ('windows', 0.45), ('will', 0.45), ('reaction', 0.42), ('create', 0.42), ('windshield', 0.41), ('heated', 0.41), ('factor', 0.4), ('cracking', 0.39), ('failure', 0.38), ('mechanical', 0.38), ('when', 0.38), ('formed', 0.38), ('bolt', 0.38), ('mechanism', 0.37), ('warm', 0.37), ('areas', 0.36), ('area', 0.36), ('energy', 0.34), ('disorder', 0.33), ('barry', 0.33), ('shock', 0.32), ('determined', 0.32), ('gage', 0.32), ('sash', 0.31), ('theory', 0.31), ('level', 0.31), ('resistant', 0.31), ('brake', 0.3), ('window', 0.3), ('crash', 0.3), ('hazard', 0.29), ('##ink', 0.27), ('ceramic', 0.27), ('storm', 0.25), ('problem', 0.25), ('issue', 0.24), ('impact', 0.24), ('fridge', 0.24), ('injury', 0.23), ('ross', 0.22), ('causes', 0.22), ('affect', 0.21), ('pressure', 0.21), ('fatigue', 0.21), ('leak', 0.21), ('eye', 0.2), ('frank', 0.2), ('cool', 0.2), ('might', 0.19), ('gravity', 0.18), ('ray', 0.18), ('static', 0.18), ('collapse', 0.18), ('physics', 0.18), ('wave', 0.18), ('reflection', 0.17), ('parker', 0.17), ('strike', 0.17), ('hottest', 0.17), ('burst', 0.16), ('chance', 0.16), ('burn', 0.14), ('rubbing', 0.14), ('interference', 0.14), ('bailey', 0.13), ('vibration', 0.12), ('gilbert', 0.12), ('produced', 0.12), ('rock', 0.12), ('warmer', 0.11), ('get', 0.11), ('drink', 0.11), ('fireplace', 0.11), ('ruin', 0.1), ('brittle', 0.1), ('fragment', 0.1), ('stumble', 0.09), ('formation', 0.09), ('shatter', 0.08), ('great', 0.08), ('friction', 0.08), ('flash', 
0.07), ('cracks', 0.07), ('levels', 0.07), ('smash', 0.04), ('fail', 0.04), ('fra', 0.04), ('##glass', 0.03), ('variables', 0.03), ('because', 0.02), ('knock', 0.02), ('sun', 0.02), ('crush', 0.01), ('##e', 0.01), ('anger', 0.01)]

naver/splade-v2-distil (~48% more tokens + FLOPS = 3.82)

number of actual dimensions:  234
SPLADE BOW rep:
 [('glass', 2.55), ('stress', 2.39), ('thermal', 2.38), ('glasses', 1.95), ('stressed', 1.87), ('crack', 1.84), ('cool', 1.78), ('heat', 1.62), ('pan', 1.6), ('break', 1.53), ('adjacent', 1.44), ('hotter', 1.43), ('strain', 1.21), ('area', 1.16), ('adjoining', 1.14), ('heated', 1.11), ('window', 1.07), ('stresses', 1.04), ('hot', 1.03), ('created', 1.03), ('create', 1.03), ('cause', 1.02), ('factors', 1.02), ('cooler', 1.01), ('broken', 1.0), ('too', 0.99), ('fracture', 0.96), ('collapse', 0.96), ('cracking', 0.95), ('great', 0.93), ('happen', 0.93), ('windows', 0.89), ('broke', 0.87), ('##e', 0.87), ('pressure', 0.84), ('hottest', 0.84), ('breaking', 0.83), ('govern', 0.79), ('shatter', 0.76), ('level', 0.75), ('heating', 0.69), ('temperature', 0.69), ('cracked', 0.69), ('panel', 0.68), ('##glass', 0.68), ('ceramic', 0.67), ('sash', 0.66), ('warm', 0.66), ('areas', 0.64), ('creating', 0.63), ('will', 0.62), ('tension', 0.61), ('cracks', 0.61), ('optical', 0.6), ('mechanism', 0.58), ('kelly', 0.58), ('determined', 0.58), ('generate', 0.58), ('causes', 0.56), ('if', 0.56), ('factor', 0.56), ('the', 0.56), ('chemical', 0.55), ('governed', 0.55), ('crystal', 0.55), ('strike', 0.55), ('microsoft', 0.54), ('creates', 0.53), ('than', 0.53), ('relation', 0.53), ('glazed', 0.52), ('compression', 0.51), ('painting', 0.51), ('governing', 0.5), ('harden', 0.49), ('solar', 0.48), ('reflection', 0.48), ('ic', 0.46), ('split', 0.45), ('mirror', 0.44), ('damage', 0.43), ('ring', 0.42), ('formation', 0.42), ('wall', 0.41), ('burst', 0.4), ('radiant', 0.4), ('determine', 0.4), ('one', 0.4), ('plastic', 0.39), ('furnace', 0.39), ('difference', 0.39), ('melt', 0.39), ('get', 0.39), ('contract', 0.38), ('forces', 0.38), ('gets', 0.38), ('produce', 0.38), ('surrounding', 0.37), ('vibration', 0.37), ('tile', 0.37), ('fail', 0.36), ('warmer', 0.36), ('rock', 0.35), ('fault', 0.35), ('roof', 0.34), ('burned', 0.34), ('physics', 0.33), ('welding', 0.33), ('why', 0.33), ('a', 0.32), ('pop', 
0.32), ('and', 0.31), ('fra', 0.3), ('stat', 0.3), ('withstand', 0.3), ('sunglasses', 0.3), ('material', 0.29), ('ice', 0.29), ('generated', 0.29), ('matter', 0.29), ('frame', 0.28), ('elements', 0.28), ('then', 0.28), ('.', 0.28), ('pont', 0.28), ('blow', 0.28), ('snap', 0.27), ('metal', 0.26), ('effect', 0.26), ('reaction', 0.26), ('related', 0.25), ('aluminium', 0.25), ('neighboring', 0.25), ('weight', 0.25), ('steel', 0.25), ('bulb', 0.25), ('tear', 0.25), ('coating', 0.25), ('plumbing', 0.25), ('co', 0.25), ('microwave', 0.24), ('formed', 0.24), ('pipe', 0.23), ('drink', 0.23), ('chemistry', 0.23), ('energy', 0.22), ('reflect', 0.22), ('dynamic', 0.22), ('leak', 0.22), ('is', 0.22), ('lens', 0.21), ('frost', 0.21), ('lenses', 0.21), ('produced', 0.21), ('induced', 0.2), ('arise', 0.2), ('plate', 0.2), ('equations', 0.19), ('affect', 0.19), ('tired', 0.19), ('mirrors', 0.18), ('thickness', 0.18), ('bending', 0.18), ('cabinet', 0.17), ('apart', 0.17), ('##thermal', 0.17), ('gas', 0.17), ('equation', 0.17), ('relationship', 0.17), ('composition', 0.17), ('engineering', 0.17), ('block', 0.16), ('breaks', 0.16), ('when', 0.16), ('definition', 0.16), ('collapsed', 0.16), ('generation', 0.16), (',', 0.16), ('philips', 0.16), ('later', 0.15), ('wood', 0.15), ('neighbouring', 0.15), ('structural', 0.14), ('regulate', 0.14), ('neighbors', 0.13), ('lighting', 0.13), ('happens', 0.13), ('more', 0.13), ('property', 0.13), ('cooling', 0.12), ('shattering', 0.12), ('melting', 0.12), ('how', 0.11), ('cloud', 0.11), ('barriers', 0.11), ('lam', 0.11), ('conditions', 0.11), ('rule', 0.1), ('insulation', 0.1), ('bathroom', 0.09), ('convection', 0.09), ('cavity', 0.09), ('source', 0.08), ('properties', 0.08), ('bend', 0.08), ('bottles', 0.08), ('ceramics', 0.07), ('temper', 0.07), ('tense', 0.07), ('keller', 0.07), ('breakdown', 0.07), ('concrete', 0.07), ('simon', 0.07), ('solids', 0.06), ('windshield', 0.05), ('eye', 0.05), ('sunlight', 0.05), ('brittle', 0.03), ('caused', 
0.03), ('suns', 0.03), ('floor', 0.02), ('components', 0.02), ('photo', 0.02), ('change', 0.02), ('sun', 0.01), ('crystals', 0.01), ('problem', 0.01), ('##proof', 0.01), ('parameters', 0.01), ('gases', 0.0), ('prism', 0.0), ('doing', 0.0), ('lattice', 0.0), ('ground', 0.0)]

Note 1: This specific passage was used as an example for ease of comparison.
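
Read as reductions relative to each baseline, the ~4% and ~48% figures can be checked against the "number of actual dimensions" printed for the three models above. This is a small worked calculation for this one passage only; the helper name is ours.

```python
# Non-zero dimensions reported above for the same example passage.
ours = 121
baselines = {
    "naver/splade-cocondenser-ensembledistil": 126,
    "naver/splade-v2-distil": 234,
}


def percent_fewer(ours_dims: int, baseline_dims: int) -> float:
    """Percentage by which our representation is smaller than the baseline's."""
    return 100.0 * (1.0 - ours_dims / baseline_dims)


reductions = {name: percent_fewer(ours, dims) for name, dims in baselines.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer non-zero dimensions")
```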

</details>

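4. How does it translate into empirical metrics?

Our models are token-sparse and yet effective. That translates to faster retrieval (user experience) and a smaller index size ($). Mean retrieval time on the standard MS-MARCO small dev set and scaled total FLOPS loss are the respective metrics below. This is why Google's SparseEmbed is interesting: they also achieve SPLADE-quality retrieval effectiveness with much lower FLOPs. Compared to ColBERT, SPLADE and SparseEmbed match query and document terms with linear complexity, whereas ColBERT's late interaction, i.e. all query-document term pairs, takes quadratic complexity. The challenge with SparseEmbed is that it uses a hyperparameter called Top-k to restrict the number of tokens used to learn contextualized dense representations, say 64 and 256 tokens for query and passage encoding respectively. It is unclear how well these hyperparameters transfer to other domains or languages where the notion of a token changes a lot, such as our mother tongue Tamil, which is agglutinative in nature.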
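
The complexity contrast above can be sketched in a few lines. This is a simplified illustration with hypothetical toy inputs, not the scoring code of SPLADE, SparseEmbed or ColBERT: sparse scoring does one lookup per query term, while late interaction compares every query vector with every document vector.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Sum of weight products over shared terms: one pass over the query terms."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def late_interaction(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """MaxSim-style scoring: each query vector is compared with all document vectors."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical sparse representations (term -> weight).
query = {"glass": 1.0, "crack": 0.5}
doc = {"glass": 2.0, "thermal": 2.0, "crack": 1.0}
sparse_score = sparse_dot(query, doc)

# Hypothetical 2-dimensional token embeddings.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.5, 0.5]]
late_score = late_interaction(query_vecs, doc_vecs)

print(sparse_score, late_score)
```

In a real deployment the sparse case is served from an inverted index rather than per-document dictionaries, but the per-term lookup pattern is the same.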

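Note: Why Anserini and not PISA? Anserini is a production-ready, Lucene-based library. Common industry search deployments use Solr or Elastic, which are Lucene based, hence the performance is comparable. PISA latency is irrelevant for industry as it is a research-only system. The full Anserini evaluation log will be updated soon, with encoding, indexing and querying details here.

BEIR ZST OOD performance: Will be added to the end of the page.

Our model differs in a few more ways

Cocondenser weights: Unlike the best official SPLADE++ or SparseEmbed, we do NOT initialize weights from Luyu/co-condenser* models. Yet we achieve CoCondenser SPLADE-level performance. More on this later.
Same size models: The official SPLADE++, SparseEmbed and ours are all fine-tuned on models of the same size, that of bert-base-uncased. </details>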

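5. Roadmap and future directions for industry suitability.

Improve efficiency: This is a bottomless pit; we will continue to improve serving and retrieval efficiency.
Custom/domain fine-tuning: The OOD zero-shot performance of SPLADE models is great, but it matters little in the industry setting, where we need the ability to fine-tune on custom datasets or domains. Fine-tuning SPLADE on a new dataset is not cheap and needs labelled queries and passages. So we will continue to look at how to economically fine-tune our recipe on custom datasets without expensive labelling.
Multilingual SPLADE: The training cost of SPLADE, i.e. the GPU budget, is directly proportional to the vocabulary size of the base model, so multilingual SPLADE using mbert or XLMR can be expensive: they have 120K and 250K vocabularies, as opposed to 30K for bert-base-uncased. We will continue to research how best to extend our recipe to the multilingual world.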
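
One way to see why vocabulary size matters: the MLM head emits one logit per (token position, vocabulary entry), so the logit tensor for a single document grows linearly with the vocabulary. The sketch below uses the rounded vocabulary sizes quoted above and the doc length of 128 from section 2; it only counts logits and is not a measurement of actual training cost.

```python
DOC_LEN = 128  # document sequence length quoted in section 2

# Rounded vocabulary sizes as quoted in the text above.
vocab_sizes = {"bert-base-uncased": 30_000, "mbert": 120_000, "XLMR": 250_000}

# Number of MLM logits produced for one document of DOC_LEN tokens.
logits_per_doc = {name: DOC_LEN * size for name, size in vocab_sizes.items()}

base = logits_per_doc["bert-base-uncased"]
for name, count in logits_per_doc.items():
    print(f"{name}: {count:,} logits per document ({count / base:.1f}x)")
```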

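6. Usage

To enable a lightweight inference solution without heavy Torch dependencies, we will also release a library, SPLADERunner. Of course, if that doesn't matter to you, you can always use these models with the Huggingface transformers library.

6a. With popular vector DBs

Vector DB	Colab link
Pinecone
Qdrant	TBD

6b. With the SPLADERunner library

SPLADERunner library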

pip install spladerunner

# One-time init
from spladerunner import Expander
# The default model is the document expander.
expander = Expander()

# Sample document expansion
sparse_rep = expander.expand(
    ["The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."])

6c. With HuggingFace

NOTEBOOK user? Log in first

!huggingface-cli login

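Integrating into your code? Make these changes to pass your HF token in code

# Replace "<Your token>" with your own Hugging Face access token.
tokenizer = AutoTokenizer.from_pretrained('prithivida/Splade_PP_en_v1', token="<Your token>")
model = AutoModelForMaskedLM.from_pretrained('prithivida/Splade_PP_en_v1', token="<Your token>")

Full code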

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained('prithivida/Splade_PP_en_v1')
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}
model = AutoModelForMaskedLM.from_pretrained('prithivida/Splade_PP_en_v1')
model.to(device)

sentence = """The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."""

inputs = tokenizer(sentence, return_tensors='pt')
inputs = {key: val.to(device) for key, val in inputs.items()}
input_ids = inputs['input_ids']

attention_mask = inputs['attention_mask']

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# SPLADE pooling: log-saturated ReLU, mask out padding, then max over the sequence.
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
max_val, _ = torch.max(weighted_log, dim=1)
vector = max_val.squeeze()


cols = vector.nonzero().squeeze().cpu().tolist()
print("number of actual dimensions: ", len(cols))
weights = vector[cols].cpu().tolist()

d = {k: v for k, v in zip(cols, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v,2)))

print("SPLADE BOW rep:\n", bow_rep)
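
To store the result in a sparse-capable vector database, the `cols` and `weights` lists computed above usually need to be packed as parallel lists of integer indices and float values. The sketch below assumes that shape; the function name, the `indices`/`values` field names and the `min_weight` cutoff are illustrative assumptions, so check your database's client documentation for the exact format it expects.

```python
def to_sparse_vector(token_ids: list[int], weights: list[float], min_weight: float = 0.0) -> dict:
    """Pair vocabulary ids with weights, dropping entries at or below min_weight."""
    kept = [(i, w) for i, w in zip(token_ids, weights) if w > min_weight]
    kept.sort(key=lambda pair: pair[0])  # ascending vocabulary id
    return {"indices": [i for i, _ in kept], "values": [w for _, w in kept]}


# Hypothetical ids and weights standing in for `cols` and `weights` above.
sparse = to_sparse_vector([2054, 1012, 3000], [0.5, 0.0, 1.25])
print(sparse)
```

Raising `min_weight` is one simple way to trade a little effectiveness for a smaller index, in the same spirit as the token-budget discussion in section 3.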

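BEIR zero-shot OOD performance:

Training details:

TBD

Acknowledgements

Thanks to Nils Reimers for all the input.
Thanks to the authors of the Anserini library.

Limitations and bias

All limitations and biases of the BERT model apply to this fine-tuning effort.

Citation

Please cite if you use our models or libraries. Citation info below.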

Damodaran, P. (2024). Splade_PP_en_v2: Independent Implementation of SPLADE++ Model (`a.k.a splade-cocondenser* and family`) for the Industry setting. (Version 2.0.0) [Computer software].

goverlyai/Splade_PP_en_v2

Author: goverlyai


Created: 2025-02-20 03:59:15+00:00

Updated: 2025-02-20 18:18:00+00:00
