MiniCOIL v1
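MiniCOIL is a sparse neural embedding model for text retrieval.
It produces a 4-dimensional embedding for each word stem, capturing the meaning of the word. These meaning embeddings are combined into a bag-of-words (BoW) representation of the input text. The final sparse representation is obtained by weighting each word with the BM25 scoring formula.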
<img src="https://storage.googleapis.com/qdrant-examples/miniCOIL_inference.png" alt="miniCOIL inference" width="600"/>
If a word is not in the miniCOIL vocabulary, its weight in the sparse representation is based purely on the BM25 score.
Read more about miniCOIL in this article.
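The assembly described above can be sketched in a few lines of Python. This is a minimal illustration, not the FastEmbed implementation: the toy vocabulary, the `vocabulary id * 4 + dimension` index layout, the CRC32-based index for out-of-vocabulary words, the BM25 parameters `k` and `b`, and all helper names are assumptions made for the example. Using one meaning vector per stem is also a simplification.

```python
import zlib
from collections import Counter

DIM = 4  # miniCOIL uses 4-dimensional meaning embeddings per word stem
VOCAB = {"fruit": 0, "bat": 1}  # hypothetical toy vocabulary: stem -> id


def bm25_tf(tf, doc_len, avg_len, k=1.2, b=0.75):
    """Term-frequency part of BM25 (the IDF part is applied separately)."""
    return tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))


def oov_index(stem):
    """Hypothetical index for an out-of-vocabulary stem, placed after all in-vocabulary slots."""
    return len(VOCAB) * DIM + zlib.crc32(stem.encode("utf-8"))


def sparse_representation(stems, meaning_vectors, avg_len):
    """Build a sparse vector (index -> value) from stems and their 4D meaning vectors."""
    sparse = {}
    for stem, tf in Counter(stems).items():
        weight = bm25_tf(tf, len(stems), avg_len)
        if stem in VOCAB:
            # In-vocabulary stem: four consecutive slots, each scaled by the BM25 weight.
            base = VOCAB[stem] * DIM
            for i, x in enumerate(meaning_vectors[stem]):
                sparse[base + i] = weight * x
        else:
            # Out-of-vocabulary stem: a single slot holding only the BM25 weight.
            sparse[oov_index(stem)] = weight
    return sparse


# Made-up meaning vector for "bat"; "baseball" is outside the toy vocabulary.
rep = sparse_representation(["baseball", "bat"], {"bat": [0.6, -0.3, 0.1, -0.7]}, avg_len=2.0)
```

Here `rep` has five entries: four for "bat" and one for "baseball", the same 4 + 1 shape that a text with one in-vocabulary and one out-of-vocabulary word produces.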
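Usage
This model is designed to be used with the FastEmbed library.
Note: This model was designed with Qdrant's specifics in mind; using miniCOIL sparse vectors in Qdrant requires configuring Modifier.IDF. Otherwise, you have to compute the IDF part of the BM25 formula yourself and scale the generated sparse representations by it.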
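For the case without Modifier.IDF, the manual scaling mentioned in the note could look roughly like the sketch below. It assumes the common smoothed BM25 IDF, ln((N - n + 0.5) / (n + 0.5) + 1), and applies it to the query side only; both choices, along with the function names and the dict-based sparse vectors, are assumptions for illustration rather than a description of what Qdrant does internally.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """IDF part of BM25 in its common smoothed form (assumed here)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


def scale_by_idf(query, corpus):
    """Scale each query value by the IDF of its index.

    query:  dict mapping sparse index -> value
    corpus: list of such dicts, one per indexed document
    """
    num_docs = len(corpus)
    scaled = {}
    for idx, value in query.items():
        # Document frequency: number of documents with a non-zero entry at idx.
        doc_freq = sum(1 for doc in corpus if idx in doc)
        scaled[idx] = value * bm25_idf(num_docs, doc_freq)
    return scaled


# Toy sparse vectors with made-up indices and values.
corpus = [{0: 0.3, 4: 0.9}, {4: -0.2, 99: 1.0}]
scaled_query = scale_by_idf({4: 0.5, 99: 1.0}, corpus)
```

Index 4 appears in both documents and is down-weighted (IDF = ln 1.2), while index 99 appears in only one and keeps more of its weight (IDF = ln 2).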
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(model_name="Qdrant/minicoil-v1")
documents = [
"fruit bat",
"baseball bat",
]
embeddings = list(model.embed(documents))
query_embedding = list(model.query_embed("bat in a cave"))
# embeddings[0] - "fruit bat"
# SparseEmbedding(
#     values=array([-1.2509683 , -0.9510568 , -0.55398935,  0.188206  ,
#                    1.0497165 ,  0.31841373, -0.82047373, -0.9671025 ], dtype=float32),
#     indices=array([ 8992,  8993,  8994,  8995, 18832, 18833, 18834, 18835], dtype=int32))
# 8992-8995: 4D "fruit" representation; 18832-18835: 4D "bat" representation
# embeddings[1] - "baseball bat"
# SparseEmbedding(
#     values=array([ 1.1004512 , -0.5959816 ,  0.23380531, -1.0912857 ,  1.6768292 ], dtype=float32),
#     indices=array([     18832,      18833,      18834,      18835, 2068153269], dtype=int32))
# 18832-18835: 4D "bat" representation; 2068153269: 1D "baseball" representation.
# "baseball" is not in the miniCOIL_v1 vocabulary, so it falls back to the Qdrant/bm25 1D score.
# query_embedding - "bat in a cave"
# [SparseEmbedding(
#     values=array([ 0.5656684 ,  0.395691  , -0.48945513, -0.5328054 ,
#                   -0.5889519 ,  0.55871224,  0.27323055,  0.5160634 ], dtype=float32),
#     indices=array([18832, 18833, 18834, 18835, 18920, 18921, 18922, 18923], dtype=int32))]
# 18832-18835: 4D "bat" representation; 18920-18923: 4D "cave" representation; "in" and "a" are removed as stop words
# Slice the 4D "bat" representation out of each sparse vector
bat_1 = embeddings[0].values[4:8]
bat_2 = embeddings[1].values[:4]
bat_query = query_embedding[0].values[:4]
# Dot product between the matching indices of the sparse vectors
dot_product_1 = (bat_1 * bat_query).sum()  # np.float32(1.6366475)
dot_product_2 = (bat_2 * bat_query).sum()  # np.float32(0.8536716)
# 1.6366475 > 0.8536716: "bat" in "fruit bat" is semantically closer to "bat" in "bat in a cave" than "bat" in "baseball bat" is
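The slices above rely on knowing where the "bat" dimensions sit in each vector. A more general way to get the same numbers is to match entries by index, as in the sketch below. `sparse_dot` is a hypothetical helper written for this example, not part of FastEmbed, and the vectors are copied by hand from the output shown above so the snippet runs without downloading the model.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product over the indices that two sparse vectors share."""
    lookup = dict(zip(indices_b, values_b))
    return sum(v * lookup[i] for i, v in zip(indices_a, values_a) if i in lookup)


# (indices, values) pairs copied from the example output above.
fruit_bat = (
    [8992, 8993, 8994, 8995, 18832, 18833, 18834, 18835],
    [-1.2509683, -0.9510568, -0.55398935, 0.188206,
     1.0497165, 0.31841373, -0.82047373, -0.9671025],
)
baseball_bat = (
    [18832, 18833, 18834, 18835, 2068153269],
    [1.1004512, -0.5959816, 0.23380531, -1.0912857, 1.6768292],
)
query = (
    [18832, 18833, 18834, 18835, 18920, 18921, 18922, 18923],
    [0.5656684, 0.395691, -0.48945513, -0.5328054,
     -0.5889519, 0.55871224, 0.27323055, 0.5160634],
)

# Only the four "bat" indices are shared with the query, so only they contribute.
score_fruit_bat = sparse_dot(*fruit_bat, *query)
score_baseball_bat = sparse_dot(*baseball_bat, *query)
```

Up to floating-point rounding, the two scores agree with `dot_product_1` and `dot_product_2` above, because "bat" is the only word each document shares with the query.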
Qdrant/minicoil-v1
Author: Qdrant
sentence-similarity
Created: 2025-05-10 12:27:45+00:00
Updated: 2025-06-26 09:38:32+00:00
Files on Hugging Face (19)
.gitattributes
README.md
config.json
minicoil.triplet.model.npy
minicoil.triplet.model.vocab
onnx/model.onnx
onnx/model_bnb4.onnx
onnx/model_fp16.onnx
onnx/model_int8.onnx
onnx/model_q4.onnx
onnx/model_q4f16.onnx
onnx/model_quantized.onnx
onnx/model_uint8.onnx
quantize_config.json
special_tokens_map.json
stopwords.txt
tokenizer.json
tokenizer_config.json
vocab.txt