<br><br>
<p align="center"> <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px"> </p>
<p align="center"> <b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b> </p>
<p align="center"> <b>JinaColBERT V2: A general-purpose multilingual late interaction retriever.</b> </p>
JinaColBERT V2 (jina-colbert-v2) is a new model built on JinaColBERT V1 that expands on the capabilities and performance of the jina-colbert-v1-en model. Like its predecessor, it features Jina AI's 8192-token input context, and it offers improved efficiency, performance, and explainability through token-level embeddings and late interaction.
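Late interaction scores a query-document pair by comparing token embeddings directly rather than single pooled vectors. A minimal numpy sketch of the MaxSim scoring rule it relies on (hand-made toy unit vectors, not the model's actual embeddings):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    Both inputs are (num_tokens, dim) arrays of L2-normalized token
    embeddings, so dot products are cosine similarities. Each query token
    is matched to its most similar document token, and those per-token
    maxima are summed into a single relevance score.
    """
    sim = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best match per query token, summed

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy 4-dim token embeddings standing in for real ColBERT outputs.
q = l2_normalize(np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]]))
d = l2_normalize(np.array([[1.0, 0.1, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0],
                           [0.2, 0.9, 0.0, 0.0]]))
print(maxsim_score(q, d))
```

Because each query token only needs its best-matching document token, document token embeddings can be indexed offline and scored cheaply at query time, which is what keeps late interaction close to bi-encoder efficiency.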
This new release adds features and performance improvements:

- Multilingual support covering dozens of languages, with strong performance on major world languages.
- Matryoshka embeddings, which let users flexibly trade off between efficiency and precision.
- Superior retrieval performance compared to the English-only jina-colbert-v1-en.
JinaColBERT V2 is available in three variants with different embedding dimensions:

- jinaai/jina-colbert-v2: 128-dimensional embeddings
- jinaai/jina-colbert-v2-96: 96-dimensional embeddings
- jinaai/jina-colbert-v2-64: 64-dimensional embeddings
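Matryoshka-style training packs the most important information into the leading components of each embedding, which is what makes the lower-dimensional variants viable. A toy numpy sketch of the underlying idea, truncating 128-dim unit vectors to 64 dims and re-normalizing (illustrative only; in practice you would simply load the 96- or 64-dim checkpoint listed above):

```python
import numpy as np

def truncate_and_renormalize(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components and rescale each row to unit length."""
    kept = vecs[:, :dim]
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(5, 128))             # stand-in for 128-dim token embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 64)   # half the storage per token
print(small.shape)
```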
## Usage

### Installation

jina-colbert-v2 is trained with flash attention, so einops and flash_attn must be installed.

You can use the model via the Stanford ColBERT library, or via the pylate / ragatouille packages.
```shell
pip install -U einops flash_attn
pip install -U ragatouille  # or
pip install -U colbert-ai   # or
pip install -U pylate
```
### PyLate

```python
# See PyLate for detailed usage: https://github.com/lightonai/pylate
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
```
### RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

RAG.index(docs, index_name="demo")

query = "What does ColBERT do?"
results = RAG.search(query)
```
### Stanford ColBERT

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

query_vectors = ckpt.queryFromText(docs, bsize=2)
```
## Evaluation Results

### Retrieval Benchmarks

#### BEIR
| NDCG@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| avg | 0.531 | 0.502 | 0.496 | 0.440 |
| nfcorpus | 0.346 | 0.338 | 0.337 | 0.325 |
| fiqa | 0.408 | 0.368 | 0.354 | 0.236 |
| trec-covid | 0.834 | 0.750 | 0.726 | 0.656 |
| arguana | 0.366 | 0.494 | 0.465 | 0.315 |
| quora | 0.887 | 0.823 | 0.855 | 0.789 |
| scidocs | 0.186 | 0.169 | 0.154 | 0.158 |
| scifact | 0.678 | 0.701 | 0.689 | 0.665 |
| webis-touche | 0.274 | 0.270 | 0.260 | 0.367 |
| dbpedia-entity | 0.471 | 0.413 | 0.452 | 0.313 |
| fever | 0.805 | 0.795 | 0.785 | 0.753 |
| climate-fever | 0.239 | 0.196 | 0.176 | 0.213 |
| hotpotqa | 0.766 | 0.656 | 0.675 | 0.603 |
| nq | 0.640 | 0.549 | 0.524 | 0.329 |
#### MS MARCO Passage Retrieval
| MRR@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| MSMARCO | 0.396 | 0.390 | 0.397 | 0.187 |
### Multilingual Benchmarks

#### MIRACL

| NDCG@10 | jina-colbert-v2 | mDPR (zero-shot) |
|---|---|---|
| avg | 0.627 | 0.427 |
| ar | 0.753 | 0.499 |
| bn | 0.750 | 0.443 |
| de | 0.504 | 0.490 |
| es | 0.538 | 0.478 |
| en | 0.570 | 0.394 |
| fa | 0.563 | 0.480 |
| fi | 0.740 | 0.472 |
| fr | 0.541 | 0.435 |
| hi | 0.600 | 0.383 |
| id | 0.547 | 0.272 |
| ja | 0.632 | 0.439 |
| ko | 0.671 | 0.419 |
| ru | 0.643 | 0.407 |
| sw | 0.499 | 0.299 |
| te | 0.742 | 0.356 |
| th | 0.772 | 0.358 |
| yo | 0.623 | 0.396 |
| zh | 0.523 | 0.512 |
#### mMARCO

| MRR@10 | jina-colbert-v2 | BM25 | ColBERT-XM |
|---|---|---|---|
| avg | 0.313 | 0.141 | 0.254 |
| ar | 0.272 | 0.111 | 0.195 |
| de | 0.331 | 0.136 | 0.270 |
| nl | 0.330 | 0.140 | 0.275 |
| es | 0.341 | 0.158 | 0.285 |
| fr | 0.335 | 0.155 | 0.269 |
| hi | 0.309 | 0.134 | 0.238 |
| id | 0.319 | 0.149 | 0.263 |
| it | 0.337 | 0.153 | 0.265 |
| ja | 0.276 | 0.141 | 0.241 |
| pt | 0.337 | 0.152 | 0.276 |
| ru | 0.298 | 0.124 | 0.251 |
| vi | 0.287 | 0.136 | 0.226 |
| zh | 0.302 | 0.116 | 0.246 |
### Matryoshka Representation Benchmarks

#### BEIR
| NDCG@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| avg | 0.599 | 0.591 | 0.589 |
| nfcorpus | 0.346 | 0.340 | 0.347 |
| fiqa | 0.408 | 0.404 | 0.404 |
| trec-covid | 0.834 | 0.808 | 0.805 |
| hotpotqa | 0.766 | 0.764 | 0.756 |
| nq | 0.640 | 0.640 | 0.635 |
#### MSMARCO
| MRR@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| msmarco | 0.396 | 0.391 | 0.388 |
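The practical payoff of the smaller dimensions is index size. A back-of-the-envelope sketch, where the corpus size (100M document tokens) and float16 storage are hypothetical assumptions:

```python
# Back-of-the-envelope index sizes for token-level embeddings.
# The corpus size and float16 storage below are hypothetical assumptions.
NUM_DOC_TOKENS = 100_000_000   # e.g. ~1M documents x ~100 tokens each
BYTES_PER_COMPONENT = 2        # float16

def index_size_gb(dim: int) -> float:
    return NUM_DOC_TOKENS * dim * BYTES_PER_COMPONENT / 1e9

for dim in (128, 96, 64):
    print(f"dim={dim}: {index_size_gb(dim):.1f} GB")
```

Dropping from 128 to 64 dimensions halves the index, which matches the roughly 50% storage saving reported in the paper abstract.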
## Other Models

In addition, we provide the following embedding models, which you can also use for retrieval:

- jina-embeddings-v2-base-en: 137M parameters.
- jina-embeddings-v2-base-zh: 161M parameters, Chinese-English bilingual.
- jina-embeddings-v2-base-de: 161M parameters, German-English bilingual.
- jina-embeddings-v2-base-es: 161M parameters, Spanish-English bilingual.
- jina-reranker-v2: multilingual reranker model.
- jina-clip-v1: English multimodal (text-image) embedding model.
## Contact

Join our Discord community to exchange ideas with other community members.
## Citation

```bibtex
@inproceedings{xiao-etal-2024-jina,
    title = "{J}ina-{C}ol{BERT}-v2: A General-Purpose Multilingual Late Interaction Retriever",
    author = {Jha, Rohan and
      Wang, Bo and
      G{\"u}nther, Michael and
      Mastrapas, Georgios and
      Sturua, Saba and
      Mohr, Isabelle and
      Koukounas, Andreas and
      Wang, Mohammad Kalim and
      Wang, Nan and
      Xiao, Han},
    editor = {S{\"a}lev{\"a}, Jonne and
      Owodunni, Abraham},
    booktitle = "Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.mrl-1.11/",
    doi = "10.18653/v1/2024.mrl-1.11",
    pages = "159--166",
    abstract = "Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model's retrieval performance and cuts storage requirements by up to 50{\%}. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks."
}
```
jinaai/jina-colbert-v2
Author: jinaai
Created: 2024-08-16 13:55:02+00:00
Updated: 2025-01-17 08:20:26+00:00