<br><br>
<p align="center"> <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px"> </p>
<p align="center"> <b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b> </p>
<p align="center"> <b>JinaColBERT V2: A general-purpose multilingual late interaction retriever.</b> </p>
JinaColBERT V2 (jina-colbert-v2) is a new model built on JinaColBERT V1 that expands on the capabilities and performance of the jina-colbert-v1-en model. Like its predecessor, it features Jina AI's 8192-token input context, and it offers improved efficiency, performance, and explainability through token-level embeddings and late interaction.
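Late interaction scores a query-document pair by comparing token embeddings directly rather than single pooled vectors. A minimal numpy sketch of the MaxSim scoring rule it relies on (hand-made toy unit vectors, not the model's actual embeddings):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    Both inputs are (num_tokens, dim) arrays of L2-normalized token
    embeddings, so dot products are cosine similarities. Each query token
    is matched to its most similar document token, and those per-token
    maxima are summed into a single relevance score.
    """
    sim = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best match per query token, summed

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy 4-dim token embeddings standing in for real ColBERT outputs.
q = l2_normalize(np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]]))
d = l2_normalize(np.array([[1.0, 0.1, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0],
                           [0.2, 0.9, 0.0, 0.0]]))
print(maxsim_score(q, d))
```

Because each query token only needs its best-matching document token, document token embeddings can be indexed offline and scored cheaply at query time, which is what keeps late interaction close to bi-encoder efficiency.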
This new release adds features and performance improvements:

- Multilingual support covering dozens of languages, with strong performance on major world languages.
- Matryoshka embeddings, which let users flexibly trade off between efficiency and precision.
- Superior retrieval performance compared to the English-only jina-colbert-v1-en.
JinaColBERT V2 is available in three variants with different embedding dimensions:

- jinaai/jina-colbert-v2: 128-dimensional embeddings
- jinaai/jina-colbert-v2-96: 96-dimensional embeddings
- jinaai/jina-colbert-v2-64: 64-dimensional embeddings
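Matryoshka-style training packs the most important information into the leading components of each embedding, which is what makes the lower-dimensional variants viable. A toy numpy sketch of the underlying idea, truncating 128-dim unit vectors to 64 dims and re-normalizing (illustrative only; in practice you would simply load the 96- or 64-dim checkpoint listed above):

```python
import numpy as np

def truncate_and_renormalize(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components and rescale each row to unit length."""
    kept = vecs[:, :dim]
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(5, 128))             # stand-in for 128-dim token embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 64)   # half the storage per token
print(small.shape)
```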
## Usage

### Installation

jina-colbert-v2 is trained with flash attention, so einops and flash_attn must be installed.

You can use the model via the Stanford ColBERT library, or via the pylate / ragatouille packages.
```shell
pip install -U einops flash_attn
pip install -U ragatouille  # or
pip install -U colbert-ai   # or
pip install -U pylate
```
### PyLate

```python
# See PyLate for detailed usage: https://github.com/lightonai/pylate
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
```
### RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

RAG.index(docs, index_name="demo")

query = "What does ColBERT do?"
results = RAG.search(query)
```
### Stanford ColBERT

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]

query_vectors = ckpt.queryFromText(docs, bsize=2)
```
## Evaluation Results

### Retrieval Benchmarks

#### BEIR
| NDCG@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| avg | 0.531 | 0.502 | 0.496 | 0.440 |
| nfcorpus | 0.346 | 0.338 | 0.337 | 0.325 |
| fiqa | 0.408 | 0.368 | 0.354 | 0.236 |
| trec-covid | 0.834 | 0.750 | 0.726 | 0.656 |
| arguana | 0.366 | 0.494 | 0.465 | 0.315 |
| quora | 0.887 | 0.823 | 0.855 | 0.789 |
| scidocs | 0.186 | 0.169 | 0.154 | 0.158 |
| scifact | 0.678 | 0.701 | 0.689 | 0.665 |
| webis-touche | 0.274 | 0.270 | 0.260 | 0.367 |
| dbpedia-entity | 0.471 | 0.413 | 0.452 | 0.313 |
| fever | 0.805 | 0.795 | 0.785 | 0.753 |
| climate-fever | 0.239 | 0.196 | 0.176 | 0.213 |
| hotpotqa | 0.766 | 0.656 | 0.675 | 0.603 |
| nq | 0.640 | 0.549 | 0.524 | 0.329 |
#### MS MARCO Passage Retrieval
| MRR@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| MSMARCO | 0.396 | 0.390 | 0.397 | 0.187 |
### Multilingual Benchmarks

#### MIRACL

| NDCG@10 | jina-colbert-v2 | mDPR (zero-shot) |
|---|---|---|
| avg | 0.627 | 0.427 |
| ar | 0.753 | 0.499 |
| bn | 0.750 | 0.443 |
| de | 0.504 | 0.490 |
| es | 0.538 | 0.478 |
| en | 0.570 | 0.394 |
| fa | 0.563 | 0.480 |
| fi | 0.740 | 0.472 |
| fr | 0.541 | 0.435 |
| hi | 0.600 | 0.383 |
| id | 0.547 | 0.272 |
| ja | 0.632 | 0.439 |
| ko | 0.671 | 0.419 |
| ru | 0.643 | 0.407 |
| sw | 0.499 | 0.299 |
| te | 0.742 | 0.356 |
| th | 0.772 | 0.358 |
| yo | 0.623 | 0.396 |
| zh | 0.523 | 0.512 |
#### mMARCO

| MRR@10 | jina-colbert-v2 | BM25 | ColBERT-XM |
|---|---|---|---|
| avg | 0.313 | 0.141 | 0.254 |
| ar | 0.272 | 0.111 | 0.195 |
| de | 0.331 | 0.136 | 0.270 |
| nl | 0.330 | 0.140 | 0.275 |
| es | 0.341 | 0.158 | 0.285 |
| fr | 0.335 | 0.155 | 0.269 |
| hi | 0.309 | 0.134 | 0.238 |
| id | 0.319 | 0.149 | 0.263 |
| it | 0.337 | 0.153 | 0.265 |
| ja | 0.276 | 0.141 | 0.241 |
| pt | 0.337 | 0.152 | 0.276 |
| ru | 0.298 | 0.124 | 0.251 |
| vi | 0.287 | 0.136 | 0.226 |
| zh | 0.302 | 0.116 | 0.246 |
### Matryoshka Representation Benchmarks

#### BEIR
| NDCG@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| avg | 0.599 | 0.591 | 0.589 |
| nfcorpus | 0.346 | 0.340 | 0.347 |
| fiqa | 0.408 | 0.404 | 0.404 |
| trec-covid | 0.834 | 0.808 | 0.805 |
| hotpotqa | 0.766 | 0.764 | 0.756 |
| nq | 0.640 | 0.640 | 0.635 |
#### MSMARCO
| MRR@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| msmarco | 0.396 | 0.391 | 0.388 |
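The practical payoff of the smaller dimensions is index size. A back-of-the-envelope sketch, where the corpus size (100M document tokens) and float16 storage are hypothetical assumptions:

```python
# Back-of-the-envelope index sizes for token-level embeddings.
# The corpus size and float16 storage below are hypothetical assumptions.
NUM_DOC_TOKENS = 100_000_000   # e.g. ~1M documents x ~100 tokens each
BYTES_PER_COMPONENT = 2        # float16

def index_size_gb(dim: int) -> float:
    return NUM_DOC_TOKENS * dim * BYTES_PER_COMPONENT / 1e9

for dim in (128, 96, 64):
    print(f"dim={dim}: {index_size_gb(dim):.1f} GB")
```

Dropping from 128 to 64 dimensions halves the index, which matches the roughly 50% storage saving reported in the paper abstract.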
## Other Models

In addition, we provide the following embedding models, which you can also use for retrieval:

- jina-embeddings-v2-base-en: 137M parameters.
- jina-embeddings-v2-base-zh: 161M parameters, Chinese-English bilingual.
- jina-embeddings-v2-base-de: 161M parameters, German-English bilingual.
- jina-embeddings-v2-base-es: 161M parameters, Spanish-English bilingual.
- jina-reranker-v2: multilingual reranker model.
- jina-clip-v1: English multimodal (text-image) embedding model.
## Contact

Join our Discord community to exchange ideas with other community members.
## Citation

```bibtex
@inproceedings{xiao-etal-2024-jina,
    title = "{J}ina-{C}ol{BERT}-v2: A General-Purpose Multilingual Late Interaction Retriever",
    author = {Jha, Rohan and
      Wang, Bo and
      G{\"u}nther, Michael and
      Mastrapas, Georgios and
      Sturua, Saba and
      Mohr, Isabelle and
      Koukounas, Andreas and
      Wang, Mohammad Kalim and
      Wang, Nan and
      Xiao, Han},
    editor = {S{\"a}lev{\"a}, Jonne and
      Owodunni, Abraham},
    booktitle = "Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.mrl-1.11/",
    doi = "10.18653/v1/2024.mrl-1.11",
    pages = "159--166",
    abstract = "Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model's retrieval performance and cuts storage requirements by up to 50{\%}. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks."
}
```
jinaai/jina-colbert-v2
Author: jinaai
Created: 2024-08-16 13:55:02+00:00
Updated: 2025-01-17 08:20:26+00:00