ONNX Model Zoo

Documentation

Granite-Embedding-30m-English (revision r1.1)

Model Summary: Granite-Embedding-30m-English is a 30M-parameter dense bi-encoder embedding model from the Granite Embeddings collection that can be used to generate high-quality text embeddings. The model produces embedding vectors of size 384 and is trained on a mix of open relevance-pair datasets (with permissive, enterprise-friendly licenses) and datasets collected and generated internally at IBM. While remaining competitive on academic benchmarks such as BEIR, the model also performs well on many enterprise use cases. It was developed using retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging to improve performance.

Granite-embedding-30m-r1.1 is designed specifically to support multi-turn information retrieval, handling contextual document retrieval in multi-turn conversational settings. Granite-embedding-30m-r1.1 is trained on data tailored for multi-turn conversational information retrieval and was distilled from granite-embedding-30m-english (https://huggingface.co/ibm-granite/granite-embedding-30m-english) using multi-teacher distillation.

Supported Languages: English.

Intended Use: The model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications.

Usage with Sentence Transformers: The model is compatible with the SentenceTransformer library and is very easy to use:

First, install the sentence transformers library

pip install sentence_transformers

The model can then be used to encode pairs of text and find the similarity between their representations.

Granite-Embedding-30m-English

from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-30m-english"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path)

input_queries = [
    ' Who made the song My achy breaky heart? ',
    'summit define'
    ]

input_passages = [
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991. ",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
    ]

# Encode queries and passages
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Calculate cosine similarity
print(util.cos_sim(query_embeddings, passage_embeddings))

Granite-Embedding-30m-r1.1

Specifically, to encode an entire conversation with granite-embedding-30m-r1.1, pass the last user query of the conversation as the input, with the conversation turns arranged in reverse chronological order: the last user query first, then the preceding agent response, then the earlier user query, and so on. For example:

Conversation: user:<user_query_1> agent: <agent_response_1> user:<user_query_2> agent: <agent_response_2> user:<user_query_3> agent: <agent_response_3> user:<last_user_query>

is converted into the input query format: <last_user_query>[SEP]agent: <agent_response_3>||user:<user_query_3>||agent: <agent_response_2>||user:<user_query_2>||agent: <agent_response_1>||user:<user_query_1>
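The conversion above can be sketched as a small helper. This is a hypothetical illustration, not part of the model release; `build_multiturn_query` and the (role, text) tuple representation are assumptions made for the example:

```python
# Hypothetical helper: build the reversed-order query string described above
# from a conversation given as (role, text) tuples in chronological order.
def build_multiturn_query(turns):
    last_role, last_query = turns[-1]
    assert last_role == "user", "conversation must end with a user query"
    # Previous turns, newest first, each prefixed with its role
    # (the model card's examples use "role: text" with a space after the colon)
    history = [f"{role}: {text}" for role, text in reversed(turns[:-1])]
    return last_query + "[SEP]" + "||".join(history)

conversation = [
    ("user", "How many teams are in the NFL?"),
    ("agent", "There are 32 teams in the National Football League (NFL)."),
    ("user", "How many teams are in the NFL playoffs?"),
]
print(build_multiturn_query(conversation))
# How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?
```

The output matches the prefix of the second example query used in the code below.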

from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-30m-english"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path, revision="granite-embedding-30m-r1.1")

input_queries = [
    "Which team has won the most Super Bowls?[SEP]agent: Six teams from each conference (AFC and NFC), for a total of 12 team playoff system.||user: How many teams are in the NFL playoffs?||agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?",

    "How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?||agent: The Chicago Cardinals became the St. Louis Cardinals in 1960 and eventually moved and became the Arizona Cardinals. The Chicago Cardinals ( now the Arizona Cardinals ) were a founding member of the NFL.||user: Are the Arizona Cardinals and the Chicago Cardinals the same team?||agent: The Arizona Cardinals do play outside the United States. They had a game in London, England, on October 22, 2017, against the Los Angeles Rams at Twickenham Stadium and in 2005 they played in Mexico.||user: Do the Arizona Cardinals play outside the US?"
    ]

input_passages = [
    "Super Bowl\nThe Pittsburgh Steelers have won six Super Bowls , the most of any team ; the Dallas Cowboys , New England Patriots and San Francisco 49ers have five victories each , while the Green Bay Packers and New York Giants have four Super Bowl championships . Fourteen other NFL franchises have won at least one Super Bowl . Eight teams have appeared in Super Bowl games without a win . The Minnesota Vikings were the first team to have appeared a record four times without a win . The Buffalo Bills played in a record four Super Bowls in a row and lost every one . Four teams ( the Cleveland Browns , Detroit Lions , Jacksonville Jaguars , and Houston Texans ) have never appeared in a Super Bowl . The Browns and Lions both won NFL Championships prior to the creation of the Super Bowl , while the Jaguars ( 1995 ) and Texans ( 2002 ) are both recent NFL expansion teams . ( Detroit , Houston , and Jacksonville , however , have hosted a Super Bowl , leaving the Browns the only team to date who has neither played in nor whose city has hosted the game . ) The Minnesota Vikings won the last NFL Championship before the merger but lost to the AFL champion Kansas City Chiefs in Super Bowl IV.",

    "NFL playoffs \n The 32 - team National Football League is divided into two conferences , American Football Conference ( AFC ) and National Football Conference ( NFC ) , each with 16 teams . Since 2002 , each conference has been further divided into four divisions of four teams each . The tournament brackets are made up of six teams from each of the league 's two conferences , following the end of the regular season . Qualification into the playoffs works as follows : "
    ]

# Encode queries and passages
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Calculate cosine similarity
print(util.cos_sim(query_embeddings, passage_embeddings))

Usage with Huggingface Transformers: This is a simple example of how to use the Granite-Embedding-30m-English model with the Transformers library and PyTorch.

First, install the required libraries

pip install transformers torch

The model can then be used to encode pairs of text

Granite-Embedding-30m-English

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-30m-english"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    ' Who made the song My achy breaky heart? ',
    'summit define'
    ]

# Tokenize the inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# Encode the queries
with torch.no_grad():
    # Queries
    model_output = model(**tokenized_queries)
    # Perform pooling. granite-embedding-30m-english uses CLS pooling
    query_embeddings = model_output[0][:, 0]

# Normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
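A note on why the final normalization step is useful: once embeddings are L2-normalized to unit length, cosine similarity reduces to a plain dot product, so retrieval systems can use fast inner-product search. A minimal pure-Python sketch (illustrative only, independent of the model code above) demonstrates the equivalence:

```python
import math

# For unit-length vectors, cosine similarity equals the dot product.
def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
an, bn = l2_normalize(a), l2_normalize(b)
print(round(cosine(a, b), 6), round(dot(an, bn), 6))  # both 0.96
```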

Granite-Embedding-30m-r1.1

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-30m-english"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path, revision="granite-embedding-30m-r1.1")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    "Which team has won the most Super Bowls?[SEP]agent: Six teams from each conference (AFC and NFC), for a total of 12 team playoff system.||user: How many teams are in the NFL playoffs?||agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?",

    "How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?||agent: The Chicago Cardinals became the St. Louis Cardinals in 1960 and eventually moved and became the Arizona Cardinals. The Chicago Cardinals ( now the Arizona Cardinals ) were a founding member of the NFL.||user: Are the Arizona Cardinals and the Chicago Cardinals the same team?||agent: The Arizona Cardinals do play outside the United States. They had a game in London, England, on October 22, 2017, against the Los Angeles Rams at Twickenham Stadium and in 2005 they played in Mexico.||user: Do the Arizona Cardinals play outside the US?"
    ]

# Tokenize the inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# Encode the queries
with torch.no_grad():
    # Queries
    model_output = model(**tokenized_queries)
    # Perform pooling. granite-embedding-30m-r1.1 uses CLS pooling
    query_embeddings = model_output[0][:, 0]

# Normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)

Evaluation:

The Granite-Embedding-30M-English model is twice as fast as other models with similar embedding dimensions while maintaining competitive performance. Below is the performance of the Granite-Embedding-30M-English model on the MTEB Retrieval (i.e., BEIR) and code retrieval (CoIR) benchmarks.

Model  Parameters (M)  Embedding Dim.  MTEB Retrieval (15)  CoIR (10)
granite-embedding-30m-english 30 384 49.1 47.0

The granite-embedding-30m-r1.1 revision retains the speed of granite-embedding-30m-english while performing strongly on multi-turn information retrieval benchmarks. Below is the performance of the granite-embedding-30m-r1.1 model on MTEB Retrieval (i.e., BEIR) and multi-turn information retrieval datasets (MTRAG (https://github.com/IBM/mt-rag-benchmark), Multidoc2dial (https://github.com/IBM/multidoc2dial)).

Model  Parameters (M)  Embedding Dim.  MTEB Retrieval (15)  MT-RAG  Mdoc2dial
granite-embedding-30m-english 30 384 49.1 49.16 85.42
granite-embedding-30m-english-r1.1 30 384 48.9 52.33 85.78
bge-small-en-v1.5 33 512 53.86 38.26 83.71
e5-small-v2 33 384 48.46 28.72 75.7

Model Architecture: granite-embedding-30m-english is based on an encoder-only, RoBERTa-like transformer architecture, trained internally at IBM Research. granite-embedding-30m-r1.1 uses the same architecture as granite-embedding-30m-english.

Model  granite-embedding-30m-english  granite-embedding-125m-english  granite-embedding-107m-multilingual  granite-embedding-278m-multilingual
Embedding size  384  768  384  768
Number of layers  6  12  6  12
Number of attention heads  12  12  12  12
Intermediate size  1536  3072  1536  3072
Activation function  GeLU  GeLU  GeLU  GeLU
Vocabulary size  50265  50265  250002  250002
Max. sequence length  512  512  512  512
# Parameters  30M  125M  107M  278M
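As a sanity check on the table, the 30M figure can be roughly reproduced from the listed dimensions, assuming a standard RoBERTa-style parameter layout. This is a hypothetical back-of-the-envelope estimate, not the model's exact accounting:

```python
# Rough parameter-count estimate for granite-embedding-30m-english,
# assuming a standard RoBERTa-style encoder layout.
hidden = 384
layers = 6
intermediate = 1536
vocab = 50265
max_pos = 512 + 2  # RoBERTa-style models reserve two extra position slots

# Embeddings: token + position + token-type tables, plus one LayerNorm
embed = (vocab + max_pos + 1) * hidden + 2 * hidden

# Per encoder layer: Q/K/V/output projections (with biases), FFN, two LayerNorms
attn = 4 * (hidden * hidden + hidden)
ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
norms = 2 * 2 * hidden
per_layer = attn + ffn + norms

total = embed + layers * per_layer
print(f"{total / 1e6:.1f}M parameters")  # 30.1M parameters
```

The estimate lands close to the reported 30M, with the embedding table accounting for roughly two thirds of the total.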

Training Data: Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:

Dataset  Num. Pairs
SPECTER citation triplets  684,100
Stack Exchange Duplicate questions (titles)  304,525
Stack Exchange Duplicate questions (bodies)  250,519
Stack Exchange Duplicate questions (titles+bodies)  250,460
Natural Questions (NQ)  100,231
SQuAD2.0  87,599
PAQ (Question, Answer) pairs  64,371,441
Stack Exchange (Title, Answer) pairs  4,067,139
Stack Exchange (Title, Body) pairs  23,978,013
Stack Exchange (Title+Body, Answer) pairs  187,195
S2ORC Citation pairs (Titles)  52,603,982
S2ORC (Title, Abstract)  41,769,185
S2ORC (Citations, Abstracts)  52,603,982
WikiAnswers Duplicate question pairs  77,427,422
SearchQA  582,261
HotpotQA  85,000
Fever  109,810
Arxiv  2,358,545
Wikipedia  20,745,403
PubMed  20,000,000
Miracl En Pairs  9,016
DBPedia Title-Body pairs  4,635,922
Synthetic: Query-Wikipedia Passage  1,879,093
Synthetic: Fact Verification  9,888
IBM Internal Triples  40,290
IBM Internal Title-Body pairs  1,524,586
MultiDoc2Dial train set (multi-turn)  21,451
IBM Internal Synthetic data  19,533

Notably, we did not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license, while other open-source models train on this dataset due to its high quality.

Infrastructure: We trained the Granite Embedding models using IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training the models over multiple GPUs.

Ethical Considerations and Limitations: The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-embedding-30m-english and Granite-embedding-30m-r1.1 are trained on English text only, with a context length of 512 tokens (longer texts will be truncated).

Resources

  • ⭐️ Learn about the latest updates to Granite: https://www.ibm.com/granite
  • 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
  • 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

ibm-granite/granite-embedding-30m-english

Author: ibm-granite

sentence-similarity sentence-transformers
↓ 93.2K ♥ 29

Created: 2024-12-04 20:28:47+00:00

Updated: 2025-08-22 15:49:49+00:00

View on Hugging Face

Files (14)

.gitattributes
1_Pooling/config.json
README.md
config.json
merges.txt
model.onnx ONNX
model.safetensors
modules.json
pytorch_model.bin
sentence_bert_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json