Granite-Embedding-30m-English (revision r1.1)
Model Summary: Granite-Embedding-30m-English is a 30M-parameter dense bi-encoder embedding model from the Granite Embeddings collection that can be used to generate high-quality text embeddings. The model produces embedding vectors of size 384 and is trained on open-source relevance-pair datasets with permissive, enterprise-friendly licenses, alongside datasets collected and generated by IBM. While maintaining competitive scores on academic benchmarks such as BEIR, it also performs well on many enterprise use cases. The model was developed using retrieval-oriented pretraining, contrastive fine-tuning, knowledge distillation, and model merging to improve performance.
Granite-embedding-30m-r1.1 is purpose-built to support multi-turn information retrieval, handling contextual document retrieval in multi-turn conversational settings. It is trained on data tailored for multi-turn conversational information retrieval and is distilled from granite-embedding-30m-english (https://huggingface.co/ibm-granite/granite-embedding-30m-english) using multi-teacher distillation.
- Developers: Granite Embedding Team, IBM
- GitHub Repository: ibm-granite/granite-embedding-models
- Website: Granite Docs
- Paper: Technical Report
- Release Date: August 29, 2025 (granite-embedding-30m-english-r1.1)
- License: Apache 2.0
Supported Languages: English.
Intended Use: The model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications.
Usage with Sentence Transformers: The model is compatible with the SentenceTransformer library and is very easy to use:
First, install the sentence transformers library:

```shell
pip install sentence_transformers
```
The model can then be used to encode pairs of text and find the similarity between their representations:
Granite-Embedding-30m-English
```python
from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-30m-english"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path)

input_queries = [
    ' Who made the song My achy breaky heart? ',
    'summit define'
]

input_passages = [
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991. ",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]

# Encode the queries and passages
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Compute cosine similarity scores between queries and passages
print(util.cos_sim(query_embeddings, passage_embeddings))
```
Granite-Embedding-30m-r1.1
Specifically, to encode an entire conversation with granite-embedding-30m-r1.1, take the conversation's last user query as the input and arrange the conversation turns in reverse chronological order: the last user query first, then the preceding agent response, then the earlier user query, and so on. For example:
Conversation: user:<user_query_1> agent: <agent_response_1> user:<user_query_2> agent: <agent_response_2> user:<user_query_3> agent: <agent_response_3> user:<last_user_query>

is converted to the input query format: <last_user_query>[SEP]agent: <agent_response_3>||user:<user_query_3>||agent: <agent_response_2>||user:<user_query_2>||agent: <agent_response_1>||user:<user_query_1>
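The reverse-order formatting described above can be automated with a small helper. The sketch below is illustrative, not part of the model's API: the function name `format_multiturn_query` and the `(role, text)` tuple input are assumptions, and the separators follow the worked examples on this card (a space after `agent:`/`user:`).

```python
def format_multiturn_query(turns):
    """Format a conversation for granite-embedding-30m-r1.1.

    `turns` is a chronological list of (role, text) tuples ending with
    the last user query. The query comes first, followed by the earlier
    turns in reverse chronological order, joined by "||".
    """
    *history, (last_role, last_query) = turns
    if last_role != "user":
        raise ValueError("conversation must end with a user query")
    context = "||".join(f"{role}: {text}" for role, text in reversed(history))
    return f"{last_query}[SEP]{context}" if context else last_query

turns = [
    ("user", "How many teams are in the NFL?"),
    ("agent", "There are 32 teams in the National Football League (NFL)."),
    ("user", "How many teams are in the NFL playoffs?"),
]
print(format_multiturn_query(turns))
# → "How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams
#    in the National Football League (NFL).||user: How many teams are in the NFL?"
```

The formatted string can then be passed directly as an `input_queries` entry, as in the example below.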
```python
from sentence_transformers import SentenceTransformer, util

model_path = "ibm-granite/granite-embedding-30m-english"
# Load the Sentence Transformer model at the r1.1 revision
model = SentenceTransformer(model_path, revision="granite-embedding-30m-r1.1")

input_queries = [
    "Which team has won the most Super Bowls?[SEP]agent: Six teams from each conference (AFC and NFC), for a total of 12 team playoff system.||user: How many teams are in the NFL playoffs?||agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?",
    "How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?||agent: The Chicago Cardinals became the St. Louis Cardinals in 1960 and eventually moved and became the Arizona Cardinals. The Chicago Cardinals ( now the Arizona Cardinals ) were a founding member of the NFL.||user: Are the Arizona Cardinals and the Chicago Cardinals the same team?||agent: The Arizona Cardinals do play outside the United States. They had a game in London, England, on October 22, 2017, against the Los Angeles Rams at Twickenham Stadium and in 2005 they played in Mexico.||user: Do the Arizona Cardinals play outside the US?"
]

input_passages = [
    "Super Bowl\nThe Pittsburgh Steelers have won six Super Bowls , the most of any team ; the Dallas Cowboys , New England Patriots and San Francisco 49ers have five victories each , while the Green Bay Packers and New York Giants have four Super Bowl championships . Fourteen other NFL franchises have won at least one Super Bowl . Eight teams have appeared in Super Bowl games without a win . The Minnesota Vikings were the first team to have appeared a record four times without a win . The Buffalo Bills played in a record four Super Bowls in a row and lost every one . Four teams ( the Cleveland Browns , Detroit Lions , Jacksonville Jaguars , and Houston Texans ) have never appeared in a Super Bowl . The Browns and Lions both won NFL Championships prior to the creation of the Super Bowl , while the Jaguars ( 1995 ) and Texans ( 2002 ) are both recent NFL expansion teams . ( Detroit , Houston , and Jacksonville , however , have hosted a Super Bowl , leaving the Browns the only team to date who has neither played in nor whose city has hosted the game . ) The Minnesota Vikings won the last NFL Championship before the merger but lost to the AFL champion Kansas City Chiefs in Super Bowl IV.",
    "NFL playoffs \n The 32 - team National Football League is divided into two conferences , American Football Conference ( AFC ) and National Football Conference ( NFC ) , each with 16 teams . Since 2002 , each conference has been further divided into four divisions of four teams each . The tournament brackets are made up of six teams from each of the league 's two conferences , following the end of the regular season . Qualification into the playoffs works as follows : "
]

# Encode the queries and passages
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Compute cosine similarity scores between queries and passages
print(util.cos_sim(query_embeddings, passage_embeddings))
```
Usage with Huggingface Transformers: This is a simple example of how to use the Granite-Embedding-30m-English model with the Transformers library and PyTorch.
First, install the required libraries:

```shell
pip install transformers torch
```
The model can then be used to encode pairs of text:
Granite-Embedding-30m-English
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-30m-english"

# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    ' Who made the song My achy breaky heart? ',
    'summit define'
]

# Tokenize the inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# Encode the queries
with torch.no_grad():
    model_output = model(**tokenized_queries)

# Perform pooling. granite-embedding-30m-english uses CLS pooling.
query_embeddings = model_output[0][:, 0]

# Normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
```
Granite-Embedding-30m-r1.1
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "ibm-granite/granite-embedding-30m-english"

# Load the model at the r1.1 revision, along with the tokenizer
model = AutoModel.from_pretrained(model_path, revision="granite-embedding-30m-r1.1")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

input_queries = [
    "Which team has won the most Super Bowls?[SEP]agent: Six teams from each conference (AFC and NFC), for a total of 12 team playoff system.||user: How many teams are in the NFL playoffs?||agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?",
    "How many teams are in the NFL playoffs?[SEP]agent: There are 32 teams in the National Football League (NFL).||user: How many teams are in the NFL?||agent: The Chicago Cardinals became the St. Louis Cardinals in 1960 and eventually moved and became the Arizona Cardinals. The Chicago Cardinals ( now the Arizona Cardinals ) were a founding member of the NFL.||user: Are the Arizona Cardinals and the Chicago Cardinals the same team?||agent: The Arizona Cardinals do play outside the United States. They had a game in London, England, on October 22, 2017, against the Los Angeles Rams at Twickenham Stadium and in 2005 they played in Mexico.||user: Do the Arizona Cardinals play outside the US?"
]

# Tokenize the inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')

# Encode the queries
with torch.no_grad():
    model_output = model(**tokenized_queries)

# Perform pooling. The r1.1 revision also uses CLS pooling.
query_embeddings = model_output[0][:, 0]

# Normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
```
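The Transformers snippets above only encode the queries; passages go through the same CLS pooling and L2 normalization, after which cosine similarity is simply a matrix product of the unit vectors. A minimal numpy sketch with stand-in random vectors (in place of real model outputs; `l2_normalize` is an illustrative helper mirroring the `torch.nn.functional.normalize` call above):

```python
import numpy as np

def l2_normalize(x):
    # Same effect as torch.nn.functional.normalize(x, p=2, dim=1)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-in embeddings: 2 queries, 3 passages (the real model uses dim 384)
rng = np.random.default_rng(0)
query_embeddings = l2_normalize(rng.normal(size=(2, 8)))
passage_embeddings = l2_normalize(rng.normal(size=(3, 8)))

# For unit vectors, cosine similarity reduces to a dot product
scores = query_embeddings @ passage_embeddings.T   # shape (2, 3)
best_passage = scores.argmax(axis=1)               # best-scoring passage per query
```

With real model outputs, `scores[i, j]` is the cosine similarity between query `i` and passage `j`, matching what `util.cos_sim` returns in the Sentence Transformers examples.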
Evaluation:
The Granite-Embedding-30M-English model is twice as fast as other models with similar embedding dimensions while maintaining competitive performance. Below is the performance of the Granite-Embedding-30M-English model on the MTEB Retrieval (i.e., BEIR) and code retrieval (CoIR) benchmarks.
| Model | Parameters (M) | Embedding Dimension | MTEB Retrieval (15) | CoIR (10) |
|---|---|---|---|---|
| granite-embedding-30m-english | 30 | 384 | 49.1 | 47.0 |
The granite-embedding-30m-r1.1 revision retains the speed of granite-embedding-30m-english while excelling on multi-turn information retrieval benchmarks. Below is the performance of the granite-embedding-30m-r1.1 model on the MTEB Retrieval (i.e., BEIR) and multi-turn information retrieval datasets (MTRAG (https://github.com/IBM/mt-rag-benchmark), Multidoc2dial (https://github.com/IBM/multidoc2dial)).
| Model | Parameters (M) | Embedding Dimension | MTEB Retrieval (15) | MT-RAG | Mdoc2dial |
|---|---|---|---|---|---|
| granite-embedding-30m-english | 30 | 384 | 49.1 | 49.16 | 85.42 |
| granite-embedding-30m-english-r1.1 | 30 | 384 | 48.9 | 52.33 | 85.78 |
| bge-small-en-v1.5 | 33 | 512 | 53.86 | 38.26 | 83.71 |
| e5-small-v2 | 33 | 384 | 48.46 | 28.72 | 75.7 |
Model Architecture: granite-embedding-30m-english is based on an encoder-only, RoBERTa-like transformer architecture, trained internally at IBM Research. granite-embedding-30m-r1.1 shares the same architecture as granite-embedding-30m-english.
| Model | granite-embedding-30m-english | granite-embedding-125m-english | granite-embedding-107m-multilingual | granite-embedding-278m-multilingual |
|---|---|---|---|---|
| Embedding size | 384 | 768 | 384 | 768 |
| Number of layers | 6 | 12 | 6 | 12 |
| Number of attention heads | 12 | 12 | 12 | 12 |
| Intermediate size | 1536 | 3072 | 1536 | 3072 |
| Activation function | GeLU | GeLU | GeLU | GeLU |
| Vocabulary size | 50265 | 50265 | 250002 | 250002 |
| Max. sequence length | 512 | 512 | 512 | 512 |
| # Parameters | 30M | 125M | 107M | 278M |
Training Data: Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:
| Dataset | Num. Pairs |
|---|---|
| SPECTER citation triplets | 684,100 |
| Stack Exchange Duplicate questions (titles) | 304,525 |
| Stack Exchange Duplicate questions (bodies) | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | 250,460 |
| Natural Questions (NQ) | 100,231 |
| SQuAD2.0 | 87,599 |
| PAQ (Question, Answer) pairs | 64,371,441 |
| Stack Exchange (Title, Answer) pairs | 4,067,139 |
| Stack Exchange (Title, Body) pairs | 23,978,013 |
| Stack Exchange (Title+Body, Answer) pairs | 187,195 |
| S2ORC Citation pairs (Titles) | 52,603,982 |
| S2ORC (Title, Abstract) | 41,769,185 |
| S2ORC (Citations, Abstracts) | 52,603,982 |
| WikiAnswers Duplicate question pairs | 77,427,422 |
| SearchQA | 582,261 |
| HotpotQA | 85,000 |
| Fever | 109,810 |
| Arxiv | 2,358,545 |
| Wikipedia | 20,745,403 |
| PubMed | 20,000,000 |
| Miracl En Pairs | 9,016 |
| DBPedia Title-Body pairs | 4,635,922 |
| Synthetic: Query-Wikipedia Passage | 1,879,093 |
| Synthetic: Fact Verification | 9,888 |
| IBM Internal Triples | 40,290 |
| IBM Internal Title-Body Pairs | 1,524,586 |
| MultiDoc2Dial training set (multi-turn conversations) | 21,451 |
| IBM Internal Synthetic Data | 19,533 |
Notably, we did not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license, while other open-source models train on this dataset because of its high quality.
Infrastructure: We trained the Granite Embedding models using IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training the models over multiple GPUs.
Ethical Considerations and Limitations: The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-embedding-30m-english and granite-embedding-30m-r1.1 are trained only on English text, with a context length of 512 tokens (longer texts are truncated).
Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
ibm-granite/granite-embedding-30m-english
Author: ibm-granite
Created: 2024-12-04 20:28:47+00:00
Updated: 2025-08-22 15:49:49+00:00