
Documentation

Fork of https://huggingface.co/thenlper/gte-small with ONNX weights for compatibility with Transformers.js. See the JavaScript usage below.


gte-small

General Text Embeddings (GTE) model.

The GTE models are trained by the Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently offered in three sizes: GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios. This enables them to be applied to various downstream text embedding tasks, including information retrieval, semantic textual similarity, text reranking, and more.
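As a sketch of the semantic textual similarity use case: two texts are embedded and their vectors compared by cosine similarity. The vectors below are toy stand-ins (gte-small actually produces 384-dimensional embeddings from real text), and the `cosine_similarity` helper is our own, not part of any library mentioned here.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 384-dimensional vectors standing in for gte-small embeddings
rng = np.random.default_rng(0)
v1 = rng.normal(size=384)
v2 = v1 + 0.1 * rng.normal(size=384)  # a slightly perturbed copy

print(cosine_similarity(v1, v2))   # close to 1.0 for similar vectors
print(cosine_similarity(v1, -v1))  # -1.0 for opposite vectors
```

Texts embedded from related sentences score close to 1.0; unrelated texts score lower.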

Metrics

The performance of the GTE models is compared with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, see the MTEB leaderboard.

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-large | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| gte-base | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| gte-small | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |

Usage

This model can be used with both Python and JavaScript.

Python

Use with Transformers and PyTorch:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("Supabase/gte-small")
model = AutoModel.from_pretrained("Supabase/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
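The `average_pool` step above zeroes out padded positions via the attention mask before averaging, so padding tokens do not dilute the sentence embedding. A minimal numerical sketch of the same computation with NumPy (toy numbers, not real model outputs):

```python
import numpy as np

# One sequence of three tokens; the last one is padding (mask = 0)
hidden = np.array([[[1.0, 3.0],
                    [3.0, 5.0],
                    [9.0, 9.0]]])      # shape: (batch, seq, dim)
mask = np.array([[1.0, 1.0, 0.0]])     # shape: (batch, seq)

masked = hidden * mask[..., None]                          # zero out the padded token
pooled = masked.sum(axis=1) / mask.sum(axis=1)[..., None]  # divide by real-token count

print(pooled)  # [[2. 4.]] — the mean of the two real tokens only
```

A plain mean over the sequence axis would instead include the padding values and give a different, incorrect result.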

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Supabase/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

JavaScript

This model can be used with JavaScript via Transformers.js.

Use with Deno (Supabase Edge Functions):

import { serve } from 'https://deno.land/std@0.168.0/http/server.ts'
import { env, pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0'

// Configuration for Deno runtime
env.useBrowserCache = false;
env.allowLocalModels = false;

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

serve(async (req) => {
  // Extract input string from JSON body
  const { input } = await req.json();

  // Generate the embedding from the user input
  const output = await pipe(input, {
    pooling: 'mean',
    normalize: true,
  });

  // Extract the embedding output
  const embedding = Array.from(output.data);

  // Return the embedding
  return new Response(
    JSON.stringify({ embedding }),
    { headers: { 'Content-Type': 'application/json' } }
  );
});

Use in the browser (as a JavaScript module):

<script type="module">

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0';

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);

console.log(embedding);

</script>

Use in Node.js or with a web bundler (Webpack, etc.):

import { pipeline } from '@xenova/transformers';

const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);

console.log(embedding);

Limitations

This model works exclusively on English text; any overlong input is truncated to a maximum of 512 tokens.
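The 512-token limit corresponds to the `max_length=512, truncation=True` arguments in the Python example above. A minimal sketch of what truncation does to an overlong token sequence (plain Python, no tokenizer required; the `truncate` helper is illustrative, not a library function):

```python
MAX_LENGTH = 512  # gte-small's maximum sequence length

def truncate(token_ids, max_length=MAX_LENGTH):
    # Mirrors tokenizer(..., truncation=True, max_length=512):
    # everything past the limit is silently dropped
    return token_ids[:max_length]

tokens = list(range(1000))    # a hypothetical 1000-token input
print(len(truncate(tokens)))  # 512
```

Text past the limit contributes nothing to the embedding, so long documents should be split into chunks of at most 512 tokens and embedded separately.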

ariakang/gte-small

Author: ariakang

Tags: feature-extraction, transformers.js

Created: 2025-02-07 22:40:27+00:00

Updated: 2025-02-07 22:41:06+00:00

View on Hugging Face

Files (12)

.gitattributes
README.md
config.json
onnx/model.onnx
onnx/model_fp16.onnx
onnx/model_quantized.onnx
pytorch_model.bin
quantize_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt