README

multilingual-e5-small ONNX for pg_onnx

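An ONNX conversion of intfloat/multilingual-e5-small, optimized for pgvector and pg_onnx.

This repository enables native semantic search and embedding inference inside PostgreSQL, using ONNXRuntime and pg_onnx.

It contains the following files:

encoder.onnx: sentence embedding encoder
text_to_embedding.onnx: wrapper model for direct text-to-embedding inference
tokenizer.onnx: ONNX-converted tokenizer for fast, portable preprocessing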

Files

File	Purpose	Size
`encoder.onnx`	Sentence embedding encoder	~470 MB
`text_to_embedding.onnx`	Wrapper for direct text-to-embedding inference	~475 MB
`tokenizer.onnx`	ONNX-serialized tokenizer	~5 MB

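Conversion details

The model was converted from the original multilingual-e5-small using:

Hugging Face Transformers to load the model and tokenizer
ONNX export via transformers.onnx and custom scripts
Tokenizer serialization to ONNX using tokenizers and custom conversion logic

All components are designed to run natively in PostgreSQL through pg_onnx, so that efficient semantic search and embedding inference can happen directly in the database.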
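For context on how a sentence embedding is derived from encoder output: the upstream multilingual-e5-small model card averages the last hidden states over non-padding tokens and then L2-normalizes the result. This README does not state whether text_to_embedding.onnx includes those steps, so the following is only a pure-Python sketch of that pooling under that assumption. The function names and the toy input are illustrative and are not part of this repository.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    kept = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            for j, x in enumerate(vec):
                summed[j] += x
    if kept == 0:
        raise ValueError("attention mask excludes every token")
    return [s / kept for s in summed]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


# Toy input: three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(mean_pool(tokens, mask))
```

With unit-length vectors, the inner product of two embeddings equals their cosine similarity.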

Download from Hugging Face

from huggingface_hub import hf_hub_download

# Adjust repo_id if you fork or rename the repository
repo_id = "oga5/multilingual-e5-small-pg-onnx"

enc_path = hf_hub_download(repo_id=repo_id, filename="encoder.onnx")
tte_path = hf_hub_download(repo_id=repo_id, filename="text_to_embedding.onnx")
tok_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.onnx")

print(enc_path, tte_path, tok_path)
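hf_hub_download returns paths inside the local Hugging Face cache, while the SQL in the usage section reads the models with pg_read_binary_file, which reads from the database server's filesystem. A minimal sketch of copying the files into a directory the server can read might look like this. The helper name stage_models and the destination directory are assumptions for illustration.

```python
import shutil
from pathlib import Path


def stage_models(cached_paths, dest_dir):
    """Copy downloaded model files into dest_dir and return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in cached_paths:
        target = dest / Path(src).name
        # shutil.copy follows symlinks, so the real file content is copied
        # rather than a link into the cache.
        shutil.copy(src, target)
        staged.append(str(target))
    return staged
```

For example, stage_models([enc_path, tte_path, tok_path], "/PATH") would place the three files where the usage SQL expects them, with /PATH standing in for a real directory as it does there.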

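Setup

Required libraries:

onnxruntime
onnxruntime-extensions (built with -DOCOS_ENABLE_SPM_TOKENIZER=ON)
pg_onnx (v1.23.1b or later)
pgvector (provides the vector type and the HNSW index used in the usage examples)

-- After building pg_onnx, install the pg_onnx extension

create extension pg_onnx;

-- The usage examples also rely on pgvector's vector type
create extension if not exists vector;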

Usage


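-- Register the models (/PATH is a placeholder for the directory on the database server that holds the files)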
select pg_onnx_import_model('e5-tok', 'v1', pg_read_binary_file('/PATH/tokenizer.onnx')::bytea, '{"ortextensions_path": "libortextensions.so"}'::jsonb, 'e5 tokenizer');
select pg_onnx_import_model('e5-embedding', 'v1', pg_read_binary_file('/PATH/text_to_embedding.onnx')::bytea, '{"ortextensions_path": "libortextensions.so"}'::jsonb, 'e5 text to embedding');

-- Create the functions
create or replace function e5_tok(input_text text)
returns integer[]
AS $$
    SELECT array_agg(value::int)
    FROM jsonb_array_elements_text(
        pg_onnx_execute_session(
            'e5-tok',
            'v1',
            jsonb_build_object('inputs', jsonb_build_array(input_text))
        ) -> 'tokens'
    );
$$
language sql
immutable;

create or replace function e5_embedding(input_text text)
returns vector(384)
as $$
    select array(
        select jsonb_array_elements_text(
            pg_onnx_execute_session(
                'e5-embedding',
                'v1',
                jsonb_build_object('text', jsonb_build_array(input_text))
            )->'embedding'->0
        )::float
    )::vector(384);
$$
language sql
immutable;

create or replace function e5_embedding_passage(input_text text)
returns vector
AS $$
    select e5_embedding('passage: ' || input_text);
$$
language 'sql'
immutable;

create or replace function e5_embedding_query(input_text text)
returns vector
AS $$
    select e5_embedding('query: ' || input_text);
$$
language 'sql'
immutable;

-- Create sample data
create table llm_test (
    i integer not null primary key,
    txt text,
    v vector(384)
);

create index llm_test_v_idx on llm_test using hnsw (v vector_ip_ops);

insert into llm_test (i,txt) values ('1','Machine learning is a subfield of artificial intelligence');
insert into llm_test (i,txt) values ('2','A database is a system for managing data');
insert into llm_test (i,txt) values ('3','PostgreSQL is a powerful open-source database');
insert into llm_test (i,txt) values ('4','Vector search retrieves results by computing similarity');
insert into llm_test (i,txt) values ('5','ONNX is a standard format for machine learning models');
insert into llm_test (i,txt) values ('6','Natural language processing is a technology for handling text');
insert into llm_test (i,txt) values ('7','Embeddings convert text into vectors');
insert into llm_test (i,txt) values ('8','Cosine similarity measures similarity between vectors');
insert into llm_test (i,txt) values ('9','A tokenizer splits text into tokens');
insert into llm_test (i,txt) values ('10','Transformers are a modern neural network architecture');
insert into llm_test (i,txt) values ('11','SQL is a language for manipulating databases');
insert into llm_test (i,txt) values ('12','Indexes improve query performance');
insert into llm_test (i,txt) values ('13','pgvector is a vector extension for PostgreSQL');
insert into llm_test (i,txt) values ('14','Semantic search retrieves based on meaning');
insert into llm_test (i,txt) values ('15','Neural networks mimic the structure of the brain');
insert into llm_test (i,txt) values ('16','Deep learning uses multi-layer neural networks');
insert into llm_test (i,txt) values ('17','Batch processing handles multiple data at once');
insert into llm_test (i,txt) values ('18','Model inference performs prediction with a trained model');
insert into llm_test (i,txt) values ('19','Fine-tuning adapts an existing model to a specific task');
insert into llm_test (i,txt) values ('20','A cross-encoder evaluates the relevance between two texts');

-- Compute and store the embeddings
update llm_test set v = e5_embedding_passage(txt);

-- Search (<#> returns the negative inner product, so the best matches sort first)
with q as (
    select 'What is machine learning?' as query
),
qv as materialized (
    select e5_embedding_query(q.query) as v from q
)
select i, txt, t.v <#> qv.v as distance
from llm_test t, qv
order by distance;
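To make the ranking in the search query easier to read: pgvector's <#> operator returns the negative inner product, so the most similar rows have the smallest (most negative) values and ascending order puts them first. The sketch below mirrors that logic in plain Python, together with the "query: " and "passage: " prefixes that E5 models expect. The helper names are hypothetical and the 3-dimensional vectors are made up for illustration; real embeddings from this model have 384 dimensions.

```python
def with_prefix(kind, text):
    """Prepend the E5 role prefix; kind must be 'query' or 'passage'."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"


def neg_inner_product(a, b):
    """The same quantity that pgvector's <#> operator returns."""
    return -sum(x * y for x, y in zip(a, b))


def rank(query_vec, rows):
    """rows is a list of (id, vector); returns (id, distance) sorted ascending."""
    scored = [(row_id, neg_inner_product(vec, query_vec)) for row_id, vec in rows]
    return sorted(scored, key=lambda pair: pair[1])


# Toy vectors, not real model output.
rows = [(1, [1.0, 0.0, 0.0]), (2, [0.0, 1.0, 0.0]), (3, [0.6, 0.8, 0.0])]
ranked = rank([1.0, 0.0, 0.0], rows)
```

Here row 1 is identical to the query vector and ranks first, and row 2 is orthogonal to it and ranks last.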

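License

This project is released under the MIT License. See LICENSE for the full text.

This repository redistributes the original model weights from intfloat/multilingual-e5-small without modification. The original MIT License is retained as required.

Credits

Original model: intfloat/multilingual-e5-small
ONNX conversion and packaging: oga5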

oga5/multilingual-e5-small-pg-onnx

Author: oga5

sentence-similarity onnx

Created: 2025-10-13 13:27:26+00:00

Updated: 2025-10-17 13:14:31+00:00

Files (6)

.gitattributes

LICENSE

README.md

encoder.onnx ONNX

text_to_embedding.onnx ONNX

tokenizer.onnx ONNX