
<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>

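UForm is a multimodal inference package designed to encode multilingual text, images, and, soon, audio, video, and documents into a shared vector space.

This is the model card of the __English-only model__, which consists of:

4-layer BERT (2 layers for unimodal encoding, the remaining layers for multimodal encoding)
ViT-S/16 (image resolution 224x224)

If you need a multilingual model, see the separate multilingual model card.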

Evaluation

The following metrics were obtained with multimodal re-ranking (text-to-image retrieval):

Dataset	Recall@1	Recall@5	Recall@10
Zero-Shot Flickr	0.565	0.790	0.860
Zero-Shot MS-COCO	0.281	0.525	0.645

ImageNet-Top1: 0.361
ImageNet-Top5: 0.608
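The Recall@K columns above report, for each K, the fraction of text queries whose correct image appears among the K highest-ranked results. The following is a minimal sketch of that computation, not the evaluation script used for this model. It assumes one correct image per query, and the names `recall_at_k`, `toy_scores`, and `toy_truth` are illustrative only.

```python
import numpy as np

def recall_at_k(scores, ground_truth, k):
    """Fraction of queries whose correct item appears among the top-k scores.

    scores: (num_queries, num_items) array, higher means a better match.
    ground_truth: (num_queries,) array with the index of the correct item.
    """
    scores = np.asarray(scores)
    ground_truth = np.asarray(ground_truth)
    # Indices of the k highest-scoring items for every query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # A query counts as a hit if its correct item is anywhere in its top-k.
    hits = (top_k == ground_truth[:, None]).any(axis=1)
    return float(hits.mean())

# Made-up scores for three text queries against three images.
toy_scores = np.array([
    [0.9, 0.1, 0.3],  # correct item 0 is ranked first
    [0.2, 0.4, 0.8],  # correct item 1 is ranked second
    [0.7, 0.6, 0.1],  # correct item 2 is ranked last
])
toy_truth = np.array([0, 1, 2])

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(toy_scores, toy_truth, k):.3f}")
```

Recall@K can only stay the same or grow as K increases, which is consistent with the ordering of the three columns in the table.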

Installation

pip install "uform[onnx]"

Usage

To load the model:

import uform

model, processor = uform.get_model_onnx('unum-cloud/uform-vl-english-small', device='cpu', dtype='fp32')

To encode data:

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
score, joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask'],
    return_scores=True
)

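There are two ways to measure the semantic compatibility between an image and a text: cosine similarity and matching score.

Cosine Similarity

similarity lies in the range [-1, 1], where 1 means a perfect match.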
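The text refers to a `similarity` value without showing how it is obtained. The following is a minimal NumPy sketch of one way to compute it, assuming the embeddings returned by `encode_image` and `encode_text` are array-like with shape (batch, dim). The helper `cosine_similarity` and the stand-in arrays are illustrative and are not part of the uform API.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length; eps guards against division by zero.
    a = a / np.maximum(np.linalg.norm(a, axis=-1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=-1, keepdims=True), eps)
    # The dot product of unit vectors is the cosine of the angle between them.
    return np.sum(a * b, axis=-1)

# Stand-in embeddings; in practice these would be image_embedding and text_embedding.
image_embedding_demo = np.array([[1.0, 0.0, 0.0]])
text_embedding_demo = np.array([[1.0, 1.0, 0.0]])

similarity = cosine_similarity(image_embedding_demo, text_embedding_demo)
print(similarity)
```

Because both vectors are normalized first, the result depends only on the angle between the embeddings and not on their magnitudes.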

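Pros:

Computationally cheap.
Needs only unimodal embeddings, and unimodal encoding is faster than joint encoding.
Well suited to retrieval over large collections.

Cons:

Takes only coarse-grained features into account.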

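Matching Score

Unlike cosine similarity, unimodal embeddings alone are not enough. A joint embedding is required, and the resulting score lies in the range [0, 1], where 1 means a perfect match.

Pros:

The joint embedding captures fine-grained features.
Well suited to re-ranking, that is, ordering the results of a retrieval step.

Cons:

Resource-intensive.
Not suited to retrieval over large collections.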
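The trade-offs listed for the two measures suggest a two-stage pipeline: shortlist candidates with cheap cosine similarity, then reorder only the shortlist by matching score. The sketch below illustrates that idea under stated assumptions and is not a workflow prescribed by the library. `retrieve_then_rerank` and its `matching_score` callable are hypothetical; in practice the callable would wrap a joint-encoding call such as `encode_multimodal` with `return_scores=True`.

```python
import numpy as np

def retrieve_then_rerank(text_embedding, image_embeddings, matching_score, top_k=10):
    """Shortlist by cosine similarity, then reorder the shortlist by matching score.

    matching_score is a hypothetical callable: image index -> float in [0, 1].
    """
    text = np.asarray(text_embedding, dtype=np.float64)
    images = np.asarray(image_embeddings, dtype=np.float64)
    text = text / np.linalg.norm(text)
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    # Stage 1: cheap cosine similarity against the whole collection.
    similarities = images @ text
    shortlist = np.argsort(-similarities)[:top_k]
    # Stage 2: expensive joint scoring, only for the shortlisted candidates.
    scores = {int(i): float(matching_score(int(i))) for i in shortlist}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up data: four image embeddings, one text embedding, fixed matching scores.
demo_images = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
demo_text = np.array([1.0, 0.0])
demo_scores = {0: 0.4, 1: 0.9, 2: 0.8, 3: 0.1}

ranking = retrieve_then_rerank(demo_text, demo_images, demo_scores.get, top_k=2)
print(ranking)
```

In this toy run only images 0 and 1 reach the second stage, so image 2 is never scored jointly despite its high made-up matching score. This shows the cost of a small `top_k`: the re-ranker can only reorder what the first stage retrieves.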

unum-cloud/uform-vl-english-small-cpu-fp32

Author: unum-cloud

feature-extraction transformers


Created: 2024-03-28 11:57:49+00:00

Updated: 2024-03-28 16:31:31+00:00


Files (7)

.gitattributes

README.md

config.json

image_encoder.onnx ONNX

reranker.onnx ONNX

text_encoder.onnx ONNX

tokenizer.json