
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="图片描述">

SmolVLM2-256M-Video

SmolVLM2-256M-Video is a lightweight multimodal model designed to analyze video content. The model processes video, image, and text inputs and generates text outputs — whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its small size, video inference requires only 1.38 GB of GPU RAM. This efficiency makes it particularly well suited for on-device applications that need fine-tuning on specific domains and where computational resources may be limited.

Model Summary

  • Developed by: Hugging Face 🤗
  • Model type: Multimodal model (image/multi-image/video/text)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Architecture: Based on Idefics3 (see technical summary)


Uses

SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of a text query along with a video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling grounded in visual content. The model does not support image or video generation.

To fine-tune SmolVLM2 on a specific task, you can follow the fine-tuning tutorial.

Evaluation

We evaluated the performance of the SmolVLM2 family on the following scientific benchmarks:

| Size | Video-MME | MLVU | MVBench |
|------|-----------|------|---------|
| 2.2B | 52.1 | 55.2 | 46.27 |
| 500M | 42.2 | 47.3 | 39.73 |
| 256M | 33.7 | 40.6 | 32.7 |
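As a quick sanity check on the table above, the per-scale averages across the three benchmarks can be computed (an illustrative aggregate, not an official metric):

```python
# Benchmark scores from the table above: (Video-MME, MLVU, MVBench)
scores = {
    "2.2B": (52.1, 55.2, 46.27),
    "500M": (42.2, 47.3, 39.73),
    "256M": (33.7, 40.6, 32.7),
}
# Average each scale's scores, rounded to two decimals.
averages = {size: round(sum(vals) / len(vals), 2) for size, vals in scores.items()}
print(averages)  # → {'2.2B': 51.19, '500M': 43.08, '256M': 35.67}
```

As expected, performance scales with model size, with the 256M checkpoint trading some accuracy for its much smaller footprint.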

How to get started

You can use transformers to load, run inference with, and fine-tune SmolVLM. Make sure you have num2words, flash-attn, and the latest version of transformers installed. You can load the model as follows.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,                 # half-precision weights to save memory
    _attn_implementation="flash_attention_2"    # requires flash-attn and a CUDA GPU
).to("cuda")
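flash_attention_2 is only usable with a CUDA GPU and the flash-attn package installed. A defensive loading pattern can fall back to the defaults when either is missing — the sketch below uses a hypothetical `build_model_kwargs` helper, with dtypes shown as strings for illustration:

```python
def build_model_kwargs(has_cuda: bool, has_flash_attn: bool) -> dict:
    """Pick loading kwargs that degrade gracefully without a GPU."""
    kwargs = {"torch_dtype": "bfloat16" if has_cuda else "float32"}
    if has_cuda and has_flash_attn:
        # FlashAttention 2 only runs on CUDA devices.
        kwargs["_attn_implementation"] = "flash_attention_2"
    return kwargs

print(build_model_kwargs(True, True))    # GPU + flash-attn: bfloat16, flash_attention_2
print(build_model_kwargs(False, False))  # CPU only: float32, default attention
```

Omitting `_attn_implementation` lets transformers select its default attention backend, which works on CPU as well.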

Simple Inference

You can preprocess your inputs directly with the chat template and pass them straight to the model:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},            
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Video Inference

To use SmolVLM2 for video inference, make sure you have decord installed.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
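Under the hood, the processor samples a fixed number of frames from the clip rather than feeding every frame to the model. Uniform frame-index sampling can be sketched as follows (`sample_frame_indices` is an illustrative helper, not the processor's exact logic):

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    # Return evenly spaced frame indices across the whole clip.
    if total_frames <= num_frames:
        return list(range(total_frames))  # short clip: keep every frame
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

# A 10-second clip at 30 fps, sampled down to 8 frames:
print(sample_frame_indices(300, 8))  # → [0, 37, 75, 112, 150, 187, 225, 262]
```

This is why long videos fit in the model's limited context: only the sampled frames are encoded.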

Multi-image Interleaved Inference

You can interleave multiple media items with text using the chat template.

import torch


messages = [
    {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is the similarity between these two images?"},
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},            
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
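When comparing many images, the messages list can also be built programmatically instead of written out by hand (`interleave` is a hypothetical helper; the URLs are placeholders):

```python
def interleave(question: str, image_urls: list[str]) -> list[dict]:
    # Build a single user turn: the text query followed by each image.
    content = [{"type": "text", "text": question}]
    content += [{"type": "image", "url": url} for url in image_urls]
    return [{"role": "user", "content": content}]

messages = interleave("What do these have in common?", ["a.jpg", "b.jpg"])
print(len(messages[0]["content"]))  # → 3 (one text entry plus two images)
```

The resulting structure matches what `processor.apply_chat_template` expects, so it can be passed in exactly like the hand-written examples above.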


Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but is not accurate. Misuse includes, but is not limited to:

  • Prohibited uses:
    • Evaluating or scoring individuals (e.g., in employment, education, credit)
    • Critical automated decision-making
    • Generating unreliable factual content
  • Malicious activities:
    • Spam generation
    • Disinformation campaigns
    • Harassment or abuse
    • Unauthorized surveillance

License

SmolVLM2 uses SigLIP as its image encoder and SmolLM2 as its text decoder.

We release the SmolVLM2 checkpoints under the Apache 2.0 license.

Citation information

You can cite us in the following way:

@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models}, 
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}

Training Data

SmolVLM2 was trained on 3.3 million samples drawn from ten different datasets: LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat, and ShareGPT4Video. The tables below give an overview of the distribution of samples across modalities and their origin.

Data split per modality

| Data type | Percentage |
|-----------|------------|
| Image | 34.4% |
| Text | 20.2% |
| Video | 33.0% |
| Multi-image | 12.3% |
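The four modality shares above account for essentially the whole training mix; summing them confirms they cover ~100% up to rounding:

```python
# Modality shares from the table above, in percent.
modality_pct = {"image": 34.4, "text": 20.2, "video": 33.0, "multi-image": 12.3}
total = round(sum(modality_pct.values()), 1)
print(total)  # → 99.9, i.e. 100% within rounding
```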

Granular dataset slices per modality

Text datasets

| Dataset | Percentage |
|---------|------------|
| llava-onevision/magpie_pro_ft3_80b_mt | 6.8% |
| llava-onevision/magpie_pro_ft3_80b_tt | 6.8% |
| llava-onevision/magpie_pro_qwen2_72b_tt | 5.8% |
| llava-onevision/mathqa | 0.9% |

Multi-image datasets

| Dataset | Percentage |
|---------|------------|
| m4-instruct-data/m4_instruct_multiimage | 10.4% |
| mammoth/multiimage-cap6 | 1.9% |

Image datasets

| Dataset | Percentage |
|---------|------------|
| llava-onevision/other | 17.4% |
| llava-onevision/vision_flan | 3.9% |
| llava-onevision/mavis_math_metagen | 2.6% |
| llava-onevision/mavis_math_rule_geo | 2.5% |
| llava-onevision/sharegpt4o | 1.7% |
| llava-onevision/sharegpt4v_coco | 1.5% |
| llava-onevision/image_textualization | 1.3% |
| llava-onevision/sharegpt4v_llava | 0.9% |
| llava-onevision/mapqa | 0.9% |
| llava-onevision/qa | 0.8% |
| llava-onevision/textocr | 0.8% |

Video datasets

| Dataset | Percentage |
|---------|------------|
| llava-video-178k/1-2m | 7.3% |
| llava-video-178k/2-3m | 7.0% |
| other-video/combined | 5.7% |
| llava-video-178k/hound | 4.4% |
| llava-video-178k/0-30s | 2.4% |
| video-star/starb | 2.2% |
| vista-400k/combined | 2.2% |
| vript/long | 1.0% |
| ShareGPT4Video/all | 0.8% |

configint/SmolVLM2-256M-Video-Instruct-ActionTokens

Author: configint

image-text-to-text transformers

Created: 2025-07-26 09:33:09+00:00

Updated: 2025-07-26 09:36:48+00:00


Files (194)

.gitattributes
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/.gitattributes
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/README.md
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/added_tokens.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/chat_template.jinja
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/chat_template.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/config.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/generation_config.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/merges.txt
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/model.safetensors
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_bnb4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_fp16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_int8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_q4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_q4f16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_quantized.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_uint8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_bnb4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_fp16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_int8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_q4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_q4f16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_quantized.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_uint8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_bnb4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_fp16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_int8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_q4.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_q4f16.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_quantized.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_uint8.onnx ONNX
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/preprocessor_config.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/processor_config.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/special_tokens_map.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/tokenizer.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/tokenizer_config.json
HuggingFaceTB_SmolVLM2-500M-Video-Instruct/vocab.json
README.md
SmolVLM2-500M-Video-Instruct-Action/.gitattributes
SmolVLM2-500M-Video-Instruct-Action/README.md
SmolVLM2-500M-Video-Instruct-Action/added_tokens.json
SmolVLM2-500M-Video-Instruct-Action/chat_template.jinja
SmolVLM2-500M-Video-Instruct-Action/chat_template.json
SmolVLM2-500M-Video-Instruct-Action/config.json
SmolVLM2-500M-Video-Instruct-Action/generation_config.json
SmolVLM2-500M-Video-Instruct-Action/merges.txt
SmolVLM2-500M-Video-Instruct-Action/model.safetensors
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/decoder_model_merged_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/embed_tokens_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/onnx/vision_encoder_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-Action/preprocessor_config.json
SmolVLM2-500M-Video-Instruct-Action/processor_config.json
SmolVLM2-500M-Video-Instruct-Action/special_tokens_map.json
SmolVLM2-500M-Video-Instruct-Action/tokenizer.json
SmolVLM2-500M-Video-Instruct-Action/tokenizer_config.json
SmolVLM2-500M-Video-Instruct-Action/vocab.json
SmolVLM2-500M-Video-Instruct-ActionTokens/.gitattributes
SmolVLM2-500M-Video-Instruct-ActionTokens/README.md
SmolVLM2-500M-Video-Instruct-ActionTokens/added_tokens.json
SmolVLM2-500M-Video-Instruct-ActionTokens/chat_template.jinja
SmolVLM2-500M-Video-Instruct-ActionTokens/chat_template.json
SmolVLM2-500M-Video-Instruct-ActionTokens/config.json
SmolVLM2-500M-Video-Instruct-ActionTokens/generation_config.json
SmolVLM2-500M-Video-Instruct-ActionTokens/merges.txt
SmolVLM2-500M-Video-Instruct-ActionTokens/model.safetensors
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/decoder_model_merged_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/embed_tokens_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/onnx/vision_encoder_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct-ActionTokens/preprocessor_config.json
SmolVLM2-500M-Video-Instruct-ActionTokens/processor_config.json
SmolVLM2-500M-Video-Instruct-ActionTokens/special_tokens_map.json
SmolVLM2-500M-Video-Instruct-ActionTokens/tokenizer.json
SmolVLM2-500M-Video-Instruct-ActionTokens/tokenizer_config.json
SmolVLM2-500M-Video-Instruct-ActionTokens/vocab.json
SmolVLM2-500M-Video-Instruct/.gitattributes
SmolVLM2-500M-Video-Instruct/README.md
SmolVLM2-500M-Video-Instruct/added_tokens.json
SmolVLM2-500M-Video-Instruct/chat_template.json
SmolVLM2-500M-Video-Instruct/config.json
SmolVLM2-500M-Video-Instruct/generation_config.json
SmolVLM2-500M-Video-Instruct/merges.txt
SmolVLM2-500M-Video-Instruct/model.safetensors
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/decoder_model_merged_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/embed_tokens_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_bnb4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_fp16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_int8.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_q4.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_q4f16.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_quantized.onnx ONNX
SmolVLM2-500M-Video-Instruct/onnx/vision_encoder_uint8.onnx ONNX
SmolVLM2-500M-Video-Instruct/preprocessor_config.json
SmolVLM2-500M-Video-Instruct/processor_config.json
SmolVLM2-500M-Video-Instruct/special_tokens_map.json
SmolVLM2-500M-Video-Instruct/tokenizer.json
SmolVLM2-500M-Video-Instruct/tokenizer_config.json
SmolVLM2-500M-Video-Instruct/vocab.json
added_tokens.json
chat_template.jinja
chat_template.json
config.json
generation_config.json
merges.txt
model.safetensors
onnx/decoder_model_merged.onnx ONNX
onnx/decoder_model_merged_bnb4.onnx ONNX
onnx/decoder_model_merged_fp16.onnx ONNX
onnx/decoder_model_merged_int8.onnx ONNX
onnx/decoder_model_merged_q4.onnx ONNX
onnx/decoder_model_merged_q4f16.onnx ONNX
onnx/decoder_model_merged_quantized.onnx ONNX
onnx/decoder_model_merged_uint8.onnx ONNX
onnx/embed_tokens.onnx ONNX
onnx/embed_tokens_bnb4.onnx ONNX
onnx/embed_tokens_fp16.onnx ONNX
onnx/embed_tokens_int8.onnx ONNX
onnx/embed_tokens_q4.onnx ONNX
onnx/embed_tokens_q4f16.onnx ONNX
onnx/embed_tokens_quantized.onnx ONNX
onnx/embed_tokens_uint8.onnx ONNX
onnx/vision_encoder.onnx ONNX
onnx/vision_encoder_bnb4.onnx ONNX
onnx/vision_encoder_fp16.onnx ONNX
onnx/vision_encoder_int8.onnx ONNX
onnx/vision_encoder_q4.onnx ONNX
onnx/vision_encoder_q4f16.onnx ONNX
onnx/vision_encoder_quantized.onnx ONNX
onnx/vision_encoder_uint8.onnx ONNX
preprocessor_config.json
processor_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json