ONNX Model Library

SmolVLM-500M

SmolVLM-500M is a lightweight multimodal model in the SmolVLM family. It accepts arbitrary sequences of image and text inputs and produces text outputs, and is designed for efficient inference. SmolVLM can answer questions about images, describe visual content, or transcribe text. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks. It can run inference on a single image with as little as 1.23 GB of GPU memory.

Model Summary

  • Developed by: Hugging Face 🤗
  • Model type: Multimodal model (image + text)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Architecture: Based on Idefics3 (see the Technical Summary)

Resources

Uses

SmolVLM can be used for inference on multimodal (image + text) tasks where the input consists of a text query and one or more images. Text and images can be interleaved arbitrarily, enabling tasks such as image captioning, visual question answering, and storytelling grounded in visual content. The model does not support image generation.

To fine-tune SmolVLM for a specific task, see the fine-tuning tutorial.
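As a concrete illustration of the interleaving described above, a multi-image chat request is just a plain messages list in which "image" and "text" entries alternate in any order (the question texts here are made up for the sketch; the actual images are supplied separately to the processor, in the same order as the "image" placeholders):

```python
# Hedged sketch: a user turn that interleaves two images with text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # first image
            {"type": "text", "text": "What is in this photo?"},
            {"type": "image"},  # second image
            {"type": "text", "text": "How does it differ from the first one?"},
        ],
    },
]

# Count the image placeholders to know how many images must be passed
# alongside the prompt when calling the processor.
num_images = sum(
    1
    for message in messages
    for part in message["content"]
    if part["type"] == "image"
)
```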

Evaluation

Technical Summary

SmolVLM uses the lightweight SmolLM2 language model to provide a compact yet capable multimodal experience. Compared with the larger SmolVLM 2.2B model, it introduces several improvements:

  • Image compression: We apply more aggressive image compression than Idefics3 and SmolVLM-2.2B, making inference faster and reducing memory usage.
  • Visual token encoding: SmolVLM-500M encodes each 512×512 image patch with 64 visual tokens. Larger images are split into patches that are encoded separately, improving efficiency without degrading performance.
  • New special tokens: We added new special tokens to delimit sub-images, making image tokenization more efficient.
  • Smaller vision encoder: We reduced the 400M-parameter SigLIP vision encoder to just 93M parameters.
  • Larger image patches: We now pass 512×512 patches to the vision encoder, rather than the 384×384 patches used by the larger SmolVLM, so information is encoded more efficiently.
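The per-patch budget above lends itself to a simple back-of-envelope token count. The sketch below only does that arithmetic; the processor's real resizing and tiling policy (including any global thumbnail it may add) can differ, so treat it as a rough estimate rather than the model's exact behavior:

```python
import math

# Back-of-envelope arithmetic: each 512x512 patch costs 64 visual tokens,
# and larger images are split into a grid of patches.
PATCH_SIZE = 512
TOKENS_PER_PATCH = 64

def approx_visual_tokens(width: int, height: int) -> int:
    """Rough visual-token estimate for an image of the given size."""
    tiles_x = math.ceil(width / PATCH_SIZE)
    tiles_y = math.ceil(height / PATCH_SIZE)
    return tiles_x * tiles_y * TOKENS_PER_PATCH
```

For example, a 512×512 image fits in one patch (64 tokens), while a 2048×2048 image tiles into a 4×4 grid (1024 tokens), which is why lowering the input resolution reduces memory usage.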

See our technical report for more details on training and architecture.

How to Get Started

You can use transformers to load SmolVLM, run inference with it, and fine-tune it.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
"""
Assistant: The image depicts a cityscape featuring a prominent landmark, the Statue of Liberty, prominently positioned on Liberty Island. The statue is a green, humanoid figure with a crown atop its head and is situated on a small island surrounded by water. The statue is characterized by its large, detailed structure, with a statue of a woman holding a torch above her head and a tablet in her left hand. The statue is surrounded by a small, rocky island, which is partially visible in the foreground.
In the background, the cityscape is dominated by numerous high-rise buildings, which are densely packed and vary in height. The buildings are primarily made of glass and steel, reflecting the sunlight and creating a bright, urban skyline. The skyline is filled with various architectural styles, including modern skyscrapers and older, more traditional buildings.
The water surrounding the island is calm, with a few small boats visible, indicating that the area is likely a popular tourist destination. The water is a deep blue, suggesting that it is a large body of water, possibly a river or a large lake.
In the foreground, there is a small strip of land with trees and grass, which adds a touch of natural beauty to the urban landscape. The trees are green, indicating that it is likely spring or summer.
The image captures a moment of tranquility and reflection, as the statue and the cityscape come together to create a harmonious and picturesque scene. The statue's presence in the foreground draws attention to the city's grandeur, while the calm water and natural elements in the background provide a sense of peace and serenity.
In summary, the image showcases the Statue of Liberty, a symbol of freedom and democracy, set against a backdrop of a bustling cityscape. The statue is a prominent and iconic representation of human achievement, while the cityscape is a testament to human ingenuity and progress. The image captures the beauty and complexity of urban life, with the statue serving as a symbol of hope and freedom, while the cityscape provides a glimpse into the modern world.
"""

Model Optimizations

Precision: For better performance, load and run the model in half precision (torch.bfloat16) if your hardware supports it.

from transformers import AutoModelForVision2Seq
import torch

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16
).to("cuda")

You can also load SmolVLM with 4-/8-bit quantization using bitsandbytes, torchao, or Quanto. See this page for other options.

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    quantization_config=quantization_config,
)

Vision encoder efficiency: When initializing the processor, adjust the image resolution by setting size={"longest_edge": N*512}, where N is a value of your choice. The default N=4 works well and corresponds to a 2048×2048 input image. Decreasing N saves GPU memory and is appropriate for lower-resolution images. It is also useful if you want to fine-tune on videos.
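As a minimal sketch of that knob, you can derive the `size` argument from N and pass it at processor initialization (the `from_pretrained` call is commented out here only because it downloads model files):

```python
# Hedged sketch: smaller N trades input resolution for lower GPU memory.
# N = 4 is the default and corresponds to a 2048x2048 input image.
N = 2
size = {"longest_edge": N * 512}  # here: 1024-pixel longest edge

# The dict is then passed when initializing the processor:
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "HuggingFaceTB/SmolVLM-500M-Instruct", size=size
# )
```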

Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but is not accurate. Misuse includes, but is not limited to:

  • Prohibited uses:
    • Evaluating or scoring individuals (e.g., in employment, education, or credit)
    • Critical automated decision-making
    • Generating unreliable factual content
  • Malicious activities:
    • Spam generation
    • Disinformation campaigns
    • Harassment or abuse
    • Unauthorized surveillance

License

SmolVLM is built on SigLIP as the image encoder and SmolLM2 as the text decoder.

We release the SmolVLM checkpoints under the Apache 2.0 license.

Training Details

Training Data

The training data comes from The Cauldron and Docmatix datasets, with an emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage of other key capabilities such as visual reasoning, chart comprehension, and general instruction following.

Citation Information

You can cite us as follows:

@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models}, 
  author={Andrés Marafioti and Orr Zohar and Miquel Farré and Merve Noyan and Elie Bakouch and Pedro Cuenca and Cyril Zakka and Loubna Ben Allal and Anton Lozhkov and Nouamane Tazi and Vaibhav Srivastav and Joshua Lochner and Hugo Larcher and Mathieu Morlon and Lewis Tunstall and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}

HuggingFaceTB/SmolVLM-500M-Instruct

Author: HuggingFaceTB

Tags: image-text-to-text · transformers
Downloads: 36K · Likes: 189

Created: 2025-01-20 14:24:51+00:00

Updated: 2025-04-08 07:26:28+00:00


Files (38)

.gitattributes
README.md
added_tokens.json
chat_template.json
config.json
generation_config.json
merges.txt
model.safetensors
onnx/decoder_model_merged.onnx ONNX
onnx/decoder_model_merged_bnb4.onnx ONNX
onnx/decoder_model_merged_fp16.onnx ONNX
onnx/decoder_model_merged_int8.onnx ONNX
onnx/decoder_model_merged_q4.onnx ONNX
onnx/decoder_model_merged_q4f16.onnx ONNX
onnx/decoder_model_merged_quantized.onnx ONNX
onnx/decoder_model_merged_uint8.onnx ONNX
onnx/embed_tokens.onnx ONNX
onnx/embed_tokens_bnb4.onnx ONNX
onnx/embed_tokens_fp16.onnx ONNX
onnx/embed_tokens_int8.onnx ONNX
onnx/embed_tokens_q4.onnx ONNX
onnx/embed_tokens_q4f16.onnx ONNX
onnx/embed_tokens_quantized.onnx ONNX
onnx/embed_tokens_uint8.onnx ONNX
onnx/vision_encoder.onnx ONNX
onnx/vision_encoder_bnb4.onnx ONNX
onnx/vision_encoder_fp16.onnx ONNX
onnx/vision_encoder_int8.onnx ONNX
onnx/vision_encoder_q4.onnx ONNX
onnx/vision_encoder_q4f16.onnx ONNX
onnx/vision_encoder_quantized.onnx ONNX
onnx/vision_encoder_uint8.onnx ONNX
preprocessor_config.json
processor_config.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json