<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM_256_banner.png" width="800" height="auto" alt="图片描述">
SmolVLM-256M
SmolVLM-256M is the smallest multimodal model in the world. It accepts arbitrary sequences of image and text inputs and produces text outputs. It is designed for efficiency. SmolVLM can answer questions about images, describe visual content, or transcribe text. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks. It can run inference on one image with under 1GB of GPU memory.
Model Summary
- Developed by: Hugging Face 🤗
- Model type: Multimodal model (image + text)
- Language(s): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
Resources
- Demo: SmolVLM-256 Demo
- Blog: Blog post
Uses
SmolVLM can be used for inference on multimodal (image + text) tasks where the input consists of a text query along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling grounded in visual content. The model does not support image generation.
To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
Technical Summary
SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several improvements compared to the larger SmolVLM 2.2B model:
- Image compression: We introduce more aggressive image compression than Idefics3 and SmolVLM-2.2B, enabling faster inference and lower memory usage.
- Visual token encoding: SmolVLM-256 uses 64 visual tokens to encode image patches of size 512×512. Larger images are divided into patches, each encoded separately, improving efficiency without compromising performance.
- New special tokens: We added new special tokens to separate the sub-images, making image tokenization more efficient.
- Smaller vision encoder: We went from a 400M-parameter siglip vision encoder to a much smaller 93M-parameter encoder.
- Larger image patches: We now pass 512×512 patches to the vision encoder, instead of the 384×384 used by the larger SmolVLM. This allows information to be encoded more efficiently.
For more details on training and architecture, see our technical report.
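As a rough back-of-the-envelope sketch of the patching scheme above (this ignores the downscaled global view and the separator special tokens that the real processor also adds, so actual counts are slightly higher):

```python
import math

def visual_token_estimate(width, height, patch=512, tokens_per_patch=64):
    """Estimate visual tokens for an image split into 512x512 patches,
    at 64 tokens per patch. Ignores the downscaled global view and the
    separator special tokens that the real processor also inserts."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows * tokens_per_patch

# The default processor setting resizes the longest edge to 2048 px,
# i.e. a 4x4 grid of 512x512 patches:
print(visual_token_estimate(2048, 2048))  # 16 patches * 64 = 1024 tokens
```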
How to get started
You can use transformers to load, run inference with, and fine-tune SmolVLM.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load the image
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceTB/SmolVLM-256M-Instruct",
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create the input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Can you describe this image?"}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
generated_ids,
skip_special_tokens=True,
)
print(generated_texts[0])
"""
Assistant: The image depicts a large, historic statue of liberty, located in New York City. The statue is a green, cylindrical structure with a human figure at the top, holding a torch. The statue is situated on a pedestal that resembles the statue of liberty, which is located on a small island in the middle of a body of water. The water surrounding the island is calm, reflecting the blue sky and the statue.
In the background, there are several tall buildings, including the Empire State Building, which is visible in the distance. These buildings are made of glass and steel, and they are positioned in a grid-like pattern, giving them a modern look. The sky is clear, with a few clouds visible, indicating fair weather.
The statue is surrounded by trees, which are green and appear to be healthy. There are also some small structures, possibly houses or buildings, visible in the distance. The overall scene suggests a peaceful and serene environment, typical of a cityscape.
The image is taken during the daytime, likely during the day of the statue's installation. The lighting is bright, casting a strong shadow on the statue and the water, which enhances the visibility of the statue and the surrounding environment.
To summarize, the image captures a significant historical statue of liberty, situated on a small island in the middle of a body of water, surrounded by trees and buildings. The sky is clear, with a few clouds visible, indicating fair weather. The statue is green and cylindrical, with a human figure holding a torch, and is surrounded by trees, indicating a peaceful and well-maintained environment. The overall scene is one of tranquility and historical significance.
"""
We also provide ONNX weights for the model, which you can run with ONNX Runtime as follows: <details>
<summary>Click here to see the sample code</summary>
from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
import onnxruntime
import numpy as np
# 1. Load models
## Load config and processor
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
## Load sessions
## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/vision_encoder.onnx
## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/embed_tokens.onnx
## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/decoder_model_merged.onnx
vision_session = onnxruntime.InferenceSession("vision_encoder.onnx")
embed_session = onnxruntime.InferenceSession("embed_tokens.onnx")
decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx")
## Set config values
num_key_value_heads = config.text_config.num_key_value_heads
head_dim = config.text_config.head_dim
num_hidden_layers = config.text_config.num_hidden_layers
eos_token_id = config.text_config.eos_token_id
image_token_id = config.image_token_id
# 2. Prepare inputs
## Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Can you describe this image?"}
]
},
]
## Load image and apply processor
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")
## Prepare decoder inputs
batch_size = inputs['input_ids'].shape[0]
past_key_values = {
f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
for layer in range(num_hidden_layers)
for kv in ('key', 'value')
}
image_features = None
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
position_ids = np.cumsum(inputs['attention_mask'], axis=-1)
# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
if image_features is None:
## Only compute vision features if not already computed
image_features = vision_session.run(
['image_features'], # List of output names or indices
{
'pixel_values': inputs['pixel_values'],
'pixel_attention_mask': inputs['pixel_attention_mask'].astype(np.bool_)
}
)[0]
## Merge text and vision embeddings
inputs_embeds[inputs['input_ids'] == image_token_id] = image_features.reshape(-1, image_features.shape[-1])
logits, *present_key_values = decoder_session.run(None, dict(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
**past_key_values,
))
## Update values for the next generation loop
input_ids = logits[:, -1].argmax(-1, keepdims=True)
attention_mask = np.ones_like(input_ids)
position_ids = position_ids[:, -1:] + 1
for j, key in enumerate(past_key_values):
past_key_values[key] = present_key_values[j]
generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
if (input_ids == eos_token_id).all():
break
## (Optional) Streaming
print(processor.decode(input_ids[0]), end='')
print()
# 4. Output result
print(processor.batch_decode(generated_tokens))
Example output:
The image depicts a large, historic statue of Liberty situated on a small island in a body of water. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of Liberty. The statue is mounted on a pedestal that is supported by a cylindrical tower. The pedestal is rectangular and appears to be made of stone or a similar material. The statue is surrounded by a large, flat, rectangular area that is likely a base for the statue.
In the background, there is a cityscape with a variety of buildings, including skyscrapers and high-rise buildings. The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom. The buildings are mostly modern, with a mix of glass and concrete. The buildings are densely packed, with many skyscrapers and high-rise buildings visible.
There are trees and greenery visible on the left side of the image, indicating that the statue is located near a park or a park area. The water in the foreground is calm, with small ripples indicating that the statue is in the water.
The overall scene suggests a peaceful and serene environment, likely a public park or a park area in a city. The statue is likely a representation of liberty, representing the city's commitment to freedom and democracy.
### Analysis and Description:
#### Statue of Liberty:
- **Location**: The statue is located on a small island in a body of water.
- **Statue**: The statue is a green cylindrical structure with a human figure at the top, which is the actual statue of Liberty.
- **Pedestal**: The pedestal is rectangular and supports the statue.
- **Pedestrian**: The pedestal is surrounded by a flat rectangular area.
- **Water**: The water is calm, with small ripples indicating that the statue is in the water.
#### Cityscape:
- **Buildings**: The buildings are modern, with a mix of glass and concrete.
- **Sky**: The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom.
- **Trees**: There are trees and greenery visible on the left side of the image, indicating that the statue is located near a park or a park area.
#### Environment:
- **Water**: The water is calm, with small ripples indicating that the statue is in the water.
- **Sky**: The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom.
### Conclusion:
The image depicts a peaceful and serene public park or park area in a city, with the statue of Liberty prominently featured. The cityscape in the background includes modern buildings and a clear sky, suggesting a well-maintained public space.<end_of_utterance>
</details>
Model optimizations
Precision: For better performance, load and run the model in half precision (torch.bfloat16) if your hardware supports it.
from transformers import AutoModelForVision2Seq
import torch
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceTB/SmolVLM-Instruct",
torch_dtype=torch.bfloat16
).to("cuda")
You can also load SmolVLM with 4/8-bit quantization using bitsandbytes, torchao, or Quanto. Refer to this page for other options.
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceTB/SmolVLM-Instruct",
quantization_config=quantization_config,
)
Vision encoder efficiency: You can adjust the image resolution by setting size={"longest_edge": N*512} when initializing the processor, where N is your desired value. The default N=4 works well, resulting in input images of size 2048×2048. Decreasing N saves GPU memory and is appropriate for lower-resolution images. It is also useful if you want to fine-tune on videos.
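For example, a minimal sketch of initializing the processor with N=2, which caps the longest image edge at 1024 px (half the default):

```python
from transformers import AutoProcessor

# N = 2: the longest image edge is resized to at most 2 * 512 = 1024 px,
# roughly quartering the number of visual tokens versus the default N = 4.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct",
    size={"longest_edge": 2 * 512},
)
print(processor.image_processor.size)
```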
Misuse and Out-of-scope Use
SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but is not accurate. Misuse includes, but is not limited to:
- Prohibited uses:
- Evaluating or scoring individuals (e.g., in employment, education, credit)
- Critical automated decision-making
- Generating unreliable factual content
- Malicious activities:
- Spam generation
- Disinformation campaigns
- Harassment or abuse
- Unauthorized surveillance
License
SmolVLM uses SigLIP as the image encoder and SmolLM2 as the text decoder.
We release the SmolVLM checkpoints under the Apache 2.0 license.
Training Details
Training Data
The training data comes from The Cauldron and Docmatix datasets, with an emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage of other crucial capabilities such as visual reasoning, chart comprehension, and general instruction following. <img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example image" style="width:90%;" />
Evaluation
| Size | Mathvista | MMMU | OCRBench | MMStar | AI2D | ChartQA_Test | Science_QA | TextVQA Val | DocVQA Val |
|---|---|---|---|---|---|---|---|---|---|
| 256M | 35.9 | 28.3 | 52.6 | 34.6 | 47 | 55.8 | 73.6 | 49.9 | 58.3 |
| 500M | 40.1 | 33.7 | 61 | 38.3 | 59.5 | 63.2 | 79.7 | 60.5 | 70.5 |
| 2.2B | 43.9 | 38.3 | 65.5 | 41.8 | 64 | 71.6 | 84.5 | 72.1 | 79.7 |
Citation information
You can cite us in the following way:
@unpublished{marafioti2025smolvlm,
title = {SmolVLM: Redefining small and efficient multimodal models},
author = {Marafioti, Andr\'{e}s and Zohar, Orr and Farr\'{e}, Miquel and Noyan, Merve and Bakouch, Elie and Cuenca, Pedro and Zakka, Cyril and Ben Allal, Loubna and Lozhkov, Anton and Tazi, Nouamane and Srivastav, Vaibhav and Lochner, Joshua and Larcher, Hugo and Morlon, Mathieu and Tunstall, Lewis and von Werra, Leandro and Wolf, Thomas},
year = {2025},
}