LLaVA-Onevision Model Card

You can also check out the Google Colab demo to run LLaVA on a free-tier Google Colab instance:
Below is the model card for the 0.5B LLaVA-Onevision model, copied from the original LLaVA-Onevision model card, which you can find here.
Model details
Model type: LLaVA-Onevision is an open-source multimodal LLM trained by fine-tuning Qwen2 on GPT-generated multimodal instruction-following data. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to video.
Model date: LLaVA-Onevision-0.5-ov was added in August 2024.
Paper or resources for more information: https://llava-vl.github.io/
- Architecture: SO400M + Qwen2
- Pretraining stage: LCS-558K, 1 epoch, projector only
- Mid stage: a mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- Final-image stage: a mixture of 3.6M single-image data, 1 epoch, full model
- OneVision stage: a mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- Precision: bfloat16
How to use the model
First, make sure you have transformers >= 4.45.0 installed, or install it from source.
The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template by applying the chat template:
Using pipeline:
Below we use the "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" checkpoint.
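As a sketch of what a multi-image prompt can look like (the URLs below are placeholders, not real files), each user turn may contain several image entries alongside the text:

```python
# A multi-image conversation structure; the URLs here are illustrative only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/first.jpg"},
            {"type": "image", "url": "https://example.com/second.jpg"},
            {"type": "text", "text": "What is the difference between these two images?"},
        ],
    },
]

# Count how many images the prompt carries.
num_images = sum(
    1
    for turn in messages
    for part in turn["content"]
    if part["type"] == "image"
)
print(num_images)
```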
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
{"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
],
},
]
out = pipe(text=messages, max_new_tokens=20)
print(out)
>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
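The pipeline returns a list with one dict per input, so extracting the answer string is just a key lookup. A minimal sketch that mimics the structure printed above:

```python
# Mimic the shape of the pipeline result (the "input_text" key is omitted
# for brevity); the answer is stored under "generated_text".
out = [{"generated_text": "Lava"}]
answer = out[0]["generated_text"]
print(answer)
```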
Using pure transformers:
Below is an example script to run generation in float16 precision on a GPU device:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "What are these?"},
{"type": "image"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
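The slicing in the decode call above exists because generate() returns the prompt tokens followed by the newly generated ones. A toy sketch with plain lists (no model involved, ids are made up) of stripping the prompt portion by its length before decoding:

```python
# Hypothetical token ids, for illustration only.
prompt_ids = [101, 5, 42]            # ids the prompt was encoded to
output_ids = prompt_ids + [7, 9, 2]  # what generate() conceptually returns
new_ids = output_ids[len(prompt_ids):]  # keep only the generated part
print(new_ids)
```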
From transformers>=v4.48, you can also pass image/video URLs or local paths to the conversation history and let the chat template handle the rest.
The chat template will load the image for you and return inputs as torch.Tensor, which you can pass directly to model.generate()
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
Model optimization
4-bit quantization through the bitsandbytes library
First make sure to install bitsandbytes (pip install bitsandbytes) and to have access to a CUDA-compatible GPU device. Simply change the snippet above with:
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ load_in_4bit=True
)
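As rough back-of-the-envelope arithmetic (weights only; activations, the vision tower, and framework overhead are not counted), 4-bit weights cut the language model's memory footprint to about a quarter of float16:

```python
# Approximate weight memory for a ~0.5B-parameter language model.
params = 0.5e9
fp16_gb = params * 2 / 1024**3    # float16: 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4-bit: 0.5 bytes per parameter
print(round(fp16_gb, 2), round(int4_gb, 2))
```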
Use Flash-Attention 2 to further speed up generation
First make sure to install flash-attn. Refer to the original Flash Attention repository for instructions on installing the package. Simply change the snippet above with:
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ use_flash_attention_2=True
).to(0)
Use with Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @huggingface/transformers
Example: Multi-round conversation with PKV caching
import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@huggingface/transformers';
// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
dtype: {
embed_tokens: 'fp16', // or 'fp32' or 'q8'
vision_encoder: 'fp16', // or 'fp32' or 'q8'
decoder_model_merged: 'q4', // or 'q8'
},
// device: 'webgpu',
});
// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
{ role: 'system', content: 'Answer the question.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);
// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);
// Generate response
const { past_key_values, sequences } = await model.generate({
...text_inputs,
...vision_inputs,
do_sample: false,
max_new_tokens: 64,
return_dict_in_generate: true,
});
// Decode output
const answer = tokenizer.decode(
sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.
const new_messages = [
...messages,
{ role: 'assistant', content: answer },
{ role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);
// Generate another response
const output = await model.generate({
...new_text_inputs,
past_key_values,
do_sample: false,
max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.
Citation
@misc{li2024llavaonevisioneasyvisualtask,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2408.03326},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.03326},
}
llava-hf/llava-onevision-qwen2-0.5b-ov-hf
Author: llava-hf
Created: 2024-08-13 08:28:18+00:00
Updated: 2025-06-18 13:57:09+00:00