LLaVA-Onevision Model Card

You can also check out the Google Colab demo to run LLaVA on a free-tier Google Colab instance:
Below is the model card for the 0.5B LLaVA-Onevision model, copied from the original LLaVA-Onevision model card, which you can find here.
Model details
Model type: LLaVA-Onevision is an open-source multimodal LLM trained by fine-tuning Qwen2 on GPT-generated multimodal instruction-following data. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to video.
Model date: LLaVA-Onevision-0.5-ov was added in August 2024.
Paper or resources for more information: https://llava-vl.github.io/
- Architecture: SO400M + Qwen2
- Pretraining stage: LCS-558K, 1 epoch, projector only
- Mid stage: a mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- Final-image stage: a mixture of 3.6M single-image data, 1 epoch, full model
- OneVision stage: a mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- Precision: bfloat16
How to use the model
First, make sure you have transformers >= 4.45.0 installed, or install it from source.
The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure to follow the correct prompt template by applying the chat template:
Using pipeline:
Below we use the "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" checkpoint.
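As a sketch of what a multi-image prompt can look like (the URLs below are placeholders, not real files), each user turn may contain several image entries alongside the text:

```python
# A multi-image conversation structure; the URLs here are illustrative only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/first.jpg"},
            {"type": "image", "url": "https://example.com/second.jpg"},
            {"type": "text", "text": "What is the difference between these two images?"},
        ],
    },
]

# Count how many images the prompt carries.
num_images = sum(
    1
    for turn in messages
    for part in turn["content"]
    if part["type"] == "image"
)
print(num_images)
```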
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
{"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
],
},
]
out = pipe(text=messages, max_new_tokens=20)
print(out)
>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
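The pipeline returns a list with one dict per input, so extracting the answer string is just a key lookup. A minimal sketch that mimics the structure printed above:

```python
# Mimic the shape of the pipeline result (the "input_text" key is omitted
# for brevity); the answer is stored under "generated_text".
out = [{"generated_text": "Lava"}]
answer = out[0]["generated_text"]
print(answer)
```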
Using pure transformers:
Below is an example script to run generation in float16 precision on a GPU device:
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "What are these?"},
{"type": "image"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
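The slicing in the decode call above exists because generate() returns the prompt tokens followed by the newly generated ones. A toy sketch with plain lists (no model involved, ids are made up) of stripping the prompt portion by its length before decoding:

```python
# Hypothetical token ids, for illustration only.
prompt_ids = [101, 5, 42]            # ids the prompt was encoded to
output_ids = prompt_ids + [7, 9, 2]  # what generate() conceptually returns
new_ids = output_ids[len(prompt_ids):]  # keep only the generated part
print(new_ids)
```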
From transformers>=v4.48, you can also pass image/video URLs or local paths to the conversation history and let the chat template handle the rest.
The chat template will load the image for you and return inputs as torch.Tensor, which you can pass directly to model.generate()
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
Model optimization
4-bit quantization through the bitsandbytes library
First make sure to install bitsandbytes (pip install bitsandbytes) and to have access to a CUDA-compatible GPU device. Simply change the snippet above with:
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ load_in_4bit=True
)
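As rough back-of-the-envelope arithmetic (weights only; activations, the vision tower, and framework overhead are not counted), 4-bit weights cut the language model's memory footprint to about a quarter of float16:

```python
# Approximate weight memory for a ~0.5B-parameter language model.
params = 0.5e9
fp16_gb = params * 2 / 1024**3    # float16: 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4-bit: 0.5 bytes per parameter
print(round(fp16_gb, 2), round(int4_gb, 2))
```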
Use Flash-Attention 2 to further speed up generation
First make sure to install flash-attn. Refer to the original Flash Attention repository for instructions on installing the package. Simply change the snippet above with:
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ use_flash_attention_2=True
).to(0)
Use with Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @huggingface/transformers
Example: Multi-round conversation with PKV caching
import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@huggingface/transformers';
// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
dtype: {
embed_tokens: 'fp16', // or 'fp32' or 'q8'
vision_encoder: 'fp16', // or 'fp32' or 'q8'
decoder_model_merged: 'q4', // or 'q8'
},
// device: 'webgpu',
});
// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
{ role: 'system', content: 'Answer the question.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);
// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);
// Generate response
const { past_key_values, sequences } = await model.generate({
...text_inputs,
...vision_inputs,
do_sample: false,
max_new_tokens: 64,
return_dict_in_generate: true,
});
// Decode output
const answer = tokenizer.decode(
sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.
const new_messages = [
...messages,
{ role: 'assistant', content: answer },
{ role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);
// Generate another response
const output = await model.generate({
...new_text_inputs,
past_key_values,
do_sample: false,
max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
{ skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.
Citation
@misc{li2024llavaonevisioneasyvisualtask,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2408.03326},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.03326},
}
llava-hf/llava-onevision-qwen2-0.5b-ov-hf
Author: llava-hf
Created: 2024-08-13 08:28:18+00:00
Updated: 2025-06-18 13:57:09+00:00