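README

LLaVA Interleave Model Card

Model details

Model type: LLaVA Interleave is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: Qwen/Qwen1.5-7B-Chat

Paper or resources for more information: https://llava-vl.github.io/

Primary intended uses: The primary use of LLaVA-Next Interleave is research on large multimodal models and chatbots. It is intended for research exploration only; commercial use is prohibited.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.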

How to use the model

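First, make sure you have transformers >= 4.35.3 installed. The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in your prompt. Be sure to follow the correct prompt template (<|im_start|>user xxx<|im_end|><|im_start|>assistant) and add the <image> token at each location where you want to query an image: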

Using `pipeline`:

The example below uses the "llava-hf/llava-interleave-qwen-0.5b-hf" checkpoint.

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
messages = [
    {
      "role": "user",
      "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
          {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
        ],
    },
]

out = pipe(text=messages, max_new_tokens=20)
print(out)
# Printed result:
# [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
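As the printed result above shows, the pipeline returns a list with one dict per input. A minimal sketch of pulling out just the answer follows; `sample_out` is a hand-written stand-in shaped like that result, and `extract_answers` is an illustrative helper, not part of `transformers`.

```python
from typing import Any, Dict, List


def extract_answers(pipeline_output: List[Dict[str, Any]]) -> List[str]:
    """Return the generated text of each result, with surrounding whitespace removed."""
    return [str(item["generated_text"]).strip() for item in pipeline_output]


# Stand-in with the same shape as the printed pipeline result (content shortened).
sample_out = [
    {
        "input_text": [{"role": "user", "content": []}],
        "generated_text": "Lava",
    }
]

answers = extract_answers(sample_out)
print(answers[0])
```

With a real pipeline call, `extract_answers(out)` would be used in place of the stand-in.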

Using pure `transformers`:

Below is an example script that runs generation in float16 precision on a GPU device:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
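The script above drops a fixed number of leading tokens with `[2:]` before decoding. A common alternative is to cut the output at the length of the input ids, so that only the newly generated tokens are decoded. The sketch below shows that slicing on plain Python lists; `drop_prompt_tokens` and the toy ids are illustrative and do not come from the model card.

```python
from typing import List, Sequence


def drop_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the ids that follow the first `prompt_length` ids."""
    if not 0 <= prompt_length <= len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return list(output_ids[prompt_length:])


# Toy ids: generate() output for a decoder-only model starts with the input ids.
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43, 2]

reply_ids = drop_prompt_tokens(generated, len(prompt_ids))
print(reply_ids)
```

In the script above this would correspond to something like `output[0][inputs["input_ids"].shape[1]:]`, assuming the generated sequence begins with the same ids that were passed in.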

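When prompting with video/3D/multi-view input, prompt as follows:

# if you downsampled n frames from the input
image_tokens = "<image>" * n
prompt = f"<|im_start|>user {image_tokens}\nWhat are these?<|im_end|><|im_start|>assistant"

# Using the chat template: if you sampled 5 frames, a single conversation turn must contain 5 images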
conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
          {"type": "image"},
          {"type": "image"},
          {"type": "image"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
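The model card does not say how frames should be downsampled. As one possible approach, the sketch below picks evenly spaced frame indices and builds a conversation with one `{"type": "image"}` entry per sampled frame. Both helper names and the sampling rule are assumptions made for illustration.

```python
from typing import Any, Dict, List


def sample_frame_indices(total_frames: int, n: int) -> List[int]:
    """Pick up to n roughly evenly spaced frame indices from range(total_frames)."""
    if total_frames <= 0 or n <= 0:
        raise ValueError("total_frames and n must be positive")
    n = min(n, total_frames)
    step = total_frames / n
    # Take the middle of each of the n equal-width segments.
    return [int(i * step + step / 2) for i in range(n)]


def build_frames_conversation(question: str, n_frames: int) -> List[Dict[str, Any]]:
    """Build a single user turn with the question and one image slot per frame."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": question}]
    content += [{"type": "image"} for _ in range(n_frames)]
    return [{"role": "user", "content": content}]


indices = sample_frame_indices(total_frames=100, n=5)
conversation = build_frames_conversation("What are these?", len(indices))
print(indices)
print(len(conversation[0]["content"]))
```

The frames at `indices` would then be passed as the `images` list to the processor, in the same order as the image slots in the conversation.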

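When prompting with interleaved images and videos, prompt as follows:

# two interleaved images
prompt = "<|im_start|>user <image><image>\nWhat is the difference between these two images?<|im_end|><|im_start|>assistant"

# two interleaved videos, if you downsampled n frames in total from both videos
image_tokens = "<image>" * n
prompt = f"<|im_start|>user {image_tokens}\nWhat are these?<|im_end|><|im_start|>assistant"

# The chat template for the interleaved format is the same as for video sampling. Just pass in as many images as your prompt needs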
conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What is the difference between these two images?"},
          {"type": "image"},
          {"type": "image"},
        ],
    },
]
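The raw prompt strings above all follow one pattern: a user turn, one `<image>` token per image, a newline, the question, and the start of the assistant turn. The sketch below wraps that pattern in a small function. `build_raw_prompt` is an illustrative helper, and it assumes the end-of-turn marker is `<|im_end|>`.

```python
IMAGE_TOKEN = "<image>"


def build_raw_prompt(question: str, n_images: int) -> str:
    """Build a single-turn prompt with n_images image tokens before the question."""
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    image_tokens = IMAGE_TOKEN * n_images
    return (
        f"<|im_start|>user {image_tokens}\n{question}"
        "<|im_end|><|im_start|>assistant"
    )


prompt = build_raw_prompt("What is the difference between these two images?", 2)
print(prompt)
```

Using `processor.apply_chat_template` remains the safer route, since the template shipped with the checkpoint is the reference for the exact format.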

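Starting with transformers >= v4.48, you can also pass an image URL or local path in the conversation history and let the chat template handle the rest. The chat template will load the image for you and return inputs as torch.Tensor, which you can pass directly to model.generate().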

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

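inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
# Move the inputs to the model's device and cast floating-point tensors to float16
inputs = inputs.to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=50)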
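When several images come from a mix of URLs and local files, the message list can be built programmatically. The sketch below does that. The `"url"` key is the one used in the example above, while the `"path"` key for local files is an assumption about what the chat template accepts and should be checked against the installed `transformers` version. The helper names and the file path are hypothetical.

```python
from typing import Any, Dict, List


def image_entry(source: str) -> Dict[str, str]:
    """Describe one image, keyed by "url" for web sources and "path" otherwise."""
    # Assumption: local files are passed under the "path" key.
    key = "url" if source.startswith(("http://", "https://")) else "path"
    return {"type": "image", key: source}


def build_messages(sources: List[str], question: str) -> List[Dict[str, Any]]:
    """Build one user turn with all images followed by the question."""
    content: List[Dict[str, Any]] = [image_entry(s) for s in sources]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_messages(
    ["https://www.ilankelman.org/stopsigns/australia.jpg", "local/example.jpg"],
    "What is the difference between these two images?",
)
print(messages[0]["content"])
```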

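Model optimization

4-bit quantization through the `bitsandbytes` library

First make sure bitsandbytes is installed (pip install bitsandbytes) and that you have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows: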

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
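As an alternative to the bare `load_in_4bit=True` flag, `transformers` also provides a `BitsAndBytesConfig` object that can be passed as `quantization_config`. The sketch below shows that variant under stated assumptions: `load_llava_4bit` is an illustrative wrapper, the float16 compute dtype is a choice made here rather than something the model card specifies, and the imports sit inside the function so that merely defining it does not require a GPU.

```python
def load_llava_4bit(model_id: str):
    """Load a LLaVA checkpoint with 4-bit weights via bitsandbytes (needs a CUDA GPU)."""
    # Imported lazily so this module can be imported without torch installed.
    import torch
    from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # assumption: compute in float16
    )
    return LlavaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_config,
        low_cpu_mem_usage=True,
    )


# Usage (downloads the checkpoint, so it is left commented out):
# model = load_llava_4bit("llava-hf/llava-interleave-qwen-0.5b-hf")
```

A quantized model is placed on the GPU during loading, so no `.to(0)` call follows it.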

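Use Flash-Attention 2 to further speed up generation

First make sure flash-attn is installed. Refer to the original Flash Attention repository for installation instructions. Then simply change the snippet above as follows: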

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
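Newer `transformers` releases accept an `attn_implementation` argument in `from_pretrained`, with `"flash_attention_2"` as one of its values, in place of the `use_flash_attention_2` flag. The sketch below picks a value depending on whether the `flash_attn` package is importable. The `"sdpa"` fallback and the helper name are assumptions, and whether a given implementation is supported for this model depends on the installed version.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, otherwise "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"


# Extra keyword arguments that could be forwarded to from_pretrained.
load_kwargs = {"attn_implementation": pick_attn_implementation()}
print(load_kwargs)
```

These would be passed as `LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, **load_kwargs)`.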

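License notices

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (Tongyi Qianwen LICENSE AGREEMENT and META LLAMA 3 COMMUNITY LICENSE AGREEMENT). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Bibtex citation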

@misc{li2024llavanextinterleavetacklingmultiimagevideo,
      title={LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models}, 
      author={Feng Li and Renrui Zhang and Hao Zhang and Yuanhan Zhang and Bo Li and Wei Li and Zejun Ma and Chunyuan Li},
      year={2024},
      eprint={2407.07895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.07895}, 
}

llava-hf/llava-interleave-qwen-0.5b-hf

Author: llava-hf

image-text-to-text transformers

Downloads: 169.9K, Likes: 36

Created: 2024-07-10 07:19:08+00:00

Updated: 2025-01-27 11:28:13+00:00

Files (39)

.gitattributes

LICENSE

README.md

added_tokens.json

chat_template.json

config.json

generation_config.json

merges.txt

model.safetensors

onnx/decoder_model_merged.onnx ONNX

onnx/decoder_model_merged_bnb4.onnx ONNX

onnx/decoder_model_merged_fp16.onnx ONNX

onnx/decoder_model_merged_int8.onnx ONNX

onnx/decoder_model_merged_q4.onnx ONNX

onnx/decoder_model_merged_q4f16.onnx ONNX

onnx/decoder_model_merged_quantized.onnx ONNX

onnx/decoder_model_merged_uint8.onnx ONNX

onnx/embed_tokens.onnx ONNX

onnx/embed_tokens_bnb4.onnx ONNX

onnx/embed_tokens_fp16.onnx ONNX

onnx/embed_tokens_int8.onnx ONNX

onnx/embed_tokens_q4.onnx ONNX

onnx/embed_tokens_q4f16.onnx ONNX

onnx/embed_tokens_quantized.onnx ONNX

onnx/embed_tokens_uint8.onnx ONNX

onnx/vision_encoder.onnx ONNX

onnx/vision_encoder_bnb4.onnx ONNX

onnx/vision_encoder_fp16.onnx ONNX

onnx/vision_encoder_int8.onnx ONNX

onnx/vision_encoder_q4.onnx ONNX

onnx/vision_encoder_q4f16.onnx ONNX

onnx/vision_encoder_quantized.onnx ONNX

onnx/vision_encoder_uint8.onnx ONNX

preprocessor_config.json

processor_config.json

special_tokens_map.json

tokenizer.json

tokenizer_config.json

vocab.json