说明文档

简介

本仓库托管了 Mistral-7B-Instruct-v0.2 的优化版本，用于通过 ONNX Runtime CUDA 执行提供程序加速推理。

请参阅使用说明，了解如何使用本仓库中托管的 ONNX 文件对此模型进行推理。

模型描述

开发者: MistralAI
模型类型: 预训练生成式文本模型
许可证: Apache 2.0 许可证
模型描述: 这是 Mistral-7B-Instruct-v0.2 的转换版本，用于带有 ROCM/MiGraphx 执行提供程序的 ONNX Runtime 推理。
提供的格式: ONNX-FP32

使用示例（如果你或你爸爸很有钱的话）：

按照基准测试说明进行操作。示例步骤：

克隆 onnxruntime 仓库。
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime

安装所需依赖
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt

使用手动模型 API 进行推理，或使用 Hugging Face 的 ORTModelForCausalLM
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

sess = InferenceSession("model.onnx", providers = ["CUDAExecutionProvider"]) //CUDAExecutionProvider 用于 cuda，rocm 用 ROCMExecutionProvider 或 MIGRAPHXExecutionProvider
config = AutoConfig.from_pretrained("Mistral-7B-Instruct-v0.2-onnx-fp32/") //tokenizer.json 的位置

model = ORTModelForCausalLM(sess, config, use_cache = True, use_io_binding = True)

tokenizer = AutoTokenizer.from_pretrained("Mistral-7B-Instruct-v0.2-onnx-fp32") //model.onnx 或 model_optimized.onnx 的位置

inputs = tokenizer("Instruct: 什么是费米悖论？\nOutput:", return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

示例（如果你或你爸爸并不有钱）：

我们从这里开始：

https://www.youtube.com/watch?v=1SJeLcI8otk&list=PLLruToFvdJEHh7tOTwvV4jjrvGu7syWdb&index=14&pp=iAQB https://www.youtube.com/watch?v=NpM0n6xBbrA&list=PLLruToFvdJEHh7tOTwvV4jjrvGu7syWdb&index=22&pp=iAQB

现在我们已经学会了主机端（即 CPU 端）的工作原理， https://www.youtube.com/watch?v=zfru8aHZ44M&list=PL5Q2soXY2Zi-qSKahS4ofaEwYl7_qp9mw&index=2&pp=iAQB https://www.youtube.com/watch?v=xz9DO-4Pkko&pp=ygUwZXRoIHp1cmljaCBjb21wdXRlciBhcmNoaXRlY3R1cmUgZ3B1IHByb2dyYW1taW5n 我们学习了 GPU 编程。

现在我们前往：对于 ROCm： https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html 对于 CUDA： https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html

现在我们要弄清楚如何使用统一内存，同时在不牺牲单一构建速度的情况下。然后我们前往： Pytorch： CUDA： https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp HIP： https://github.com/pytorch/pytorch/blob/main/c10/hip/HIPCachingAllocator.cpp（这个文件不存在，它会自动生成，请按照 git 上的说明操作）。

ONNX： CUDA || ROCM： https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/core/providers

我们找到并在统一内存结构中修改我们的代码。我们编译，我们成功，我们不会遇到内存不足的错误。砰！这是除你之外所有人的奇迹和贡献。你没有靠爸爸的钱就做到了。

现在按照爸爸有钱那部分的说明操作

aless2212/Mistral-7B-Instruct-v0.2-onnx-fp32

作者 aless2212

text-generation transformers

↓ 1 ♥ 0

创建时间: 2024-03-20 08:27:28+00:00

更新时间: 2024-03-23 20:44:14+00:00

在 Hugging Face 上查看

文件 (12)

.gitattributes

README.md

_model_layers.0_self_attn_rotary_emb_Constant_5_attr__value

_model_layers.0_self_attn_rotary_emb_Constant_attr__value

config.json

generation_config.json

model.onnx ONNX

model.onnx_data

special_tokens_map.json

tokenizer.json

tokenizer.model

tokenizer_config.json