README

DeepSeek-R1-Distill-Qwen ONNX models

https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX/resolve/main/deepseek-r1-distill-qwen-1.5B/gpu/gpu-int4-rtn-block-32/

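This repository hosts optimized versions of DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B to accelerate inference with ONNX Runtime. The optimized models are published in ONNX format and run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux, and Mac desktops, and mobile CPUs, with the precision best suited to each target.

To get started with the models easily, you can use our ONNX Runtime Generate() API. See the ONNX Runtime Generate() API instructions for details.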

For CPU:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/*" --local-dir .

# Install the CPU package of ONNX Runtime GenAI
pip install onnxruntime-genai

# Download the example chat script, then adjust the model directory (-m) accordingly
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/cpu-int4-rtn-block-32-acc-level-4/ -e cpu --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
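The download step can also be done from Python rather than the CLI. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `include_pattern` and `download_variant` are hypothetical, and the folder layout is inferred from the CLI command above.

```python
REPO_ID = "onnxruntime/DeepSeek-R1-Distill-ONNX"


def include_pattern(model: str, target: str) -> str:
    """Build the glob that selects one model/target folder of the repository."""
    return f"{model}/{target}/*"


def download_variant(model: str = "deepseek-r1-distill-qwen-1.5B",
                     target: str = "cpu_and_mobile",
                     local_dir: str = ".") -> str:
    """Download only the files matching the pattern; returns the local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[include_pattern(model, target)],
        local_dir=local_dir,
    )
```

Calling `download_variant(target="gpu")` would fetch the GPU folder instead, mirroring the `--include` argument used for the CUDA and DirectML commands.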

For CUDA:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/gpu/*" --local-dir .

# Install the CUDA package of ONNX Runtime GenAI
pip install onnxruntime-genai-cuda

# Download the example chat script, then adjust the model directory (-m) accordingly
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"

For DirectML:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/gpu/*" --local-dir .

# Install the DirectML package of ONNX Runtime GenAI
pip install onnxruntime-genai-directml

# Download the example chat script, then adjust the model directory (-m) accordingly
curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e dml --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
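To show roughly what a chat script does with the `--chat_template` argument, here is a minimal sketch of a single-turn generation loop. It assumes the `onnxruntime_genai` Python API as found in the 0.6 series (`Model`, `Tokenizer`, `GeneratorParams`, `Generator.append_tokens`); other versions may differ. The helper names `build_prompt` and `chat_once` are hypothetical, and this is not a copy of `model-chat.py`.

```python
CHAT_TEMPLATE = "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"


def build_prompt(user_text: str, template: str = CHAT_TEMPLATE) -> str:
    """Substitute the user's message for the {input} placeholder."""
    return template.format(input=user_text)


def chat_once(model_dir: str, user_text: str, max_length: int = 2048) -> str:
    """Generate one reply from the model stored in model_dir."""
    # Imported lazily so build_prompt can be used without the package installed.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(build_prompt(user_text)))

    pieces = []
    while not generator.is_done():
        generator.generate_next_token()
        token = generator.get_next_tokens()[0]
        pieces.append(stream.decode(token))  # decode incrementally, token by token
    return "".join(pieces)
```

A call such as `chat_once("/path/to/gpu-int4-rtn-block-32/", "What is 2 + 2?")` would return the generated text as one string; the path is the same placeholder used in the commands above.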

ONNX Models

Here are some of the optimized configurations we have added:

ONNX model for CPU and mobile using int4 quantization via RTN.
ONNX model for GPU using int4 quantization via RTN.
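As an illustration of what round-to-nearest (RTN) int4 quantization means, the sketch below quantizes a flat list of weights in blocks, with one scale per block. It is a simplified symmetric variant written in plain Python; the block size of 32 is taken from the folder names used above, and the scheme ONNX Runtime actually applies (zero points, packing, storage layout) may differ. The function names are hypothetical.

```python
def quantize_rtn_int4(weights, block_size=32):
    """Symmetric round-to-nearest int4 quantization with one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the signed 4-bit range.
        quantized = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize(blocks):
    """Recover approximate float weights from (scale, int4 values) blocks."""
    return [scale * q for scale, values in blocks for q in values]
```

With this scheme each weight is reproduced to within half a quantization step (`scale / 2`) of its original value, which is why outputs can differ slightly from the unquantized model.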

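Performance

ONNX enables you to run your models on-device across CPU, GPU, and NPU. With ONNX, you can run your models on any machine across all silicon (Qualcomm, AMD, Intel, Nvidia, etc.).

The table below shows key benchmarks on the Windows GPU and CPU devices on which the ONNX models were tested.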

Model	Format	Precision	Execution Provider	Device	Token Generation Throughput	Speedup vs Base Model
deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B	ONNX	fp16	CUDA	RTX 4090	197.195	4X
deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B	ONNX	int4	CUDA	RTX 4090	313.32	6.3X
deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B	ONNX	int4	CPU	Intel i9	11.749	1.4X
deepseek-ai_DeepSeek-R1-Distill-Qwen-7B	ONNX	fp16	CUDA	RTX 4090	57.316	1.3X
deepseek-ai_DeepSeek-R1-Distill-Qwen-7B	ONNX	int4	CUDA	RTX 4090	161.00	3.7X
deepseek-ai_DeepSeek-R1-Distill-Qwen-7B	ONNX	int4	CPU	Intel i9	3.184	20X
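The throughput and speedup columns can be combined to estimate the base-model throughput each row implies. The sketch below does that arithmetic; the function name `implied_baseline` is hypothetical, and the results are only rough, since the speedups are rounded and the base-model setup is not described here.

```python
# (model size, precision, execution provider, throughput, speedup) from the table
ROWS = [
    ("1.5B", "fp16", "CUDA", 197.195, 4.0),
    ("1.5B", "int4", "CUDA", 313.32, 6.3),
    ("1.5B", "int4", "CPU", 11.749, 1.4),
    ("7B", "fp16", "CUDA", 57.316, 1.3),
    ("7B", "int4", "CUDA", 161.00, 3.7),
    ("7B", "int4", "CPU", 3.184, 20.0),
]


def implied_baseline(throughput: float, speedup: float) -> float:
    """Base-model throughput implied by an ONNX throughput and its speedup."""
    return throughput / speedup


def summarize(rows=ROWS):
    """Return (label, implied baseline) pairs, rounded to three decimals."""
    return [
        (f"{size} {precision} {provider}", round(implied_baseline(tput, speedup), 3))
        for size, precision, provider, tput, speedup in rows
    ]
```

For the first row, `implied_baseline(197.195, 4.0)` gives 49.29875, so the fp16 CUDA result corresponds to a base model running at roughly a quarter of that throughput.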

CPU build specs:

onnxruntime-genai==0.6.0-dev
transformers==4.46.2
onnxruntime==1.20.1

CUDA build specs:

onnxruntime-genai-cuda==0.6.0-dev
transformers==4.46.2
onnxruntime-gpu==1.20.1
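To compare a local environment against these build specs, a small check with the standard library's `importlib.metadata` can help. This is a minimal sketch: the function name `check_versions` is hypothetical, and versions are compared by prefix on the assumption that a development build reports a version string beginning with `0.6.0`.

```python
from importlib import metadata

CPU_SPEC = {
    "onnxruntime-genai": "0.6.0",
    "transformers": "4.46.2",
    "onnxruntime": "1.20.1",
}

CUDA_SPEC = {
    "onnxruntime-genai-cuda": "0.6.0",
    "transformers": "4.46.2",
    "onnxruntime-gpu": "1.20.1",
}


def check_versions(spec):
    """Map each package to (installed version or None, matches the spec)."""
    report = {}
    for package, wanted in spec.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        matches = installed is not None and installed.startswith(wanted)
        report[package] = (installed, matches)
    return report
```

Running `check_versions(CPU_SPEC)` returns a dictionary that shows which packages are missing or at a different version than the one used for the benchmarks.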

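Model Description

Developed by: ONNX Runtime
Model type: ONNX
Language (NLP): Python, C, C++
License: MIT
Model description: This is a conversion of DeepSeek R1 for ONNX Runtime inference.
Disclaimer: This model is only an optimization of the base model; any risk associated with the model is the responsibility of its user. Please verify and test for your scenarios. After the optimizations are applied, outputs may differ slightly from those of the base model.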

Base Model Information

See the HF links DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B for details.

webnn/DeepSeek-R1-Distill-ONNX

Author: webnn

text-generation transformers

Created: 2025-05-07 01:42:36+00:00

Updated: 2025-05-07 02:16:11+00:00

Files (10)

.gitattributes

README.md

genai_config.json

model.onnx ONNX

model.onnx.data

onnx/model.onnx.data

onnx/model_fp16.onnx ONNX

special_tokens_map.json

tokenizer.json

tokenizer_config.json