license: other
license_name: insightface-non-commercial
license_link: https://github.com/deepinsight/insightface#license
tags:
- face-detection
- face-recognition
- scrfd
- arcface
- onnx
- batch-inference
- tensorrt
library_name: onnx
pipeline_tag: image-classification

InsightFace Batch-Optimized Models (Max Batch 64)

Re-exported InsightFace models with true dynamic batch support and no cross-frame contamination.

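⚠️ Version Differences

Repository	Max batch	Use case
alonsorobots/scrfd_320_batched	1-32	Standard use, thoroughly tested
This repository	1-64	Large-batch experiments

Recommendation: use a maximum batch size of 32 for best performance. Batch size 64 delivers similar throughput but uses more VRAM.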

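Why These Models?

The original InsightFace ONNX models have problems with batched inference:

buffalo_l detection model: batch size is hard-coded to 1
buffalo_l_batch detection model: broken; a reshape operation flattens the batch dimension, which causes cross-frame contamination

These re-exported models fix the dynamic_axes in the ONNX graph, which enables true batched inference.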
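
To illustrate the failure mode described above, the following NumPy sketch contrasts a reshape that flattens the batch dimension with one that preserves it. It is an illustration only: the tensor layout (`[N, anchors, H, W]`) and the function names are assumptions, not the actual operators in the InsightFace graph.

```python
import numpy as np

def flatten_broken(x: np.ndarray) -> np.ndarray:
    """Collapse everything, batch included: [N, A, H, W] -> [N*H*W*A, 1]."""
    return x.transpose(0, 2, 3, 1).reshape(-1, 1)

def flatten_batched(x: np.ndarray) -> np.ndarray:
    """Keep the batch axis: [N, A, H, W] -> [N, H*W*A, 1]."""
    return x.transpose(0, 2, 3, 1).reshape(x.shape[0], -1, 1)

# Two hypothetical frames, 2 anchors on a 4x4 feature map.
x = np.arange(2 * 2 * 4 * 4, dtype=np.float32).reshape(2, 2, 4, 4)

broken = flatten_broken(x)    # rows from both frames share a single axis
batched = flatten_batched(x)  # frame i stays addressable as batched[i]

print(broken.shape)   # (64, 1)
print(batched.shape)  # (2, 32, 1)
```

With the flattened layout, downstream code that assumes a single frame reads rows belonging to other frames. Keeping the leading batch axis avoids this.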

Models

Model	Task	Input shape	Output	Batch	Speedup
`scrfd_10g_320_batch64.onnx`	Face detection	`[N, 3, 320, 320]`	Bounding boxes, keypoints	1-64	6×
`arcface_w600k_r50_batch64.onnx`	Face embedding	`[N, 3, 112, 112]`	512-dimensional vector	1-64	10×

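Performance (TensorRT FP16, RTX 5090)

Batch Size Comparison (full video, 12,263 frames)

Batch size	FPS	Relative performance
16	2,007	1.00×
32	2,097	1.05× ✅ Optimal
64	2,034	1.01×

Key finding: batch size 32 is optimal. Batch size 64 provides no additional gain because GPU memory bandwidth is saturated.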

With Pipelined Preprocessing (4 workers)

Configuration	FPS	Speedup
Sequential, batch size 16	1,211	Baseline
Pipelined, batch size 32	2,097	1.73×
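
The pipelined configuration above can be approximated as follows. This is a minimal sketch, not the author's benchmark code: it assumes frames are already resized to 320×320 HWC `uint8`, that preprocessing uses the usual SCRFD normalization `(x - 127.5) / 128`, and that a thread pool is an acceptable stand-in for the 4 workers. `preprocess`, `batched`, `run_pipeline`, and `infer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Assumed preprocessing: HWC uint8 -> CHW float32, normalized."""
    x = (frame.astype(np.float32) - 127.5) / 128.0
    return x.transpose(2, 0, 1)

def batched(items, size):
    """Group an iterable of arrays into stacked batches of at most `size`."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == size:
            yield np.stack(buf)
            buf = []
    if buf:
        yield np.stack(buf)  # final partial batch

def run_pipeline(frames, infer, batch_size=32, workers=4):
    """Preprocess in worker threads while the caller's thread runs inference."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so batches stay in frame order.
        for batch in batched(pool.map(preprocess, frames), batch_size):
            results.append(infer(batch))
    return results
```

In practice `infer` would wrap `sess.run(None, {"input.1": batch})`. Note that `Executor.map` submits every frame up front, so a long video would need a bounded queue or chunked submission to keep memory in check.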

Usage

import numpy as np
import onnxruntime as ort

# Load the model (TensorRT first, falling back to CUDA)
sess = ort.InferenceSession(
    "scrfd_10g_320_batch64.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

# Batched inference (any batch size from 1 to 64)
batch = np.random.randn(32, 3, 320, 320).astype(np.float32)
outputs = sess.run(None, {"input.1": batch})

# outputs[0-2]: scores for each FPN level (strides 8, 16, 32)
# outputs[3-5]: bounding boxes for each FPN level
# outputs[6-8]: keypoints for each FPN level
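
The nine raw outputs are easier to work with once grouped by FPN level. The sketch below follows the output order given in the comments above; `group_by_stride` and `select_frame` are hypothetical helpers, and the dummy arrays (2 anchors per cell, trailing dimensions of 1, 4, and 10) are assumptions used only so the example runs without the model file.

```python
import numpy as np

STRIDES = (8, 16, 32)  # FPN levels, in the output order described above

def group_by_stride(outputs):
    """Map the 9 raw outputs to {stride: {"scores", "bboxes", "kps"}}."""
    n = len(STRIDES)
    if len(outputs) != 3 * n:
        raise ValueError(f"expected {3 * n} outputs, got {len(outputs)}")
    return {
        stride: {
            "scores": outputs[i],
            "bboxes": outputs[i + n],
            "kps": outputs[i + 2 * n],
        }
        for i, stride in enumerate(STRIDES)
    }

def select_frame(grouped, index):
    """Slice one frame's predictions out of a batched result."""
    return {
        stride: {name: arr[index] for name, arr in level.items()}
        for stride, level in grouped.items()
    }

# Dummy outputs standing in for sess.run(...); shapes are assumed, see above.
batch_size = 4
anchors = [(320 // s) ** 2 * 2 for s in STRIDES]
dummy = (
    [np.zeros((batch_size, a, 1), np.float32) for a in anchors]
    + [np.zeros((batch_size, a, 4), np.float32) for a in anchors]
    + [np.zeros((batch_size, a, 10), np.float32) for a in anchors]
)
grouped = group_by_stride(dummy)
frame0 = select_frame(grouped, 0)
print(frame0[8]["bboxes"].shape)  # (3200, 4)
```

Because the batch axis comes first in every output, per-frame post-processing (score thresholding, box decoding, NMS) can run on `select_frame(grouped, i)` independently for each `i`.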

TensorRT Configuration

When using TensorRT, set the profile shapes to cover the batch range you need:

providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_profile_min_shapes": "input.1:1x3x320x320",
        "trt_profile_opt_shapes": "input.1:32x3x320x320",  # optimize for batch size 32
        "trt_profile_max_shapes": "input.1:64x3x320x320",  # allow up to 64
    }),
    "CUDAExecutionProvider",
]
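
The profile strings follow a fixed `name:NxCxHxW` pattern, so they can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of ONNX Runtime. Only the three `trt_profile_*_shapes` option keys and the `input.1` name come from the configuration above.

```python
def trt_profile_options(input_name, chw, min_batch=1, opt_batch=32, max_batch=64):
    """Build the TensorRT profile-shape options for one dynamic-batch input."""
    if not (1 <= min_batch <= opt_batch <= max_batch):
        raise ValueError("need 1 <= min_batch <= opt_batch <= max_batch")
    c, h, w = chw

    def shape(n):
        return f"{input_name}:{n}x{c}x{h}x{w}"

    return {
        "trt_profile_min_shapes": shape(min_batch),
        "trt_profile_opt_shapes": shape(opt_batch),
        "trt_profile_max_shapes": shape(max_batch),
    }

detector_opts = {
    "trt_fp16_enable": True,
    "trt_engine_cache_enable": True,
    **trt_profile_options("input.1", (3, 320, 320)),
}
print(detector_opts["trt_profile_opt_shapes"])  # input.1:32x3x320x320
```

For the ArcFace model the same helper would take `(3, 112, 112)` and whatever input name that model reports; this card does not state it, so it should be read from the model rather than guessed.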

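Verified: No Batch Contamination

# The same frame processed alone vs. inside a batch gives the same result (to within 1e-5).
# `frame` is one preprocessed float32 array of shape (3, 320, 320);
# `sess` and `batch` are defined in the usage example above.
single_output = sess.run(None, {"input.1": frame[np.newaxis, ...]})
batch[7] = frame
batch_output = sess.run(None, {"input.1": batch})

# Compare the single-frame result with slot 7 of the batched result
max_diff = np.max(np.abs(single_output[0][0] - batch_output[0][7]))
# max_diff < 1e-5 ✓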
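
The same check can be extended to every batch position and every output. The following is a sketch under the assumption that all outputs are batch-first; `max_batch_deviation` is a hypothetical helper, and the two stand-in "models" exist only so the example runs without an ONNX file.

```python
import numpy as np

def max_batch_deviation(run, batch):
    """Largest abs difference between batched and one-at-a-time outputs.

    `run` is any callable mapping an [N, ...] array to a list of
    batch-first arrays, e.g. lambda x: sess.run(None, {"input.1": x}).
    """
    batched = run(batch)
    worst = 0.0
    for i in range(batch.shape[0]):
        single = run(batch[i:i + 1])
        for out_b, out_s in zip(batched, single):
            worst = max(worst, float(np.max(np.abs(out_b[i] - out_s[0]))))
    return worst

# Stand-in "models" so the sketch runs without an ONNX file.
def clean(x):
    return [x.mean(axis=(1, 2, 3))]  # each result depends on its own frame only

def leaky(x):
    return [x.mean(axis=(1, 2, 3)) + x.mean()]  # depends on the whole batch

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 3, 32, 32)).astype(np.float32)
print(max_batch_deviation(clean, batch) < 1e-5)  # True
print(max_batch_deviation(leaky, batch) < 1e-5)  # False
```

A model with cross-frame contamination behaves like `leaky`: its output for a frame changes depending on what else is in the batch.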

Re-export Process

These models were re-exported from InsightFace's PyTorch source code using MMDetection, with dynamic_axes set correctly:

dynamic_axes = {
    "input.1": {0: "batch"},
    "score_8": {0: "batch"},
    "score_16": {0: "batch"},
    # ... all outputs
}
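
Since every tensor gets the same entry, the mapping can be built from the list of tensor names. This is a small sketch: `make_dynamic_axes` is a hypothetical helper, and only the names shown above are used because the card elides the remaining outputs.

```python
def make_dynamic_axes(input_names, output_names, axis=0, label="batch"):
    """Mark `axis` as dynamic for every listed input and output tensor."""
    return {name: {axis: label} for name in [*input_names, *output_names]}

# Only the tensor names that appear in this card; the full output list is elided there.
dynamic_axes = make_dynamic_axes(["input.1"], ["score_8", "score_16"])
print(dynamic_axes)
# {'input.1': {0: 'batch'}, 'score_8': {0: 'batch'}, 'score_16': {0: 'batch'}}
```

The resulting dict is what would be passed as the `dynamic_axes` argument of `torch.onnx.export`, alongside matching `input_names` and `output_names`.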

License

Non-commercial research use only, per the InsightFace license.

For commercial licensing, contact: recognition-oss-pack@insightface.ai

Acknowledgments

Original models: InsightFace, by Jia Guo et al.
SCRFD paper: Sample and Computation Redistribution for Efficient Face Detection
ArcFace paper: ArcFace: Additive Angular Margin Loss for Deep Face Recognition

alonsorobots/scrfd_320_batched_64

Author: alonsorobots

image-classification onnx

Created: 2025-12-02 20:27:26+00:00

Updated: 2025-12-02 20:27:30+00:00

Files (4)

.gitattributes

README.md

arcface_w600k_r50_batch64.onnx ONNX

scrfd_10g_320_batch64.onnx ONNX