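README

Web Attack Detection Model

A deep learning model based on CodeBERT for detecting malicious web requests and attack payloads. It can identify SQL injection, XSS, path traversal, command injection, and other common web attack patterns.

Model Description

The model is fine-tuned from microsoft/codebert-base to perform binary classification of web requests as benign or malicious.

Model Architecture

Base model: CodeBERT (RoBERTa-base architecture)
Task: binary text classification
Parameters: 124.6M
Maximum sequence length: 256 tokens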

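Performance Metrics

Metric	Training set	Test set (125K)	2,000-sample test
Accuracy	99.30%	99.38%	99.60%
Precision	-	99.47%	99.80%
Recall	-	99.21%	99.40%
F1 score	-	99.34%	99.60%

Confusion Matrix (Test Set)

	Predicted benign	Predicted malicious
Actual benign	65,914	312
Actual malicious	464	58,491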
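
The test-set column of the metrics table can be re-derived from the confusion matrix. The sketch below does that arithmetic, treating "malicious" as the positive class (an assumption that matches the reported precision and recall); the false positive rate is an extra figure not listed in the table.

```python
# Confusion-matrix counts from the test-set table above
tn, fp = 65_914, 312      # actual benign: predicted benign / predicted malicious
fn, tp = 464, 58_491      # actual malicious: predicted benign / predicted malicious

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)   # share of benign requests flagged as malicious

print(f"samples={total}")
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
print(f"false positive rate={false_positive_rate:.2%}")
```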

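Training Details

Dataset

Total samples: 625,904
Training samples: 500,722 (80%)
Test samples: 125,181 (20%)
Class distribution: roughly balanced (47% malicious, 53% benign)
Sampling strategy: balanced sampling with WeightedRandomSampler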
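
The training script itself is not shown here, so the following is only a sketch of the idea behind balanced sampling: each sample is weighted by the inverse of its class frequency, so that draws with replacement come out close to 50/50. It uses only the standard library, and the toy labels are hypothetical.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Give every sample a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical toy labels mirroring the reported 53% benign (0) / 47% malicious (1) split
labels = [0] * 530 + [1] * 470
weights = inverse_frequency_weights(labels)

# Draw with replacement, as a weighted sampler does; each class has total weight 1.0
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
malicious_share = sum(drawn) / len(drawn)
print(f"malicious share after weighted sampling: {malicious_share:.3f}")
```

In PyTorch, a weight list like this would typically be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` and the sampler handed to the `DataLoader`.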

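Training Configuration

Parameter	Value
Epochs	3
Batch size	8
Gradient accumulation steps	4
Effective batch size	32
Learning rate	2e-5
Warmup steps	500
Weight decay	0.01
Maximum sequence length	256
Optimizer	AdamW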
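
To make the relationship between these settings concrete, the sketch below works out the effective batch size and the approximate number of optimizer steps. Two points are assumptions rather than facts from the source: that one epoch draws as many samples as the training set holds, and that warmup is followed by linear decay to zero (only the 500 warmup steps are documented).

```python
import math

train_samples = 500_722   # from the dataset section
batch_size = 8
grad_accum_steps = 4
epochs = 3
warmup_steps = 500
peak_lr = 2e-5

effective_batch = batch_size * grad_accum_steps
micro_batches_per_epoch = math.ceil(train_samples / batch_size)
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
total_optimizer_steps = optimizer_steps_per_epoch * epochs

def lr_at(step):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_optimizer_steps - step
    return peak_lr * max(0.0, remaining / (total_optimizer_steps - warmup_steps))

print(effective_batch, micro_batches_per_epoch, optimizer_steps_per_epoch, total_optimizer_steps)
print(lr_at(250), lr_at(500), lr_at(total_optimizer_steps))
```

Under these assumptions, warmup covers only about 1% of training, so nearly all updates run at or below the peak learning rate on the decay side.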

Training Progress

Epoch	Training loss	Training accuracy	Test loss	Test accuracy	F1 score
1	0.0289	98.84%	0.0192	99.09%	0.9904
2	0.0201	99.24%	0.0169	99.08%	0.9903
3	0.0175	99.30%	0.0274	99.38%	0.9934
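
The table shows the lowest test loss at epoch 2 but the highest test accuracy and F1 at epoch 3. Which criterion was used to pick `best_model.pt` is not stated; the sketch below simply shows that the two criteria disagree on this run.

```python
# Per-epoch test metrics copied from the table above
history = [
    {"epoch": 1, "test_loss": 0.0192, "test_acc": 0.9909, "f1": 0.9904},
    {"epoch": 2, "test_loss": 0.0169, "test_acc": 0.9908, "f1": 0.9903},
    {"epoch": 3, "test_loss": 0.0274, "test_acc": 0.9938, "f1": 0.9934},
]

best_by_f1 = max(history, key=lambda e: e["f1"])
best_by_loss = min(history, key=lambda e: e["test_loss"])
print("best epoch by F1:", best_by_f1["epoch"])
print("best epoch by test loss:", best_by_loss["epoch"])
```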

Hardware

GPU: NVIDIA Tesla T4 (16 GB)
Training time: about 24 hours

Model Files

File	Size	Description
`best_model.pt`	1.4 GB	PyTorch checkpoint (full precision)
`model.onnx`	476 MB	ONNX model (full precision)
`model_quantized.onnx`	120 MB	ONNX model (INT8 quantized)
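
As a rough sanity check, the file sizes can be compared with the 124.6M parameter count. The sketch assumes the listed sizes are in binary units and ignores graph metadata and quantization scales. The observation that the checkpoint is about three times the float32 weights would be consistent with it also holding optimizer state, but that is a guess; the source does not say what `best_model.pt` contains.

```python
params = 124.6e6  # parameter count from the architecture section
MIB = 1024 ** 2

fp32_mib = params * 4 / MIB   # 4 bytes per float32 weight
int8_mib = params * 1 / MIB   # 1 byte per int8 weight (ignores scales and layers kept in float)
checkpoint_ratio = 1.4 * 1024 / fp32_mib   # 1.4 GB checkpoint vs. float32 weights

print(f"float32 weights: {fp32_mib:.0f} MiB")
print(f"int8 weights:    {int8_mib:.0f} MiB")
print(f"checkpoint / float32 weights: {checkpoint_ratio:.1f}x")
```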

Usage

Quick Start with ONNX Runtime

import numpy as np
import onnxruntime as ort
from transformers import RobertaTokenizer

# Load the tokenizer and the ONNX model (falls back to CPU if CUDA is unavailable)
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
session = ort.InferenceSession("model_quantized.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

# Classify a single payload
def predict(payload: str) -> dict:
    # Pad or truncate to the 256-token length used in training
    inputs = tokenizer(
        payload,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_tensors='np'
    )

    outputs = session.run(
        None,
        {
            'input_ids': inputs['input_ids'].astype(np.int64),
            'attention_mask': inputs['attention_mask'].astype(np.int64)
        }
    )

    # First output, first row of the batch: [benign, malicious] scores,
    # assumed to be probabilities already, as the sample output below suggests
    probs = outputs[0][0]
    pred_idx = int(np.argmax(probs))

    return {
        "prediction": "malicious" if pred_idx == 1 else "benign",
        "confidence": float(probs[pred_idx]),
        "probabilities": {
            "benign": float(probs[0]),
            "malicious": float(probs[1])
        }
    }

# Example usage
result = predict("SELECT * FROM users WHERE id=1 OR 1=1--")
print(result)
# {'prediction': 'malicious', 'confidence': 0.9355, 'probabilities': {'benign': 0.0645, 'malicious': 0.9355}}
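
The quick start treats the first ONNX output as probabilities and decides by argmax. If an export emits raw logits instead, or if a stricter decision threshold is wanted to reduce false positives, a small post-processing step along these lines could be used. The helper names, the `threshold` parameter, and the logits are hypothetical, not part of the published code.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def decide(scores: np.ndarray, threshold: float = 0.5, already_probs: bool = False) -> dict:
    """Turn a [benign, malicious] score pair into a thresholded decision."""
    probs = scores if already_probs else softmax(scores)
    p_malicious = float(probs[1])
    return {
        "prediction": "malicious" if p_malicious >= threshold else "benign",
        "probabilities": {"benign": float(probs[0]), "malicious": p_malicious},
    }

# Hypothetical raw logits, not real model output
print(decide(np.array([-1.0, 2.0])))
print(decide(np.array([-1.0, 2.0]), threshold=0.99))
```

Raising the threshold trades recall for fewer false alarms, which may suit a blocking WAF better than an alert-only monitor.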

Using PyTorch

import torch
import torch.nn as nn
from transformers import RobertaTokenizer, RobertaModel

class CodeBERTClassifier(nn.Module):
    """CodeBERT encoder followed by dropout and a linear classification head."""

    def __init__(self, model_path="microsoft/codebert-base", num_labels=2, dropout=0.1):
        super().__init__()
        self.codebert = RobertaModel.from_pretrained(model_path)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.codebert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.codebert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CodeBERTClassifier()
checkpoint = torch.load("best_model.pt", map_location=device)
# Assumption: the file is either a bare state_dict or a training checkpoint
# that wraps the weights under a "model_state_dict" key
state_dict = checkpoint.get("model_state_dict", checkpoint)
model.load_state_dict(state_dict)
model.eval()
model.to(device)

# Load the tokenizer
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Classify a single payload
def predict(payload: str) -> dict:
    inputs = tokenizer(
        payload,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    ).to(device)

    with torch.no_grad():
        logits = model(inputs['input_ids'], inputs['attention_mask'])
        probs = torch.softmax(logits, dim=-1)[0]

    pred_idx = torch.argmax(probs).item()

    return {
        "prediction": "malicious" if pred_idx == 1 else "benign",
        "confidence": probs[pred_idx].item()
    }

# Example
result = predict("<script>alert('xss')</script>")
print(result)
# {'prediction': 'malicious', 'confidence': 0.9998}

FastAPI Server

Installation

pip install onnxruntime-gpu transformers fastapi uvicorn pydantic numpy

Starting the Server

# GPU mode (recommended)
python server_onnx.py --device gpu --quantized --port 8000

# CPU mode
python server_onnx.py --device cpu --quantized --port 8000
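
`server_onnx.py` is not reproduced in this README, so the following is a hypothetical re-creation of how flags like these could map to an ONNX Runtime provider list and a model file. The defaults and the function names are assumptions; only the flag names, the provider names, and the two model file names come from this document.

```python
import argparse

def parse_args(argv=None):
    """Parse flags shaped like the commands above (hypothetical re-creation)."""
    parser = argparse.ArgumentParser(description="ONNX inference server flags (sketch)")
    parser.add_argument("--device", choices=["gpu", "cpu"], default="cpu")
    parser.add_argument("--quantized", action="store_true")
    parser.add_argument("--port", type=int, default=8000)
    return parser.parse_args(argv)

def resolve_runtime(args):
    """Map flags to an ONNX Runtime provider list and a model file name."""
    providers = ["CPUExecutionProvider"]
    if args.device == "gpu":
        # Keep CPU as a fallback in case CUDA is unavailable
        providers.insert(0, "CUDAExecutionProvider")
    model_file = "model_quantized.onnx" if args.quantized else "model.onnx"
    return providers, model_file

args = parse_args(["--device", "gpu", "--quantized", "--port", "8000"])
print(resolve_runtime(args), args.port)
```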

API Endpoints

Health Check

curl http://localhost:8000/health

Single Prediction

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"payload": "SELECT * FROM users WHERE id=1 OR 1=1--"}'

Response:

{
  "payload": "SELECT * FROM users WHERE id=1 OR 1=1--",
  "prediction": "malicious",
  "confidence": 0.9355,
  "probabilities": {"benign": 0.0645, "malicious": 0.9355},
  "inference_time_ms": 15.23
}

Batch Prediction

curl -X POST http://localhost:8000/batch_predict \
  -H "Content-Type: application/json" \
  -d '{"payloads": ["<script>alert(1)</script>", "GET /api/users HTTP/1.1"]}'
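
The same calls can be made from Python. This is a minimal client sketch using only the standard library; the helper names are hypothetical, and since the batch response format is not documented above, the sketch returns whatever JSON the server sends without assuming its shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the curl examples

def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_json(path: str, body: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Building the requests does not contact the server
single = build_request("/predict", {"payload": "SELECT * FROM users WHERE id=1 OR 1=1--"})
batch = build_request("/batch_predict", {"payloads": ["<script>alert(1)</script>", "GET /api/users HTTP/1.1"]})
print(single.get_method(), single.full_url)
print(batch.get_method(), batch.full_url)
```

With the server running, `post_json("/predict", {"payload": "..."})` would return the decoded response dictionary.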

Docker Deployment

GPU Version

# Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install onnxruntime-gpu transformers fastapi uvicorn pydantic numpy

WORKDIR /app
COPY model_quantized.onnx ./models/
COPY server_onnx.py .

EXPOSE 8000
CMD ["python3", "server_onnx.py", "--device", "gpu", "--quantized"]

CPU Version

# Dockerfile.cpu
FROM python:3.10-slim

RUN pip install onnxruntime transformers fastapi uvicorn pydantic numpy

WORKDIR /app
COPY model_quantized.onnx ./models/
COPY server_onnx.py .

EXPOSE 8000
CMD ["python", "server_onnx.py", "--device", "cpu", "--quantized"]

Docker Compose

version: '3.8'
services:
  web-attack-detector:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

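Detectable Attack Types

The model can detect a range of web attack patterns, including:

Attack type	Example
SQL injection	`' OR '1'='1' --`
Cross-site scripting (XSS)	`<script>alert(document.cookie)</script>`
Path traversal	`../../etc/passwd`
Command injection	`; cat /etc/passwd`
LDAP injection	`)(uid=))(`
XML injection	`<?xml version="1.0"?><!DOCTYPE foo>`
Server-side template injection (SSTI)	`{{7*7}}`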

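Limitations

The model was trained on specific attack patterns and may miss novel or obfuscated attacks
The maximum input length is 256 tokens; longer payloads are truncated
The model may produce false positives on legitimate requests that resemble attack patterns
Performance may vary across different types of web applications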
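
Because anything past 256 tokens is truncated, an attack string placed late in a long payload would never reach the model. One possible mitigation, which is not something the published code is stated to do, is to score overlapping windows and flag the payload by its most suspicious window. The sketch below shows only the windowing logic on plain token-id lists; the function names, the stride, and the max-aggregation rule are all assumptions.

```python
def sliding_windows(token_ids, max_len=256, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit max_len with special tokens."""
    body = max_len - num_special          # room left after the start and end tokens
    if len(token_ids) <= body:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break
        start += stride
    return windows

def aggregate_malicious(window_probs):
    """Flag the payload by its most suspicious window."""
    return max(window_probs)

# Hypothetical token ids standing in for a 600-token payload
ids = list(range(600))
chunks = sliding_windows(ids)
print(len(chunks), [len(c) for c in chunks])
```

Each window would still need the tokenizer's special tokens added before inference, and scoring several windows multiplies latency, so this fits offline analysis better than an inline WAF path.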

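Ethical Considerations

This model is intended for defensive security purposes only, including:

Web application firewalls (WAF)
Intrusion detection systems (IDS)
Security monitoring and alerting
Penetration testing and security assessment

Do not use this model for malicious purposes.

License

This model is released under the MIT License.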

Citation

If you use this model in your research or applications, please cite:

@misc{web-attack-detection-codebert,
  author = {Your Name},
  title = {Web Attack Detection Model based on CodeBERT},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/your-username/web-attack-detection}},
  note = {Fine-tuned CodeBERT model for detecting malicious web requests}
}

@article{feng2020codebert,
  title = {CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
  author = {Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and Zhou, Ming},
  journal = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year = {2020},
  pages = {1536--1547},
  doi = {10.18653/v1/2020.findings-emnlp.139}
}

@article{liu2019roberta,
  title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
  author = {Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal = {arXiv preprint arXiv:1907.11692},
  year = {2019}
}

Acknowledgements

Microsoft CodeBERT for the pretrained model
Hugging Face Transformers for the model framework
ONNX Runtime for efficient inference

redauzhang/common-injection-payload-classfication

Author: redauzhang

text-classification

Created: 2025-12-17 02:32:09+00:00

Updated: 2025-12-17 03:14:41+00:00

Files (14)

.gitattributes
README.md
best_model.pt
export_onnx_quantized.py
model.onnx
model_quantized.onnx
onnx_config.json
requirements_onnx.txt
requirements_onnx_cpu.txt
server_onnx.py
test_onnx_accuracy.py
tokenizer_config.json
train_new_model.py
training_results.json