README

DistilBERT Spam Classifier

A phishing email detection model fine-tuned from DistilBERT on the Phishing Emails Dataset. The model is optimized for accurate detection of spam and phishing emails.

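Model Overview

Base model: DistilBERT
Fine-tuning: performed on a phishing email dataset to classify emails as spam (1) or non-spam (0).
Format: provided in ONNX format for efficient deployment.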

Architecture

The model adds a custom classification head on top of DistilBERT:

import torch.nn as nn


class DistilBERTSpamClassifier(nn.Module):
    def __init__(self, distilbert):
        super().__init__()
        self.distilbert = distilbert          # pretrained DistilBERT encoder
        self.dropout = nn.Dropout(0.1)        # regularization
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)        # 768-dim embedding -> 512 hidden units
        self.fc2 = nn.Linear(512, 2)          # 512 hidden units -> 2 classes
        self.softmax = nn.LogSoftmax(dim=1)   # log probabilities over the classes

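Input: tokenized email text (processed with the DistilBERT tokenizer).
Output: log probabilities for the two classes (spam or non-spam).
Layers:
DistilBERT for contextual embeddings (768 dimensions).
Dropout (0.1) for regularization.
Fully connected layers (768 → 512 → 2) with ReLU activation.
LogSoftmax for classification.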
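
Because the head ends in LogSoftmax, the model's outputs are log probabilities rather than probabilities. The following is a minimal sketch of converting one output row back to class probabilities, assuming the row is ordered [non-spam, spam] to match the 0/1 labels. The helper name `log_probs_to_probs` and the sample values are illustrative and not part of the model's code.

```python
import math

def log_probs_to_probs(log_probs):
    """Exponentiate LogSoftmax outputs to recover class probabilities."""
    return [math.exp(v) for v in log_probs]

# Hypothetical output row for one email: [log P(non-spam), log P(spam)]
row = [math.log(0.1), math.log(0.9)]
probs = log_probs_to_probs(row)

# Index of the largest probability is the predicted label
label = max(range(len(probs)), key=probs.__getitem__)
print(probs, "spam" if label == 1 else "non-spam")
```

Taking the argmax of the log probabilities directly gives the same label, so the conversion is only needed when a probability or a custom decision threshold is wanted.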

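Performance

Evaluated on a test set of 3,021 samples, the model achieves the following metrics:

Class	Precision	Recall	F1-score	Samples
Non-spam (0)	0.98	0.98	0.98	1,870
Spam (1)	0.96	0.97	0.96	1,151

Accuracy: 97%
Macro average: precision 0.97, recall 0.97, F1-score 0.97
Weighted average: precision 0.97, recall 0.97, F1-score 0.97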
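
As a sanity check on how the averages relate to the per-class rows, the sketch below recomputes the macro average (unweighted mean over classes) and the weighted average (mean weighted by sample count) from the rounded table values. Because the inputs are already rounded to two decimals, the results are only expected to approximate the reported figures. The variable and function names are illustrative.

```python
# Per-class metrics copied from the table: (precision, recall, f1, samples)
per_class = {
    "non-spam": (0.98, 0.98, 0.98, 1870),
    "spam": (0.96, 0.97, 0.96, 1151),
}

total = sum(m[3] for m in per_class.values())

def macro(i):
    """Unweighted mean of metric i across classes."""
    return sum(m[i] for m in per_class.values()) / len(per_class)

def weighted(i):
    """Mean of metric i weighted by each class's sample count."""
    return sum(m[i] * m[3] for m in per_class.values()) / total

macro_avg = [macro(i) for i in range(3)]
weighted_avg = [weighted(i) for i in range(3)]

print(total)
print([round(v, 3) for v in macro_avg])
print([round(v, 3) for v in weighted_avg])
```

The class counts sum to the stated 3,021 samples, and every recomputed average falls within 0.01 of the reported 0.97.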

Usage

Install dependencies:

pip install transformers onnxruntime torch

Load the model: load the ONNX model with a compatible inference engine such as ONNX Runtime. Example:

from transformers import DistilBertTokenizer
import onnxruntime as ort
import numpy as np

# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Load the ONNX model
session = ort.InferenceSession("path_to_model.onnx")

# Tokenize the input
text = "Your example email text here"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

# Run inference; the first output holds the per-class log probabilities
outputs = session.run(None, dict(inputs))[0]
prediction = np.argmax(outputs, axis=1)
print("Spam" if prediction[0] == 1 else "Non-Spam")

Input requirements:

Text input must be tokenized with the DistilBERT tokenizer.
Maximum sequence length: 512 tokens.
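
An exported ONNX graph accepts only the input names it declares, so it can help to filter the tokenizer output against those names before calling `session.run`. The following is a minimal sketch, assuming the declared names are read with ONNX Runtime's `session.get_inputs()` (each entry exposes a `.name`). The helper `select_feeds` and the example names are hypothetical; the actual input names of the exported model are not stated in this card.

```python
def select_feeds(encoded, input_names):
    """Keep only the tokenizer outputs that the ONNX graph declares as inputs."""
    missing = [name for name in input_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks required inputs: {missing}")
    return {name: encoded[name] for name in input_names}

# Stand-in values; in practice `encoded` would be dict(tokenizer(...)) and
# `input_names` would be [i.name for i in session.get_inputs()].
encoded = {
    "input_ids": [[101, 7592, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}
input_names = ["input_ids", "attention_mask"]  # assumed names, for illustration only

feeds = select_feeds(encoded, input_names)
print(sorted(feeds))
```

Failing early on a missing name tends to give a clearer error than the one raised inside the runtime.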

Dataset

The model was fine-tuned on the Phishing Emails Dataset, which contains labeled email samples for spam and phishing detection.

Limitations

Only the ONNX format is provided; there are no PyTorch or TensorFlow checkpoints.
Maximum input length is 512 tokens; longer emails are truncated.
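
One common workaround for the truncation limit, not something this model card describes, is to split a long email into overlapping windows of token ids and classify each window separately. The sketch below assumes the ids were produced without special tokens, that 101 and 102 are the [CLS] and [SEP] ids of the `distilbert-base-uncased` vocabulary, and that an email should be flagged if any window is classified as spam. The function names, the stride, and the aggregation rule are all illustrative choices.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=101, sep_id=102):
    """Split a long id sequence (without special tokens) into overlapping windows.

    Each window is wrapped in [CLS] ... [SEP] and holds at most max_len ids.
    """
    body = max_len - 2  # room left after the two special tokens
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than max_len - 2")
    step = body - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break
        start += step
    return chunks

def any_chunk_is_spam(chunk_labels):
    """Flag the email if at least one chunk was classified as spam (1)."""
    return any(label == 1 for label in chunk_labels)

ids = list(range(1000, 2200))  # 1,200 placeholder token ids
chunks = chunk_token_ids(ids)
print(len(chunks), max(len(c) for c in chunks))
```

The "any window" rule favors recall over precision; averaging the per-window spam probabilities is an alternative when false positives are costly.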

masterburator3301/distilbert-spam-phishing-classification-onnx

Author: masterburator3301

text-classification

Created: 2025-04-26 07:48:32+00:00

Updated: 2025-06-13 13:32:12+00:00

Files (7)

.gitattributes

README.md

saved_weights_distilbert4.pt

text_classifier.onnx

tokenizer.json

tokenizer_config.json

vocab.txt