NCI Technique Classifier v5.2

A multi-label propaganda technique classifier based on ModernBERT that identifies 18 propaganda techniques from the SemEval-2020 Task 11 taxonomy.

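Model Description

This model is part of the NCI (Narrative Coordination Index) protocol for detecting coordinated influence operations. It classifies text into 18 propaganda techniques and outputs well-calibrated probabilities.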

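Key Improvements in v5.2

Reduced false positives: the false-positive rate on scientific/factual content dropped from 35% in v4 to 8.8%
Better calibration: ASL loss with clip=0.02 yields more discriminative probability outputs
Hard-negative training: trained on the v5 dataset, which includes 1000+ hard negatives (scientific, business, and factual content)
Document-level analysis: works well on full documents without sentence-level splitting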

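Training Details

Base model: answerdotai/ModernBERT-base
Dataset: synapti/nci-propaganda-v5 (24,037 samples)
Loss function: Asymmetric Loss (ASL)
- gamma_neg: 4.0
- gamma_pos: 1.0
- clip: 0.02 (reduced from 0.05 to lessen probability shifting)
Training: 3 epochs, lr=2e-5, batch_size=16
Validation: 4/7 tests passed (57%)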
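
To make the role of the three loss hyperparameters concrete, here is a minimal pure-Python sketch of the commonly used asymmetric loss formulation for multi-label classification. It is an illustration under stated assumptions, not the training code behind this model: the function name `asymmetric_loss`, the `eps` guard, and the example values are hypothetical, and real training would operate on batched tensors.

```python
import math

def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=1.0, clip=0.02, eps=1e-8):
    """Sum of per-label asymmetric losses for one sample (illustrative sketch).

    probs   -- sigmoid probabilities, one per label
    targets -- 0/1 ground-truth flags, one per label
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style down-weighting with gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by `clip`, so very
            # confident negatives (p <= clip) contribute exactly zero loss.
            p_m = max(p - clip, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total

# Hypothetical values: one positive label and two negatives.
print(asymmetric_loss([0.9, 0.01, 0.3], [1, 0, 0]))
```

With gamma_neg larger than gamma_pos, easy negatives are suppressed much more strongly than easy positives, which is the asymmetry the name refers to. A smaller clip narrows the band of negatives that are ignored outright.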

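Detected Techniques

ID	Technique	Description
0	Loaded_Language	Words with strong emotional connotations
1	Appeal_to_fear-prejudice	Building support through fear or prejudice
2	Exaggeration,Minimisation	Exaggerating or minimizing facts
3	Repetition	Repeating a message to reinforce it
4	Flag-Waving	Appealing to patriotism or national identity
5	Name_Calling,Labeling	Using labels to evoke prejudice
6	Reductio_ad_hitlerum	Drawing comparisons to Hitler or the Nazis
7	Black-and-White_Fallacy	Presenting only two options
8	Causal_Oversimplification	Assuming a single cause for a complex issue
9	Whataboutism,Straw_Men,Red_Herring	Distraction techniques
10	Straw_Man	Misrepresenting an opponent's position
11	Red_Herring	Introducing irrelevant topics
12	Doubt	Questioning credibility
13	Appeal_to_Authority	Using authority figures to support a claim
14	Thought-terminating_Cliches	Phrases that shut down critical thought
15	Bandwagon	"Everyone is doing it" appeals
16	Slogans	Catchy, memorable phrases
17	Obfuscation,Intentional_Vagueness,Confusion	Deliberately vague language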

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "synapti/nci-technique-classifier-v5.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "This is OUTRAGEOUS! They are LYING to you. WAKE UP!"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)[0]

# Get techniques with probability > 0.5
LABELS = [
    "Loaded_Language", "Appeal_to_fear-prejudice", "Exaggeration,Minimisation",
    "Repetition", "Flag-Waving", "Name_Calling,Labeling", "Reductio_ad_hitlerum",
    "Black-and-White_Fallacy", "Causal_Oversimplification",
    "Whataboutism,Straw_Men,Red_Herring", "Straw_Man", "Red_Herring", "Doubt",
    "Appeal_to_Authority", "Thought-terminating_Cliches", "Bandwagon", "Slogans",
    "Obfuscation,Intentional_Vagueness,Confusion"
]

for label, prob in zip(LABELS, probs):
    if prob > 0.5:
        print(f"{label}: {prob:.1%}")
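
The thresholding step above can also be wrapped in a small reusable helper that returns the matching techniques sorted by probability. This is a sketch only: `select_techniques` and the example values are hypothetical and not part of the model repository, and the default threshold of 0.5 simply mirrors the snippet above.

```python
def select_techniques(probs, labels, threshold=0.5):
    """Return (label, probability) pairs above `threshold`, highest first."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    hits = [(label, float(p)) for label, p in zip(labels, probs) if float(p) > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Illustrative values only, not real model output.
example_labels = ["Loaded_Language", "Doubt", "Slogans"]
example_probs = [0.81, 0.12, 0.64]
print(select_techniques(example_probs, example_labels))
```

With the model output from the usage snippet, the call would be `select_techniques(probs.tolist(), LABELS)`. Raising the threshold trades recall for fewer false positives.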

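Performance

Validation Results

Test case	v5.2	v4	Status
Pure propaganda	66.8%	70.8%	✓ Detected
Neutral news	6.9%	5.5%	✓ Clean
SpaceX factual content	3.7%	-	✓ Clean
Multi-label propaganda	76.5%	-	✓ Detected
Mixed content	7.3%	-	-
Fear appeal	69.9%	-	✓ Detected
Scientific report	8.8%	35.4%	✓ Clean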

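Key Metrics

Scientific report false-positive rate: 8.8% (v4: 35%), a 75% reduction
Factual news false-positive rate: 4.6% (v4: 29%), an 84% reduction
Propaganda detection: remains strong (maximum confidence on propaganda content: 73.7%)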
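
The reduction percentages quoted above are relative reductions, that is, the drop divided by the v4 rate. The short sketch below reproduces the arithmetic from the figures in this section. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(old_rate, new_rate):
    """Fractional drop from old_rate to new_rate (0.75 means a 75% reduction)."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate

# Figures taken from the key metrics above (rates in percent).
print(f"Scientific reports: {relative_reduction(35.0, 8.8):.0%}")  # 75%
print(f"Factual news: {relative_reduction(29.0, 4.6):.0%}")        # 84%
```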

Citation

@inproceedings{da-san-martino-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
    author = "Da San Martino, Giovanni and others",
    booktitle = "Proceedings of the 14th International Workshop on Semantic Evaluation",
    year = "2020",
}

License

Apache 2.0

synapti/nci-technique-classifier-v5.2

Author: synapti

text-classification transformers

Created: 2025-12-11 22:14:50+00:00

Updated: 2025-12-11 23:09:07+00:00
