gravitee-io/bert-small-pii-detection 🚀

A more accurate PII detector, fine-tuned from prajjwal1/bert-small on the datasets described in the metadata.

About the dataset:

We combined several datasets to cover a broad range of document formats, such as:

  1. JSON
  2. HTML
  3. XML
  4. SQL
  5. Plain documents

Label set

AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
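Assuming a BIO tagging scheme (the authoritative mapping lives in the repo's config.json under id2label), the 24 entity types above expand to 49 token-level tags:

```python
# Entity types from the label set above
ENTITIES = [
    "AGE", "COORDINATE", "CREDIT_CARD", "DATE_TIME", "EMAIL_ADDRESS",
    "FINANCIAL", "IBAN_CODE", "IMEI", "IP_ADDRESS", "LOCATION",
    "MAC_ADDRESS", "NRP", "ORGANIZATION", "PASSWORD", "PERSON",
    "PHONE_NUMBER", "TITLE", "URL", "US_BANK_NUMBER", "US_DRIVER_LICENSE",
    "US_ITIN", "US_LICENSE_PLATE", "US_PASSPORT", "US_SSN",
]

# BIO expansion: one "O" tag plus B-/I- variants of every entity type
bio_tags = ["O"] + [f"{prefix}-{ent}" for ent in ENTITIES for prefix in ("B", "I")]
print(len(bio_tags))  # 49 tags for 24 entity types
```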

How to use

Quick start (pipeline)

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo = "gravitee-io/bert-small-pii-detection"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
text = "My name is John Smith and my email is john@example.com"  # illustrative example input
pipe(text)

Evaluation results

Metrics: per-entity precision / recall / F1, plus micro and macro averages

Entity Precision Recall F1-score Support
AGE 0.9898 0.8858 0.9349 219
COORDINATE 0.9627 0.8738 0.9161 325
CREDIT_CARD 0.9273 0.8870 0.9067 115
DATE_TIME 0.8598 0.7364 0.7933 3255
EMAIL_ADDRESS 0.9428 0.8941 0.9178 387
FINANCIAL 0.9862 0.9565 0.9711 299
IBAN_CODE 0.9577 0.9252 0.9412 147
IMEI 0.9885 0.9663 0.9773 89
IP_ADDRESS 0.9338 0.8812 0.9068 160
LOCATION 0.8849 0.8222 0.8524 4264
MAC_ADDRESS 0.9889 1.0000 0.9944 89
NRP 1.0000 0.9818 0.9908 494
ORGANIZATION 0.7454 0.6688 0.7051 3551
PASSWORD 0.8384 0.8137 0.8259 102
PERSON 0.9123 0.8826 0.8972 4454
PHONE_NUMBER 0.9462 0.8199 0.8785 322
TITLE 0.9887 0.9734 0.9810 451
URL 1.0000 0.9787 0.9892 188
US_BANK_NUMBER 1.0000 0.9579 0.9785 95
US_DRIVER_LICENSE 0.9167 0.9167 0.9167 120
US_ITIN 0.9659 0.8763 0.9189 97
US_LICENSE_PLATE 1.0000 0.9000 0.9474 90
US_PASSPORT 0.9200 0.9200 0.9200 100
US_SSN 0.9744 0.9580 0.9661 119
micro avg 0.8804 0.8141 0.8460 19532
macro avg 0.9429 0.8948 0.9178 19532
weighted avg 0.8785 0.8141 0.8446 19532
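As a quick sanity check, the micro-averaged F1 in the table is the harmonic mean of the micro precision and recall:

```python
# Micro-averaged precision and recall from the table above
p, r = 0.8804, 0.8141

# F1 is the harmonic mean of precision and recall
f1 = 2 * p * r / (p + r)
print(round(f1, 4))  # 0.846, consistent with the reported micro-avg F1 of 0.8460
```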

Intended uses & limitations

Suitable scenarios:

  • Low-resource environments
  • Redacting PII in customer-support logs, dev/test environments, API traces, and articles
  • Real-time hints in form fields or data-entry systems

Limitations:

  • Primarily English; performance degrades on other languages
  • Domain shift is a real concern: audit the model on your own data
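The redaction use case can be sketched as a small post-processing step over the pipeline's aggregated output. The helper below is hypothetical (not shipped with the model); it only assumes the standard start/end/entity_group keys that aggregation_strategy="simple" produces:

```python
def redact(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder.

    `entities` is a list of dicts with "start", "end", and "entity_group"
    keys, as returned by a token-classification pipeline with
    aggregation_strategy="simple".
    """
    # Work right-to-left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Example with hand-written spans (in practice: entities = pipe(text))
sample = "Contact Jane Doe at jane@example.com"
spans = [
    {"start": 8, "end": 16, "entity_group": "PERSON"},
    {"start": 20, "end": 36, "entity_group": "EMAIL_ADDRESS"},
]
print(redact(sample, spans))  # Contact [PERSON] at [EMAIL_ADDRESS]
```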

Citation

If you use this model, please consider citing the following papers:

@misc{bhargava2021generalization,
      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
      year={2021},
      eprint={2110.01518},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{DBLP:journals/corr/abs-1908-08962,
  author    = {Iulia Turc and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
               on Knowledge Distillation},
  journal   = {CoRR},
  volume    = {abs/1908.08962},
  year      = {2019},
  url       = {http://arxiv.org/abs/1908.08962},
  eprinttype = {arXiv},
  eprint    = {1908.08962},
  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@online{kilimnik2022privy,
  author = {Benjamin Kilimnik},
  title = {{Privy} Synthetic PII Protocol Trace Dataset},
  year = 2022,
  url = {https://huggingface.co/datasets/beki/privy},
}

@online{gretel2023,
  author = {Gretel.ai},
  title = {{Synthetic PII Finance Multilingual Dataset}},
  year = 2023,
  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
    year = "2003",
    url = "https://aclanthology.org/W03-0419",
}

bnunamak/bert-small-pii-detection

Author: bnunamak

token-classification

Created: 2025-11-25 14:15:11+00:00

Updated: 2025-11-25 14:37:35+00:00


Files (10)

.gitattributes
README.md
config.json
model.safetensors
onnx/model.onnx ONNX
onnx/model_quantized.onnx ONNX
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt