gravitee-io/bert-small-pii-detection 🚀

A more accurate PII detector, fine-tuned from prajjwal1/bert-small on the datasets described in the metadata.

About the dataset:

We combined several datasets to cover a broad range of document formats, such as:

  1. JSON
  2. HTML
  3. XML
  4. SQL
  5. Plain documents

Label set

AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
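Assuming a BIO tagging scheme (the authoritative mapping lives in the repo's config.json under id2label), the 24 entity types above expand to 49 token-level tags:

```python
# Entity types from the label set above
ENTITIES = [
    "AGE", "COORDINATE", "CREDIT_CARD", "DATE_TIME", "EMAIL_ADDRESS",
    "FINANCIAL", "IBAN_CODE", "IMEI", "IP_ADDRESS", "LOCATION",
    "MAC_ADDRESS", "NRP", "ORGANIZATION", "PASSWORD", "PERSON",
    "PHONE_NUMBER", "TITLE", "URL", "US_BANK_NUMBER", "US_DRIVER_LICENSE",
    "US_ITIN", "US_LICENSE_PLATE", "US_PASSPORT", "US_SSN",
]

# BIO expansion: one "O" tag plus B-/I- variants of every entity type
bio_tags = ["O"] + [f"{prefix}-{ent}" for ent in ENTITIES for prefix in ("B", "I")]
print(len(bio_tags))  # 49 tags for 24 entity types
```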

How to use

Quick start (pipeline)

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo = "gravitee-io/bert-small-pii-detection"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
text = "My name is John Smith and my email is john@example.com"  # illustrative example input
pipe(text)

Evaluation results

Metrics: per-entity precision / recall / F1, plus micro and macro averages

Entity Precision Recall F1-score Support
AGE 0.9898 0.8858 0.9349 219
COORDINATE 0.9627 0.8738 0.9161 325
CREDIT_CARD 0.9273 0.8870 0.9067 115
DATE_TIME 0.8598 0.7364 0.7933 3255
EMAIL_ADDRESS 0.9428 0.8941 0.9178 387
FINANCIAL 0.9862 0.9565 0.9711 299
IBAN_CODE 0.9577 0.9252 0.9412 147
IMEI 0.9885 0.9663 0.9773 89
IP_ADDRESS 0.9338 0.8812 0.9068 160
LOCATION 0.8849 0.8222 0.8524 4264
MAC_ADDRESS 0.9889 1.0000 0.9944 89
NRP 1.0000 0.9818 0.9908 494
ORGANIZATION 0.7454 0.6688 0.7051 3551
PASSWORD 0.8384 0.8137 0.8259 102
PERSON 0.9123 0.8826 0.8972 4454
PHONE_NUMBER 0.9462 0.8199 0.8785 322
TITLE 0.9887 0.9734 0.9810 451
URL 1.0000 0.9787 0.9892 188
US_BANK_NUMBER 1.0000 0.9579 0.9785 95
US_DRIVER_LICENSE 0.9167 0.9167 0.9167 120
US_ITIN 0.9659 0.8763 0.9189 97
US_LICENSE_PLATE 1.0000 0.9000 0.9474 90
US_PASSPORT 0.9200 0.9200 0.9200 100
US_SSN 0.9744 0.9580 0.9661 119
micro avg 0.8804 0.8141 0.8460 19532
macro avg 0.9429 0.8948 0.9178 19532
weighted avg 0.8785 0.8141 0.8446 19532
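As a quick sanity check, the micro-averaged F1 in the table is the harmonic mean of the micro precision and recall:

```python
# Micro-averaged precision and recall from the table above
p, r = 0.8804, 0.8141

# F1 is the harmonic mean of precision and recall
f1 = 2 * p * r / (p + r)
print(round(f1, 4))  # 0.846, consistent with the reported micro-avg F1 of 0.8460
```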

Intended uses & limitations

Suitable scenarios:

  • Low-resource environments
  • Redacting PII in customer-support logs, dev/test environments, API traces, and articles
  • Real-time hints in form fields or data-entry systems

Limitations:

  • Primarily English; performance degrades on other languages
  • Domain shift is a real concern: audit the model on your own data
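The redaction use case can be sketched as a small post-processing step over the pipeline's aggregated output. The helper below is hypothetical (not shipped with the model); it only assumes the standard start/end/entity_group keys that aggregation_strategy="simple" produces:

```python
def redact(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder.

    `entities` is a list of dicts with "start", "end", and "entity_group"
    keys, as returned by a token-classification pipeline with
    aggregation_strategy="simple".
    """
    # Work right-to-left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Example with hand-written spans (in practice: entities = pipe(text))
sample = "Contact Jane Doe at jane@example.com"
spans = [
    {"start": 8, "end": 16, "entity_group": "PERSON"},
    {"start": 20, "end": 36, "entity_group": "EMAIL_ADDRESS"},
]
print(redact(sample, spans))  # Contact [PERSON] at [EMAIL_ADDRESS]
```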

Citation

If you use this model, please consider citing the following papers:

@misc{bhargava2021generalization,
      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
      year={2021},
      eprint={2110.01518},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{DBLP:journals/corr/abs-1908-08962,
  author    = {Iulia Turc and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
               on Knowledge Distillation},
  journal   = {CoRR},
  volume    = {abs/1908.08962},
  year      = {2019},
  url       = {http://arxiv.org/abs/1908.08962},
  eprinttype = {arXiv},
  eprint    = {1908.08962},
  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@online{kilimnik2022privy,
  author = {Benjamin Kilimnik},
  title = {{Privy} Synthetic PII Protocol Trace Dataset},
  year = 2022,
  url = {https://huggingface.co/datasets/beki/privy},
}

@online{gretel2023,
  author = {Gretel.ai},
  title = {{Synthetic PII Finance Multilingual Dataset}},
  year = 2023,
  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
    year = "2003",
    url = "https://aclanthology.org/W03-0419",
}

bnunamak/bert-small-pii-detection

Author: bnunamak

token-classification

Created: 2025-11-25 14:15:11+00:00

Updated: 2025-11-25 14:37:35+00:00


Files (10)

.gitattributes
README.md
config.json
model.safetensors
onnx/model.onnx ONNX
onnx/model_quantized.onnx ONNX
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt