README

bert-base-NER

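If my open source models have been useful to you, please consider supporting me in building small, useful AI models for everyone (and helping me afford med school / helping out my parents financially). Thanks!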

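Model description

bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition (NER) and reaches an F1 of 91.3 on the CoNLL-2003 test set. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

If you would like a larger BERT-large model fine-tuned on the same dataset, a bert-large-NER version is also available.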

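Available NER models

Model name	Description	Parameters
distilbert-NER (NEW!)	Fine-tuned DistilBERT, a smaller, faster, lighter version of BERT	66M
bert-large-NER	Fine-tuned bert-large-cased, a larger model with slightly better performance	340M
bert-base-NER-(uncased)	Fine-tuned bert-base, available in both cased and uncased versions	110M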

Intended uses & limitations

How to use

You can use this model with the Transformers pipeline for NER.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Build an NER pipeline and run it on a sample sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)  # list of dicts, one per token tagged as an entity

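Limitations and bias

This model is limited by its training dataset of entity-annotated news articles from a specific span of time, so it may not generalize well to use cases in other domains. The model also occasionally tags subword tokens as entities, and the results may need post-processing to handle those cases.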
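
One possible post-processing step is sketched below. It assumes the un-aggregated output format of the Transformers NER pipeline (a list of dicts with `word`, `entity`, `score`, `start`, and `end` keys) and WordPiece continuation pieces prefixed with `##`. The helper name `merge_subwords` is hypothetical, and the sample input is hand-written for illustration rather than real model output.

```python
from typing import Dict, List


def merge_subwords(tokens: List[Dict]) -> List[Dict]:
    """Fold WordPiece continuation pieces ("##...") into the preceding token.

    Assumes each dict has "word", "entity", "score", "start", and "end" keys
    and that tokens arrive in reading order. The merged token keeps the
    entity label of its first piece.
    """
    merged: List[Dict] = []
    for tok in tokens:
        is_continuation = tok["word"].startswith("##")
        adjacent = bool(merged) and merged[-1]["end"] == tok["start"]
        if is_continuation and adjacent:
            prev = merged[-1]
            prev["word"] += tok["word"][2:]  # drop the "##" marker
            prev["end"] = tok["end"]
            # Keep the lowest piece score as a conservative word-level score.
            prev["score"] = min(prev["score"], tok["score"])
        else:
            merged.append(dict(tok))  # copy so the input is not mutated
    return merged


# Hand-written sample in the pipeline's output format (not real model output).
sample = [
    {"word": "Wolfgang", "entity": "B-PER", "score": 0.99, "start": 11, "end": 19},
    {"word": "Be", "entity": "B-LOC", "score": 0.98, "start": 34, "end": 36},
    {"word": "##rlin", "entity": "I-LOC", "score": 0.95, "start": 36, "end": 40},
]

for token in merge_subwords(sample):
    print(token["word"], token["entity"], token["start"], token["end"])
```

A continuation piece whose preceding piece was not tagged is left unchanged by this sketch. The Transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities, which may be enough for many cases.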

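Training data

This model was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset.

The training dataset distinguishes between the beginning and the continuation of an entity, so that when two entities of the same type appear back to back, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes:

Abbreviation	Description
O	Outside of a named entity
B-MISC	Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC	Miscellaneous entity
B-PER	Beginning of a person's name right after another person's name
I-PER	Person's name
B-ORG	Beginning of an organization right after another organization
I-ORG	Organization
B-LOC	Beginning of a location right after another location
I-LOC	Location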
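
To make the tagging scheme concrete, here is a minimal sketch that groups a sequence of per-token tags into entity spans. The function name `tags_to_spans` and the example tag sequence are illustrative assumptions. The decoder is written to accept both the convention in the table above (B- only after a same-type entity) and the common variant in which every entity starts with B-.

```python
from typing import List, Tuple


def tags_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group per-token tags into (entity_type, start, end) spans, end exclusive.

    A span opens on any B- tag, or on an I- tag that does not continue the
    currently open span (different type, or no span open).
    """
    spans: List[Tuple[str, int, int]] = []
    open_type = None
    open_start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == open_type
        if open_type is not None and not continues:
            spans.append((open_type, open_start, i))  # close the open span
            open_type = None
        if prefix in ("B", "I") and not continues:
            open_type, open_start = ent_type, i  # open a new span
    return spans


# Illustrative tags: two back-to-back person names, then a location.
example_tags = ["I-PER", "I-PER", "B-PER", "O", "I-LOC"]
print(tags_to_spans(example_tags))
```

In the example, the B-PER tag at index 2 is what separates the second person name from the first; without it, the three PER tokens would be read as a single entity.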

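CoNLL-2003 English Dataset Statistics

This dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how the dataset was created in the CoNLL-2003 paper.

# of training examples per entity type

Dataset	LOC	MISC	ORG	PER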
Train	7140	3438	6321	6600
Dev	1837	922	1341	1842
Test	1668	702	1661	1617

# of articles/sentences/tokens per dataset

Dataset	Articles	Sentences	Tokens
Train	946	14,987	203,621
Dev	216	3,466	51,362
Test	231	3,684	46,435
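
As a quick reading aid, the counts above can be turned into per-split averages. The snippet below is plain arithmetic on the table; the names `splits` and `averages` are illustrative.

```python
# Counts copied from the table above: (articles, sentences, tokens).
splits = {
    "Train": (946, 14_987, 203_621),
    "Dev": (216, 3_466, 51_362),
    "Test": (231, 3_684, 46_435),
}


def averages(articles: int, sentences: int, tokens: int):
    """Return (sentences per article, tokens per sentence), rounded to 1 decimal."""
    return round(sentences / articles, 1), round(tokens / sentences, 1)


summary = {name: averages(*counts) for name, counts in splits.items()}
for name, (sent_per_article, tok_per_sent) in summary.items():
    print(f"{name}: {sent_per_article} sentences/article, {tok_per_sent} tokens/sentence")
```

Sentences in every split average roughly 13 to 15 tokens, which is worth keeping in mind when applying the model to much longer or differently structured text.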

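Training procedure

This model was trained on a single NVIDIA V100 GPU with the hyperparameters recommended in the original BERT paper, which trained and evaluated the model on the CoNLL-2003 NER task.

Eval results

Metric	dev	test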
f1	95.1	91.3
precision	95.0	90.7
recall	95.3	91.9

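The test metrics are slightly lower than the official Google BERT results, which encoded document context and experimented with CRF. The original model card links to more information on replicating the original results.

BibTeX entry and citation info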
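
The three metrics in the table above are related: assuming the reported f1 is the usual harmonic mean of precision and recall, it can be recomputed from the other two rows. The sketch below does that check; `f1_score` is an illustrative helper defined here, not a library function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the table above.
reported = {"dev": (95.0, 95.3, 95.1), "test": (90.7, 91.9, 91.3)}

for split, (p, r, f1) in reported.items():
    print(split, "recomputed:", round(f1_score(p, r), 1), "reported:", f1)
```

Because the published precision and recall are themselves rounded, a recomputed value could in general differ from a reported one in the last digit; for these two rows the values agree.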

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
  author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
  booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
  year = "2003",
  url = "https://www.aclweb.org/anthology/W03-0419",
  pages = "142--147",
}

dslim/bert-base-NER

Author: dslim

token-classification transformers

Created: 2022-03-02 23:29:05+00:00

Updated: 2024-10-08 07:51:39+00:00
