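Documentation

This model predicts punctuation for English, Italian, French, and German text. We developed it to restore the punctuation of transcribed spoken language.

This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG shared task. Note that this dataset consists of political speeches, so the model may perform differently on text from other domains.

The model restores the following punctuation marks: "." "," "?" "-" ":"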

Sample Code

We provide a simple Python package that lets you process text of any length.
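One common way to handle input longer than a model's context window is to split the word list into overlapping windows and process each window separately. The sketch below only illustrates that general idea. It is not taken from the package, and `overlapping_chunks`, `chunk_size`, and `overlap` are hypothetical names with illustrative default values.

```python
from typing import Iterator, List


def overlapping_chunks(
    words: List[str], chunk_size: int = 200, overlap: int = 5
) -> Iterator[List[str]]:
    """Yield windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words, so a word near a window
    border is also seen with some right-hand context in the next window.
    The default values are illustrative assumptions, not package settings.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield words[start:start + chunk_size]
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break


words = "My name is Clara and I live in Berkeley California".split()
for chunk in overlapping_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

With the package itself, none of this is needed: `restore_punctuation` accepts the full text directly.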

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

Output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
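Each entry in the output above is a `[word, label, score]` triple, where the label `'0'` means that no punctuation follows the word. As a minimal sketch of how such a list could be turned back into text, the helper below appends each predicted mark to its word. `labels_to_text` and its `min_score` threshold are hypothetical and not part of the package.

```python
from typing import List, Sequence


def labels_to_text(labeled_words: Sequence[Sequence], min_score: float = 0.0) -> str:
    """Join [word, label, score] triples into a punctuated string.

    A predicted mark is appended only if its score is at least `min_score`
    (a hypothetical knob for dropping low-confidence predictions).
    """
    parts: List[str] = []
    for word, label, score in labeled_words:
        # The label "0" stands for "no punctuation after this word".
        if label != "0" and score >= min_score:
            parts.append(word + label)
        else:
            parts.append(word)
    return " ".join(parts)


# A few triples copied from the output above.
sample = [
    ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047],
    ["Ist", "0", 0.99998784],
]
print(labels_to_text(sample))  # Berkeley, California. Ist
```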

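Results

Performance differs across punctuation marks, because hyphens and colons are often optional and can be replaced by a comma or a full stop. The model achieves the following F1 scores for the different languages: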

Label	English	German	French	Italian
0	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
Macro average	0.775	0.814	0.782	0.762
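The macro average row is consistent with an unweighted mean of the six per-label F1 scores in each column. The short check below recomputes it from the table values. The dictionary layout and the `macro_f1` helper are illustrative only.

```python
from statistics import mean

# Per-label F1 scores copied from the table above.
f1_scores = {
    "English": {"0": 0.991, ".": 0.948, "?": 0.890, ",": 0.819, ":": 0.575, "-": 0.425},
    "German": {"0": 0.997, ".": 0.961, "?": 0.893, ",": 0.945, ":": 0.652, "-": 0.435},
    "French": {"0": 0.992, ".": 0.945, "?": 0.871, ",": 0.831, ":": 0.620, "-": 0.431},
    "Italian": {"0": 0.989, ".": 0.942, "?": 0.832, ",": 0.798, ":": 0.588, "-": 0.421},
}


def macro_f1(per_label: dict) -> float:
    """Unweighted mean over labels, rounded to three decimals like the table."""
    return round(mean(list(per_label.values())), 3)


for language, per_label in f1_scores.items():
    print(language, macro_f1(per_label))
```

Because every label counts equally, the low scores for ":" and "-" pull the macro average well below the scores for the frequent labels.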

Languages

Models

Languages	Model
English, Italian, French and German	oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch	oliverguhr/fullstop-punctuation-multilingual-sonar-base
Dutch	oliverguhr/fullstop-dutch-sonar-punctuation-prediction

Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian	kredor/punctuate-all
Catalan	softcatala/fullstop-catalan-punctuation-prediction
Welsh	techiaith/fullstop-welsh-punctuation-prediction

You can use a different model by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")
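If you choose the model by language at runtime, a small lookup over the "Models" table can produce the string to pass as the `model` parameter. This is only a sketch: `MODELS_BY_LANGUAGE` and `models_for` are hypothetical names, and the mapping covers just the three models listed in that table.

```python
from typing import Dict, List, Set

# Model ids and their languages, copied from the "Models" table above.
MODELS_BY_LANGUAGE: Dict[str, Set[str]] = {
    "oliverguhr/fullstop-punctuation-multilang-large": {
        "english", "italian", "french", "german",
    },
    "oliverguhr/fullstop-punctuation-multilingual-sonar-base": {
        "english", "italian", "french", "german", "dutch",
    },
    "oliverguhr/fullstop-dutch-sonar-punctuation-prediction": {"dutch"},
}


def models_for(language: str) -> List[str]:
    """Return the listed model ids that cover `language`, in table order."""
    wanted = language.strip().lower()
    return [name for name, langs in MODELS_BY_LANGUAGE.items() if wanted in langs]


print(models_for("Dutch"))
# The chosen id could then be passed on, for example:
# model = PunctuationModel(model=models_for("Dutch")[0])
```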

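Where can I find the code, and can I train my own model?

Yes, you can! See this repository for the complete code of the research project.

There is also a guide on how to fine-tune this model for your own data or language.

References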

@article{guhr-EtAl:2021:fullstop,
  title     = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month     = {June},
  year      = {2021},
  address   = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

oliverguhr/fullstop-punctuation-multilang-large

Author: oliverguhr

token-classification transformers

Created: 2022-03-02 23:29:05+00:00

Updated: 2023-11-16 09:35:35+00:00
