Model Overview
This model accepts as input lowercase, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in one pass.
Unlike many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, and can handle arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
Widget note: the text-generation widget does not seem to respect line breaks.
Instead, the pipeline inserts a newline token \n where the model predicts a sentence boundary (line break).
Usage
The easy way to use this model is to install punctuators:

```bash
pip install punctuators
```

If this package is broken, please let me know in the community tab (I update it for each model and break it frequently!).
Let's punctuate my weekend recap, along with a few interesting sentences containing acronyms and abbreviations that I made up or found on Wikipedia:
<details open>
<summary>Example Usage</summary>
```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    # Literally my weekend
    "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
    "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
    "when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars",
    # Real acronyms in sentences that I made up
    "george hw bush was the president of the us for 8 years",
    "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
    # See how the model performs on made-up acronyms
    "i went to the fgw store and bought a new tg optical scope",
    # The first few sentences of today's featured article summary from Wikipedia
    "it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality",
]

results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
Exact output may vary based on the model version; the following is the current output:
</details>
<details open>
<summary>Expected Output</summary>
```text
In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
Out: We like to take morning adventures on the weekends.

In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
Out: Despite being mid March, it snowed overnight and into the morning.
Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.

In: when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars
Out: When I got home, I trained this model on the Lambda Cloud.
Out: On an A100 GPU with about 10 million lines of text, the total budget was less than 5 dollars.

In: george hw bush was the president of the us for 8 years
Out: George H.W. Bush was the president of the U.S. for 8 years.

In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
Out: I suggested he get one of those new battery operated ones.
Out: They're so much quieter.

In: i went to the fgw store and bought a new tg optical scope
Out: I went to the FGW store and bought a new TG optical scope.

In: it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality
Out: It's that man again.
Out: ITMA was a radio comedy programme that was broadcast by the BBC for Twelve Series from 1939 to 1949, featuring Tommy Handley.
Out: In the central role, ITMA was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations.
Out: Parts of the scripts were rewritten in the hours before the broadcast to ensure topicality.
```
</details>
Model Details
This model implements the graph shown below, with brief descriptions of each step following.

- Encoding: The model begins by tokenizing the text with a subword tokenizer. The tokenizer used here is a SentencePiece model with a vocabulary size of 32k. Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
- Punctuation: The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens. Punctuation is predicted once per subword, to allow acronyms to be properly punctuated. An indirect benefit of per-subword prediction is that it allows the model to run in a graph that generalizes to continuous-script languages, e.g., Chinese.
- Sentence boundary detection: For sentence boundary detection, we condition the model on punctuation via embeddings. Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation. The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.
- Shift and concatenate sentence boundaries: In English, the first character of each sentence should be upper-cased. Thus, we should feed the sentence boundary information to the true-case classification network. Since the true-case classification network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence. Therefore, we shift the binary sentence boundary decisions to the right by one: if token N-1 is a sentence boundary, then token N is the first word of a sentence. Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
- True-case prediction: Armed with punctuation and sentence boundary information, a classification network predicts true-casing. Since true-casing should be done on a per-character basis, the classification network makes N predictions per token, where N is the length of the subtoken. (In practice, N is the longest possible subword, and extra predictions are ignored.) This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
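The shift-and-concatenate step above can be sketched as follows. This is a minimal illustration with plain Python lists standing in for tensors; the function name and shapes are hypothetical, not the model's actual code:

```python
def add_sentence_boundary_feature(encoded, sbd_preds):
    """Shift binary sentence-boundary decisions right by one and concatenate.

    encoded:   list of T feature vectors (one per subword)
    sbd_preds: list of T binary decisions (1 = this token ends a sentence)
    Returns a list of T vectors, each extended with a flag that is 1 when the
    token begins a new sentence (i.e., when token N-1 was a boundary).
    """
    # The very first token always starts a sentence; every other token
    # inherits the boundary decision of the token before it.
    first_word_flags = [1] + sbd_preds[:-1]
    return [vec + [flag] for vec, flag in zip(encoded, first_word_flags)]

encoded = [[0.1, 0.2] for _ in range(5)]  # 5 tokens, toy 2-dim "encodings"
sbd_preds = [0, 0, 1, 0, 0]               # boundary predicted after token 2
features = add_sentence_boundary_feature(encoded, sbd_preds)
print([f[-1] for f in features])          # tokens 0 and 3 are first words
```

The key point is the right shift: the boundary decision at position N-1 becomes the "first word" signal at position N, which is exactly the information the feed-forward true-casing head needs at each time step.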
This model's maximum length is 256 subtokens, due to the length of the trained embeddings. However, the punctuators package described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning the output, allowing inputs of arbitrary length.
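The overlapping-segment behavior can be sketched roughly as follows. This is a hypothetical helper, not the package's actual implementation; the overlap size is an arbitrary assumption:

```python
def split_with_overlap(tokens, max_len=256, overlap=32):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so predictions near a window
    edge can be fused with the same positions in the neighboring window
    (e.g., by preferring the prediction farther from an edge).
    """
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return windows

chunks = split_with_overlap(list(range(600)), max_len=256, overlap=32)
print([len(c) for c in chunks])  # adjacent windows share 32 tokens
```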
Punctuation
The model predicts the following set of punctuation tokens:

| Token | Description |
|---|---|
| NULL | Predict no punctuation |
| ACRONYM | Every character in this subword ends with a period |
| . | Latin full stop |
| , | Latin comma |
| ? | Latin question mark |
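To make the ACRONYM class concrete, here is a toy function that applies the labels in the table above to a subword (an illustration only, not the model's actual decoding code; the label strings mirror the report below):

```python
def apply_punct_label(subword: str, label: str) -> str:
    """Apply one predicted punctuation label to a subword token."""
    if label == "<NULL>":
        return subword  # no punctuation attached
    if label == "<ACRONYM>":
        # Every character in this subword ends with a period, e.g. "us" -> "u.s."
        return "".join(c + "." for c in subword)
    # A Latin full stop, comma, or question mark is appended to the subword
    return subword + label

print(apply_punct_label("us", "<ACRONYM>"))  # u.s.
print(apply_punct_label("years", "."))       # years.
```

True-casing would subsequently upper-case the characters, turning "u.s." into "U.S.".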
Training Details
Training Framework
This model was trained on a fork of the NeMo framework.
Training Data
This model was trained with News Crawl data from WMT.
Approximately 10M lines of data from 2021 and 2012 were used. The latter was used in an attempt to reduce bias: a year's news is typically dominated by a few topics, and 2021 was dominated by COVID discussions.
Limitations
Domain
This model was trained on news data, and may not perform well on conversational or informal data.
Noisy Training Data
The training data was noisy, and no manual cleaning was used.
Acronyms and Abbreviations
Acronyms and abbreviations are especially noisy; the tables below show how many variations of each token appear in the training data.
| Token | Count |
|---|---|
| Mr | 115232 |
| Mr. | 108212 |

| Token | Count |
|---|---|
| U.S. | 85324 |
| US | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
Therefore, the model's acronym and abbreviation predictions may be a bit unpredictable.
Sentence Boundary Detection Targets
The sentence boundary detection targets assume that each line of the input data is exactly one sentence. However, a non-negligible portion of the training data contains multiple sentences per line. Thus, the SBD head may miss an obvious sentence boundary if it resembles an error seen in the training data.
Evaluation
In these metrics, keep in mind that

- The data is noisy.

- Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes wrong. When conditioning on reference punctuation, the true-casing and SBD metrics are significantly higher with respect to the reference targets.

- Punctuation can be subjective, e.g.,

  Hello Frank, how's it going?

  or

  Hello Frank. How's it going?

  When the sentences are longer and more practical, these ambiguities abound and affect all three analytics.
Test Data and Example Generation
Each test example was generated using the following procedure:

- Concatenate 10 random sentences
- Lowercase the concatenated sentence
- Remove all punctuation

The data is a held-out portion of News Crawl, which has been deduplicated. 3,000 lines of data were used, generating 3,000 unique examples of 10 sentences each.
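The three-step generation procedure above can be sketched as a short script (hypothetical code matching the listed steps, not the actual evaluation tooling):

```python
import random
import re
import string

def make_test_example(sentences, n=10, seed=None):
    """Concatenate n random sentences, lowercase them, and strip punctuation."""
    rng = random.Random(seed)
    text = " ".join(rng.sample(sentences, n)).lower()
    # Remove every punctuation character, matching the model's expected input
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

corpus = ["Hello, Frank.", "How's it going?", "It snowed overnight."] * 4
example = make_test_example(corpus, n=3, seed=0)
print(example)  # lowercase, unpunctuated text
```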
Results
<details open>
<summary>Punctuation Report</summary>
```text
label                    precision  recall  f1     support
<NULL> (label_id: 0)     98.83      98.49   98.66  446496
<ACRONYM> (label_id: 1)  74.15      94.26   83.01  697
. (label_id: 2)          90.64      92.99   91.80  30002
, (label_id: 3)          77.19      79.13   78.15  23321
? (label_id: 4)          76.58      74.56   75.56  1022
-------------------
micro avg                97.21      97.21   97.21  501538
macro avg                83.48      87.89   85.44  501538
weighted avg             97.25      97.21   97.23  501538
```
</details>
<details open>
<summary>True-casing Report</summary>
```text
# With predicted punctuation (not aligned with targets)
label                precision  recall  f1     support
LOWER (label_id: 0)  99.76      99.72   99.74  2020678
UPPER (label_id: 1)  93.32      94.20   93.76  83873
-------------------
micro avg            99.50      99.50   99.50  2104551
macro avg            96.54      96.96   96.75  2104551
weighted avg         99.50      99.50   99.50  2104551

# With reference punctuation (punctuation matches targets)
label                precision  recall  f1     support
LOWER (label_id: 0)  99.83      99.81   99.82  2020678
UPPER (label_id: 1)  95.51      95.90   95.71  83873
-------------------
micro avg            99.66      99.66   99.66  2104551
macro avg            97.67      97.86   97.76  2104551
weighted avg         99.66      99.66   99.66  2104551
```
</details>
<details open>
<summary>Sentence Boundary Detection Report</summary>
```text
# With predicted punctuation (not aligned with targets)
label                  precision  recall  f1     support
NOSTOP (label_id: 0)   99.59      99.45   99.52  471608
FULLSTOP (label_id: 1) 91.47      93.53   92.49  29930
-------------------
micro avg              99.09      99.09   99.09  501538
macro avg              95.53      96.49   96.00  501538
weighted avg           99.10      99.09   99.10  501538

# With reference punctuation (punctuation matches targets)
label                  precision  recall  f1     support
NOSTOP (label_id: 0)   100.00     99.97   99.98  471608
FULLSTOP (label_id: 1) 99.63      99.93   99.78  32923
-------------------
micro avg              99.97      99.97   99.97  504531
macro avg              99.81      99.95   99.88  504531
weighted avg           99.97      99.97   99.97  504531
```
</details>
Fun Facts
Some fun facts are examined in this section.
Embeddings
Let's examine the embeddings (see the graph above) to see if the model used them in a meaningful way.
Here we show the cosine similarity between the embeddings of each token:
|  | NULL | ACRONYM | . | , | ? |
|---|---|---|---|---|---|
| NULL | 1.00 | | | | |
| ACRONYM | -0.49 | 1.00 | | | |
| . | -1.00 | 0.48 | 1.00 | | |
| , | 1.00 | -0.48 | -1.00 | 1.00 | |
| ? | -1.00 | 0.49 | 1.00 | -1.00 | 1.00 |
Recall that these embeddings are used to predict sentence boundaries... so we should expect the full stops to cluster.
Indeed, we see that NULL and "," are exactly the same, because neither has any implication on sentence boundaries.
Next, we see that "." and "?" are exactly the same, because with respect to SBD they are exactly the same: strong implications of a full stop.
(We might expect some difference between these tokens, though, since "." is also predicted after abbreviations, e.g., "Mr.", that are not full stops.)
Further, we see that "." and "?" are exactly the opposite of NULL.
This is expected, since these tokens typically imply sentence boundaries, whereas NULL and "," never do.
Lastly, we see that ACRONYM is similar to, but not exactly like, the full stops "." and "?",
and far from, but not the opposite of, NULL and ",".
Intuition suggests this is because acronyms can be full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
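The similarities in the table above are just pairwise cosine similarities over the five punctuation embeddings. The computation can be reproduced as follows, using stand-in 2-D vectors chosen to mimic the table's pattern (the model's actual learned embeddings are higher-dimensional and are not shown here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in embeddings: NULL and "," aligned; "." and "?" aligned and opposite
# to NULL; ACRONYM partway between the two clusters.
embeddings = {
    "NULL": [1.0, 0.0],
    ",": [1.0, 0.0],
    ".": [-1.0, 0.0],
    "?": [-1.0, 0.0],
    "ACRONYM": [-0.5, 0.87],
}
for a in embeddings:
    for b in embeddings:
        print(f"{a:>8} vs {b:>8}: {cosine_similarity(embeddings[a], embeddings[b]):+.2f}")
```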
1-800-BAD-CODE/punctuation_fullstop_truecase_english
Author: 1-800-BAD-CODE
Created: 2023-03-11 22:21:22+00:00
Updated: 2023-03-19 21:35:48+00:00