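Ettin: an Open Suite of Paired Encoders and Decoders

🎯 TL;DR: State-of-the-art paired encoder and decoder models (17M to 1B parameters), trained identically to allow a fair comparison. The encoders outperform ModernBERT; the decoders outperform Llama 3.2 and SmolLM2.

📄 Paper | 🚀 GitHub Repository

This model is part of the Ettin suite, the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures at multiple scales and delivers state-of-the-art performance among open-data models in each size category.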

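Performance highlights

Encoder tasks (vs. ModernBERT)

GLUE average: 88.9 vs. 88.4 (Base), 90.8 vs. 90.4 (Large)
MTEB v2 English retrieval: 45.7 vs. 43.9 (Base), 48.4 vs. 47.0 (Large)
Code search and long context: strong results on CodeSearchNet and MLDR

Decoder tasks (vs. SmolLM2 and Llama 3.2)

Average score: 46.2 vs. 45.2 (SmolLM2-135M)
1B model: 59.0 vs. 56.6 (Llama 3.2-1B)
Generative tasks: competitive at every model size

Key finding

Architecture-specific advantages persist: the 400M encoder outperforms the 1B decoder on classification tasks, while the 400M decoder outperforms the 1B encoder on generation tasks.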

Quick start

Installation

If you haven't already, install the Transformers.js JavaScript library from NPM:

npm i @huggingface/transformers

Usage

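import { pipeline } from "@huggingface/transformers";

// Load a fill-mask pipeline backed by the ONNX export of the 32M Ettin encoder.
const unmasker = await pipeline("fill-mask", "onnx-community/ettin-encoder-32m-ONNX");
// The input must contain the mask token ([MASK]) at the position to predict.
const result = await unmasker("The capital of France is [MASK].");
console.log(result);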
// [
//   { score: 0.5151872038841248, token: 7785, token_str: ' Paris', sequence: 'The capital of France is Paris.' },
//   { score: 0.033725105226039886, token: 42268, token_str: ' Lyon', sequence: 'The capital of France is Lyon.' },
//   { score: 0.031234024092555046, token: 23397, token_str: ' Nancy', sequence: 'The capital of France is Nancy.' },
//   { score: 0.02075139433145523, token: 30167, token_str: ' Brussels', sequence: 'The capital of France is Brussels.' },
//   { score: 0.018962178379297256, token: 31955, token_str: ' Geneva', sequence: 'The capital of France is Geneva.' }
// ]
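
The pipeline returns an array of candidate completions with scores. As a minimal sketch of post-processing that output, the snippet below picks the most likely completion and keeps only candidates above a score threshold. The helper names `topPrediction` and `confidentPredictions` and the 0.03 threshold are illustrative assumptions, not part of Transformers.js; the sample data is copied from the output above with scores rounded.

```javascript
// Sample output copied from the fill-mask call above (scores rounded to four decimals).
const sampleResult = [
  { score: 0.5152, token: 7785, token_str: " Paris", sequence: "The capital of France is Paris." },
  { score: 0.0337, token: 42268, token_str: " Lyon", sequence: "The capital of France is Lyon." },
  { score: 0.0312, token: 23397, token_str: " Nancy", sequence: "The capital of France is Nancy." },
  { score: 0.0208, token: 30167, token_str: " Brussels", sequence: "The capital of France is Brussels." },
  { score: 0.019, token: 31955, token_str: " Geneva", sequence: "The capital of France is Geneva." },
];

// Hypothetical helper: return the highest-scoring candidate without assuming the input is sorted.
function topPrediction(candidates) {
  return candidates.reduce((best, c) => (c.score > best.score ? c : best));
}

// Hypothetical helper: keep candidates at or above minScore and trim the leading space
// that the tokenizer attaches to token_str.
function confidentPredictions(candidates, minScore) {
  return candidates
    .filter((c) => c.score >= minScore)
    .map((c) => ({ word: c.token_str.trim(), score: c.score }));
}

console.log(topPrediction(sampleResult).sequence);
console.log(confidentPredictions(sampleResult, 0.03).map((p) => p.word));
```

A threshold like this is a judgment call; for a 32M model the probability mass is often spread over many plausible tokens, so it may be worth inspecting the full list before filtering.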

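Model description

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike earlier comparisons, which were confounded by differences in training data, architecture, and recipe, Ettin models use:

Identical training data: every model is trained on the same high-quality mixture
Open training data: the data is publicly available, including batch-level training data for each of the 250+ checkpoints
Matched architectures: the models differ only in attention pattern (bidirectional vs. causal) and training objective (MLM vs. CLM)
Consistent training recipe: three-phase training on 2T tokens
Multiple scales: from 17M to 1B parameters

This approach allows true like-for-like comparisons between encoder and decoder models and reveals the inherent strengths of each architecture.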
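
To make the attention-pattern difference concrete, here is a small sketch that builds the two mask shapes for a short sequence. It is an illustration only: the function names are hypothetical, and it ignores padding and any local or sliding-window attention the real models may apply.

```javascript
// Build an n x n attention mask: mask[i][j] is true when position i may attend to position j.
// Bidirectional (encoder): every position sees every other position.
// Causal (decoder): position i sees only positions j <= i.
function attentionMask(n, causal) {
  return Array.from({ length: n }, (_, i) =>
    Array.from({ length: n }, (_, j) => (causal ? j <= i : true))
  );
}

// Render a mask as rows of 1s and 0s for quick inspection.
function render(mask) {
  return mask.map((row) => row.map((v) => (v ? 1 : 0)).join(" ")).join("\n");
}

console.log("bidirectional (encoder):\n" + render(attentionMask(4, false)));
console.log("causal (decoder):\n" + render(attentionMask(4, true)));
```

The causal mask is lower-triangular, which is what lets a decoder generate left to right; the full mask is what lets an encoder use context on both sides of a token.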

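Training data

The training data is publicly available and split by phase:

Pre-training data: jhu-clsp/ettin-pretraining-data, a diverse mixture of 1.7T tokens
Mid-training/extension data: jhu-clsp/ettin-extension-data, 250B tokens of higher-quality filtered data
Decay-phase data: jhu-clsp/ettin-decay-data, 100B tokens from premium data sources
Training data order: jhu-clsp/ettin-data-order, the batch-level training order (columns: input_ids, step)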
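
As a quick sanity check on the phase sizes listed above, the following sketch sums the three budgets and reports each phase's share of the total. The variable names are illustrative; the token counts are the ones stated in the list.

```javascript
// Token budgets per phase, as listed above.
const phases = [
  { name: "pre-training", tokens: 1.7e12 },
  { name: "mid-training", tokens: 250e9 },
  { name: "decay", tokens: 100e9 },
];

const totalTokens = phases.reduce((sum, p) => sum + p.tokens, 0);

// Share of the total budget for each phase, as a percentage with one decimal place.
const shares = phases.map((p) => ({
  name: p.name,
  percent: ((100 * p.tokens) / totalTokens).toFixed(1),
}));

console.log(totalTokens / 1e12); // 2.05 (trillion tokens)
console.log(shares);
```

The total of 2.05T tokens is consistent with the "2T" figure quoted elsewhere in this card.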

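Model family

Encoder models

Size	Model	Parameters	Best for
XXS	ettin-encoder-17m	17M	Mobile/edge devices
XS	ettin-encoder-32m	32M	Fast inference
Small	ettin-encoder-68m	68M	Balanced performance
Base	ettin-encoder-150m	150M	Standard use cases
Large	ettin-encoder-400m	400M	High accuracy needs
XL	ettin-encoder-1b	1B	Best performance

Decoder models

Size	Model	Parameters	Best for
XXS	ettin-decoder-17m	17M	Lightweight generation
XS	ettin-decoder-32m	32M	Quick prototyping
Small	ettin-decoder-68m	68M	Efficient generation
Base	ettin-decoder-150m	150M	Standard generation
Large	ettin-decoder-400m	400M	High-quality generation
XL	ettin-decoder-1b	1B	Best generation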
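
When choosing a size under a memory or latency constraint, a lookup over the tables above can help. The sketch below returns the largest model name of a given kind that fits a parameter budget. The helper `largestModelWithin` is hypothetical, the names are the bare model names from the tables (without any organization prefix), and "1B" is treated as a nominal 1000M.

```javascript
// Sizes from the tables above, in ascending order (nominal parameter counts in millions).
const SIZES = [
  { label: "XXS", suffix: "17m", params: 17 },
  { label: "XS", suffix: "32m", params: 32 },
  { label: "Small", suffix: "68m", params: 68 },
  { label: "Base", suffix: "150m", params: 150 },
  { label: "Large", suffix: "400m", params: 400 },
  { label: "XL", suffix: "1b", params: 1000 },
];

// Hypothetical helper: name of the largest model of the given kind within the budget,
// or null when even the smallest model exceeds it.
function largestModelWithin(kind, maxParamsMillions) {
  if (kind !== "encoder" && kind !== "decoder") {
    throw new Error(`unknown kind: ${kind}`);
  }
  const fitting = SIZES.filter((s) => s.params <= maxParamsMillions);
  if (fitting.length === 0) return null;
  return `ettin-${kind}-${fitting[fitting.length - 1].suffix}`;
}

console.log(largestModelWithin("encoder", 100)); // ettin-encoder-68m
console.log(largestModelWithin("decoder", 1000)); // ettin-decoder-1b
```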

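Cross-objective models

These models show what happens when you continue training an encoder as a decoder, or vice versa. Important: load these models with their converted architecture, not their original one.

Encoders trained from decoders (decoder → MLM)

Load as encoders using AutoModel or AutoModelForMaskedLM:

Size	Model	Parameters	Description
XXS	ettin-encoder-from-decoder-17m	17M	Decoder → MLM continued training
XS	ettin-encoder-from-decoder-32m	32M	Decoder → MLM continued training
Small	ettin-encoder-from-decoder-68m	68M	Decoder → MLM continued training
Base	ettin-encoder-from-decoder-150m	150M	Decoder → MLM continued training
Large	ettin-encoder-from-decoder-400m	400M	Decoder → MLM continued training
XL	ettin-encoder-from-decoder-1b	1B	Decoder → MLM continued training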

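🔬 Research applications

What makes Ettin unique

Ettin provides the first controlled comparison of encoder and decoder architectures:

Identical training data: all models use the same 2T-token mixture
Matched architectures: only the attention pattern and the objective differ
Fully open: training data, model weights, and batch-level training order
Multiple scales: fair comparisons from 17M to 1B parameters
250+ checkpoints: analysis of the complete training trajectory

Use cases for researchers

Architecture studies: compare encoder and decoder capabilities fairly
Training dynamics: analyze 250+ checkpoints with batch-level data ordering
Scaling laws: study how architectural advantages change with scale
Transfer learning: investigate the effectiveness of cross-objective training
Replication studies: the first open replication of the ModernBERT training recipe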
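
The two objectives can be illustrated on a toy sequence. This is a simplified sketch, not the Ettin training code: the function names are hypothetical, word-level strings stand in for token IDs, and the masked positions are passed in explicitly, whereas real MLM training samples them at random.

```javascript
// Masked language modeling (encoder objective): hide chosen positions and predict only those.
function mlmExample(tokens, maskedPositions, maskToken = "[MASK]") {
  const hidden = new Set(maskedPositions);
  return {
    inputs: tokens.map((t, i) => (hidden.has(i) ? maskToken : t)),
    // null marks positions that contribute no loss.
    labels: tokens.map((t, i) => (hidden.has(i) ? t : null)),
  };
}

// Causal language modeling (decoder objective): predict each next token from its prefix,
// so the labels are the inputs shifted left by one position.
function clmExample(tokens) {
  return {
    inputs: tokens.slice(0, -1),
    labels: tokens.slice(1),
  };
}

const tokens = ["The", "capital", "of", "France", "is", "Paris"];
console.log(mlmExample(tokens, [5]));
console.log(clmExample(tokens));
```

In the MLM case the loss comes only from the masked positions, while in the CLM case every position after the first supplies a training signal.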

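Reproducibility

All training artifacts are publicly available:

Training data with exact batch ordering
Model checkpoints every 8.5B tokens
Complete hyperparameter configurations
Training code and evaluation scripts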
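
When analyzing checkpoints, it helps to know which training phase a given checkpoint belongs to. The sketch below maps a checkpoint index to tokens seen and to a phase. It rests on two assumptions that this card does not confirm: that checkpoints are evenly spaced at 8.5B tokens from the start of training, and that the three phases listed under "Training data" run back to back in that order. The helper names are hypothetical.

```javascript
const CHECKPOINT_INTERVAL = 8.5e9; // tokens between checkpoints, as stated above

// Phase budgets from the training-data section, assumed to run back to back in this order.
const PHASES = [
  { name: "pre-training", tokens: 1.7e12 },
  { name: "mid-training", tokens: 250e9 },
  { name: "decay", tokens: 100e9 },
];

// Hypothetical helper: tokens seen at the k-th checkpoint (1-based), assuming even spacing.
function tokensAtCheckpoint(k) {
  return k * CHECKPOINT_INTERVAL;
}

// Hypothetical helper: the phase in progress once `tokens` tokens have been consumed,
// or null when the count exceeds the total budget.
function phaseAt(tokens) {
  let boundary = 0;
  for (const phase of PHASES) {
    boundary += phase.tokens;
    if (tokens <= boundary) return phase.name;
  }
  return null;
}

console.log(phaseAt(tokensAtCheckpoint(100))); // pre-training (850B tokens seen)
console.log(phaseAt(tokensAtCheckpoint(210))); // mid-training (1.785T tokens seen)
```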

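Training details

Data: a high-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources, totaling 2T+ tokens

Architecture: Transformer with RoPE, GLU activations, and prenorm layers

Training phases:

Pre-training: 1.7T tokens of a diverse data mixture
Mid-training: 250B tokens of higher-quality filtered data, with context extended to 8K
Decay phase: 100B tokens from premium data sources

Key features:

Context length: up to 8K tokens
Vocabulary: 50,368 tokens (ModernBERT tokenizer)
Deep but efficient architectures following MobileLLM principles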

Model architecture

Parameter	17M	32M	68M	150M	400M	1B
Layers	7	10	19	22	28	28
Hidden size	256	384	512	768	1024	1792
Intermediate size	384	576	768	1152	2624	3840
Attention heads	4	6	8	12	16	28
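
The table can be cross-checked with a little arithmetic. The sketch below derives the per-head dimension for each size and a rough parameter estimate. The estimate is an approximation under assumptions not stated in this card: a single tied embedding matrix, four hidden x hidden attention projections per layer, three hidden x intermediate matrices for the GLU feed-forward block, and no biases or normalization weights. The function names are illustrative.

```javascript
const VOCAB_SIZE = 50368; // ModernBERT tokenizer vocabulary, as stated above

// Columns of the architecture table; nominalParams is the advertised size.
const CONFIGS = [
  { name: "17M", nominalParams: 17e6, layers: 7, hidden: 256, intermediate: 384, heads: 4 },
  { name: "32M", nominalParams: 32e6, layers: 10, hidden: 384, intermediate: 576, heads: 6 },
  { name: "68M", nominalParams: 68e6, layers: 19, hidden: 512, intermediate: 768, heads: 8 },
  { name: "150M", nominalParams: 150e6, layers: 22, hidden: 768, intermediate: 1152, heads: 12 },
  { name: "400M", nominalParams: 400e6, layers: 28, hidden: 1024, intermediate: 2624, heads: 16 },
  { name: "1B", nominalParams: 1e9, layers: 28, hidden: 1792, intermediate: 3840, heads: 28 },
];

// Per-head dimension: the hidden size split evenly across the attention heads.
function headDim(cfg) {
  return cfg.hidden / cfg.heads;
}

// Rough parameter estimate under the assumptions in the lead-in:
// tied embeddings + layers * (Q, K, V, output projections + three GLU feed-forward matrices).
function estimateParams(cfg) {
  const embedding = VOCAB_SIZE * cfg.hidden;
  const attention = 4 * cfg.hidden * cfg.hidden;
  const feedForward = 3 * cfg.hidden * cfg.intermediate;
  return embedding + cfg.layers * (attention + feedForward);
}

for (const cfg of CONFIGS) {
  const millions = (estimateParams(cfg) / 1e6).toFixed(1);
  console.log(`${cfg.name}: head dim ${headDim(cfg)}, ~${millions}M parameters (estimate)`);
}
```

Every column yields a head dimension of 64, and under these assumptions the estimates land within a few percent of the nominal sizes. For the smallest model the embedding matrix accounts for most of the estimated parameters, which is worth keeping in mind when comparing sizes at the low end.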

Citation

If you use Ettin models in your research, please cite our work:

@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders}, 
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412}, 
}

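License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact: for questions about the models or research, please open an issue or contact the authors.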

onnx-community/ettin-encoder-32m-ONNX

Author: onnx-community

fill-mask transformers

Created: 2025-12-06 18:31:59+00:00

Updated: 2025-12-07 20:35:52+00:00
