返回模型

说明文档

[!Note] 本仓库对应 Gemma 3n E2B IT (Instruct) 的发布版本，用于配合 Hugging Face transformers.js 使用，支持文本、音频和视觉（图像和视频）输入。

Gemma 3n 模型具有多项架构创新：

它们基于有效参数量提供两种尺寸。虽然该模型的原始参数量为 6B，但架构设计允许通过将低利用率矩阵从加速器卸载，使模型以相当于传统 2B 模型的内存占用运行。

它们采用 MatFormer 架构，允许在 E4B 模型中嵌套子模型。我们提供了一个子模型（即本模型仓库），您也可以使用 Mix-and-Match 方法访问一系列自定义尺寸的模型。

了解更多关于这些技术的信息，请参阅技术博客文章和 Gemma 文档。

Gemma 3n 模型卡片

模型页面: Gemma 3n

资源和技术文档:

使用条款: 条款
作者: Google DeepMind

模型信息

摘要描述以及输入和输出的简要定义。

描述

Gemma 是 Google 推出的轻量级、最先进开源模型系列，采用与创建 Gemini 模型相同的研究和技术构建。 Gemma 3n 模型专为在低资源设备上高效运行而设计。它们支持多模态输入，可处理文本、图像、视频和音频输入，并生成文本输出，提供预训练和指令微调版本的开源权重。这些模型使用超过 140 种口语的数据进行训练。

Gemma 3n 模型采用选择性参数激活技术来降低资源需求。该技术允许模型以 2B 和 4B 参数的有效大小运行，低于其包含的参数总数。有关 Gemma 3n 高效参数管理技术的更多信息，请参阅 Gemma 3n 页面。

输入和输出

输入:
- 文本字符串，例如问题、提示或需要摘要的文档
- 图像，标准化为 256x256、512x512 或 768x768 分辨率，并编码为每个 256 个 token
- 音频数据，从单声道编码为每秒 6.25 个 token
- 总输入上下文为 32K 个 token
输出:
- 针对输入生成的文本，例如问题的回答、图像内容分析或文档摘要
- 总输出长度最多 32K 个 token，需减去请求输入的 token

使用方法

下面是一些代码片段，帮助您快速开始运行模型。您可以复制与您的使用场景相关的部分。

Transformers.js

首先，安装 Transformers.js 库。 Gemma 3n 从 transformers.js 版本 3.6.0 开始支持。

npm i @huggingface/transformers

[!WARNING]
由于模型较大，我们目前仅支持 Node.js、Deno 和 Bun 运行。浏览器内 WebGPU 支持正在积极开发中，请关注后续更新！

示例: 为图像生成描述

import {
  AutoProcessor,
  AutoModelForImageTextToText,
  load_image,
  TextStreamer,
} from \"@huggingface/transformers\";

// 加载处理器和模型
const model_id = \"onnx-community/gemma-3n-E2B-it-ONNX\";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: \"q8\",
    audio_encoder: \"q8\",
    vision_encoder: \"fp16\",
    decoder_model_merged: \"q4\",
  },
  device: \"cpu\", // 注意：WebGPU 支持即将推出！
});

// 准备提示
const messages = [
  {
    role: \"user\",
    content: [
      { type: \"image\" },
      { type: \"text\", text: \"Describe this image in detail.\" },
    ],
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
});

// 准备输入
const url = \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg\";
const image = await load_image(url);
const audio = null;
const inputs = await processor(prompt, image, audio, {
  add_special_tokens: false,
});

// 生成输出
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
    // callback_function: (text) => { /* 对流式输出进行处理 */ },
  }),
});

// 解码输出
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);

<summary>查看示例输出</summary>

The image is a close-up, slightly macro shot of a cluster of vibrant pink cosmos flowers in full bloom. The flowers are the focal point, with their delicate, slightly ruffled petals radiating outwards. They have a soft, almost pastel pink hue, and their edges are subtly veined. 

A small, dark-colored bee is actively visiting one of the pink flowers, its body positioned near the center of the bloom. The bee appears to be collecting pollen or nectar. 

The flowers are attached to slender, brownish-green stems, and some of the surrounding foliage is visible in a blurred background, suggesting a natural outdoor setting. There are also hints of other flowers in the background, including some red ones, adding a touch of contrast to the pink. 

The lighting in the image seems to be natural daylight, casting soft shadows and highlighting the textures of the petals and the bee. The overall impression is one of delicate beauty and the gentle activity of nature.

</details>

示例: 转录音频

import {
  AutoProcessor,
  AutoModelForImageTextToText,
  TextStreamer,
} from \"@huggingface/transformers\";
import wavefile from \"wavefile\";

// 加载处理器和模型
const model_id = \"onnx-community/gemma-3n-E2B-it-ONNX\";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: \"q8\",
    audio_encoder: \"q4\",
    vision_encoder: \"fp16\",
    decoder_model_merged: \"q4\",
  },
  device: \"cpu\", // 注意：WebGPU 支持即将推出！
});

// 准备提示
const messages = [
  {
    role: \"user\",
    content: [
      { type: \"audio\" },
      { type: \"text\", text: \"Transcribe this audio verbatim.\" },
    ],
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
});

// 准备输入
const url = \"https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav\";
const buffer = Buffer.from(await fetch(url).then((x) => x.arrayBuffer()));
const wav = new wavefile.WaveFile(buffer);
wav.toBitDepth(\"32f\"); // 管道期望输入为 Float32Array
wav.toSampleRate(processor.feature_extractor.config.sampling_rate);
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  if (audioData.length > 1) {
    for (let i = 0; i < audioData[0].length; ++i) {
      audioData[0][i] = (Math.sqrt(2) * (audioData[0][i] + audioData[1][i])) / 2;
    }
  }
  audioData = audioData[0];
}

const image = null;
const audio = audioData;
const inputs = await processor(prompt, image, audio, {
  add_special_tokens: false,
});

// 生成输出
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
    // callback_function: (text) => { /* 对流式输出进行处理 */ },
  }),
});

// 解码输出
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);

<summary>查看示例输出</summary>

And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.

</details>

ONNXRuntime

import onnxruntime
import numpy as np
from transformers import AutoConfig, AutoProcessor
import os

# 1. 加载模型
## 加载配置和处理器
model_id = \"google/gemma-3n-E2B-it\"
processor = AutoProcessor.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

## 加载会话
model_dir          = \"/path/to/model/files/\"
embed_model_path   = os.path.join(model_dir, \"onnx/embed_tokens_quantized.onnx\")
audio_model_path   = os.path.join(model_dir, \"onnx/audio_encoder.onnx\")
vision_model_path  = os.path.join(model_dir, \"onnx/vision_encoder.onnx\")
decoder_model_path = os.path.join(model_dir, \"onnx/decoder_model_merged_q4.onnx\")
vision_session     = onnxruntime.InferenceSession(vision_model_path)
audio_session      = onnxruntime.InferenceSession(audio_model_path)
embed_session      = onnxruntime.InferenceSession(embed_model_path)
decoder_session    = onnxruntime.InferenceSession(decoder_model_path)

## 设置配置值
num_key_value_heads = config.text_config.num_key_value_heads
head_dim = config.text_config.head_dim
num_hidden_layers = config.text_config.num_hidden_layers
eos_token_id = 106 # != config.text_config.eos_token_id
image_token_id = config.image_token_id
audio_token_id = config.audio_token_id


# 2. 准备输入
## 创建输入消息
messages = [
    {
        \"role\": \"user\",
        \"content\": [
            {\"type\": \"text\", \"text\": \"In detail, describe the following audio and image.\"},
            {\"type\": \"audio\", \"audio\": \"https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav\"},
            {\"type\": \"image\", \"image\": \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg\"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors=\"pt\",
)
input_ids = inputs[\"input_ids\"].numpy()
attention_mask = inputs[\"attention_mask\"].numpy()
position_ids = np.cumsum(attention_mask, axis=-1) - 1

pixel_values = inputs[\"pixel_values\"].numpy() if \"pixel_values\" in inputs else None
input_features = inputs[\"input_features\"].numpy().astype(np.float32) if \"input_features\" in inputs else None
input_features_mask = inputs[\"input_features_mask\"].numpy() if \"input_features_mask\" in inputs else None

## 准备解码器输入
batch_size = input_ids.shape[0]
past_key_values = {
    f\"past_key_values.{layer}.{kv}\": np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in (\"key\", \"value\")
}

# 3. 生成循环
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
image_features = None
audio_features = None
for i in range(max_new_tokens):
    inputs_embeds, per_layer_inputs = embed_session.run(None, {\"input_ids\": input_ids})
    if image_features is None and pixel_values is not None:
        image_features = vision_session.run(
            [\"image_features\"],
            {
                \"pixel_values\": pixel_values,
            }
        )[0]
        mask = (input_ids == image_token_id).reshape(-1)
        flat_embeds = inputs_embeds.reshape(-1, inputs_embeds.shape[-1])
        flat_embeds[mask] = image_features.reshape(-1, image_features.shape[-1])
        inputs_embeds = flat_embeds.reshape(inputs_embeds.shape)

    if audio_features is None and input_features is not None and input_features_mask is not None:
        audio_features = audio_session.run(
            [\"audio_features\"],
            {
                \"input_features\": input_features,
                \"input_features_mask\": input_features_mask,
            }
        )[0]
        mask = (input_ids == audio_token_id).reshape(-1)
        flat_embeds = inputs_embeds.reshape(-1, inputs_embeds.shape[-1])
        flat_embeds[mask] = audio_features.reshape(-1, audio_features.shape[-1])
        inputs_embeds = flat_embeds.reshape(inputs_embeds.shape)

    logits, *present_key_values = decoder_session.run(None, dict(
        inputs_embeds=inputs_embeds,
        per_layer_inputs=per_layer_inputs,
        position_ids=position_ids,
        **past_key_values,
    ))

    ## 更新下一次生成循环的值
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.ones_like(input_ids)
    position_ids = position_ids[:, -1:] + 1
    for j, key in enumerate(past_key_values):
        past_key_values[key] = present_key_values[j]

    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if (input_ids == eos_token_id).all():
        break

    ## （可选）流式输出
    print(processor.decode(input_ids[0]), end=\"\", flush=True)
print()

# 4. 输出结果
print(processor.batch_decode(generated_tokens, skip_special_tokens=True)[0])

引用

@article{gemma_3n_2025,
    title={Gemma 3n},
    url={https://ai.google.dev/gemma/docs/gemma-3n},
    publisher={Google DeepMind},
    author={Gemma Team},
    year={2025}
}

模型数据

用于模型训练的数据及其处理方式。

训练数据集

这些模型在包含多种来源的数据集上进行训练，总计约 11 万亿个 token。训练数据的知识截止日期为 2024 年 6 月。以下是主要组成部分：

网页文档: 多样化的网页文本集合确保模型接触到广泛的语言风格、主题和词汇。训练数据集包含超过 140 种语言的内容。
代码: 让模型接触代码有助于其学习编程语言的语法和模式，从而提高生成代码和理解代码相关问题的能力。
数学: 在数学文本上训练有助于模型学习逻辑推理、符号表示和解决数学问题。
图像: 广泛的图像使模型能够执行图像分析和视觉数据提取任务。
音频: 多样化的声音样本使模型能够识别语音、从录音中转录文本以及识别音频数据中的信息。

这些多样化数据源的组合对于训练能够处理各种不同任务和数据格式的强大多模态模型至关重要。

数据预处理

以下是应用于训练数据的关键数据清洗和过滤方法：

CSAM 过滤: 在数据准备过程的多个阶段应用了严格的 CSAM（儿童性虐待材料）过滤，以确保排除有害和非法内容。
敏感数据过滤: 作为使 Gemma 预训练模型安全可靠的一部分，使用自动化技术从训练集中过滤掉某些个人信息和其他敏感数据。
其他方法: 基于我们的政策进行内容质量和安全性过滤。

实现信息

关于模型内部结构的详细信息。

硬件

Gemma 使用张量处理单元 (TPU) 硬件（TPUv4p、TPUv5p 和 TPUv5e）进行训练。训练生成模型需要大量计算能力。TPU 专为机器学习中常见的矩阵运算而设计，在此领域具有多项优势：

性能: TPU 专为处理训练生成模型所需的大量计算而设计。与 CPU 相比，它们可以显著加速训练。
内存: TPU 通常配备大容量高带宽内存，允许在训练期间处理大型模型和批量大小。这可以提高模型质量。
可扩展性: TPU Pod（大型 TPU 集群）为处理日益复杂的大型基础模型提供了可扩展的解决方案。您可以将训练分布在多个 TPU 设备上，以实现更快、更高效的处理。
成本效益: 在许多场景中，与基于 CPU 的基础设施相比，TPU 可以为训练大型模型提供更具成本效益的解决方案，特别是考虑到更快的训练所节省的时间和资源。

这些优势与 Google 可持续运营的承诺相一致。

软件

训练使用 JAX 和 ML Pathways 进行。 JAX 允许研究人员利用最新一代硬件（包括 TPU）更快、更高效地训练大型模型。ML Pathways 是 Google 构建能够在多个任务之间泛化的人工智能系统的最新成果。这特别适合基础模型，包括像这些大型语言模型。

JAX 和 ML Pathways 的使用方式如关于 Gemini 模型系列的论文所述： "Jax 和 Pathways 的'单控制器'编程模型允许单个 Python 进程协调整个训练运行，大大简化了开发工作流程。"

评估

模型评估指标和结果。

基准测试结果

这些模型在全精度 (float32) 下针对大量不同的数据集和指标进行评估，以覆盖内容生成的不同方面。标有 IT 的评估结果适用于指令微调模型。标有 PT 的评估结果适用于预训练模型。

推理和事实性

基准测试	指标	n-shot	E2B PT	E4B PT
HellaSwag	准确率	10-shot	72.2	78.6
BoolQ	准确率	0-shot	76.4	81.6
PIQA	准确率	0-shot	78.9	81.0
SocialIQA	准确率	0-shot	48.8	50.0
TriviaQA	准确率	5-shot	60.8	70.2
Natural Questions	准确率	5-shot	15.5	20.9
ARC-c	准确率	25-shot	51.7	61.6
ARC-e	准确率	0-shot	75.8	81.6
WinoGrande	准确率	5-shot	66.8	71.7
BIG-Bench Hard	准确率	few-shot	44.3	52.9
DROP	Token F1 分数	1-shot	53.9	60.8

多语言

基准测试	指标	n-shot	E2B IT	E4B IT
MGSM	准确率	0-shot	53.1	60.7
WMT24++ (ChrF)	字符级 F 分数	0-shot	42.7	50.1
Include	准确率	0-shot	38.6	57.2
MMLU (ProX)	准确率	0-shot	8.1	19.9
OpenAI MMLU	准确率	0-shot	22.3	35.6
Global-MMLU	准确率	0-shot	55.1	60.3
ECLeKTic	ECLeKTic 分数	0-shot	2.5	1.9

STEM 和代码

基准测试	指标	n-shot	E2B IT	E4B IT
GPQA Diamond	RelaxedAccuracy/准确率	0-shot	24.8	23.7
LiveCodeBench v5	pass@1	0-shot	18.6	25.7
Codegolf v2.2	pass@1	0-shot	11.0	16.8
AIME 2025	准确率	0-shot	6.7	11.6

其他基准测试

基准测试	指标	n-shot	E2B IT	E4B IT
MMLU	准确率	0-shot	60.1	64.9
MBPP	pass@1	3-shot	56.6	63.6
HumanEval	pass@1	0-shot	66.5	75.0
LiveCodeBench	pass@1	0-shot	13.2	13.2
HiddenMath	准确率	0-shot	27.7	37.7
Global-MMLU-Lite	准确率	0-shot	59.0	64.5
MMLU (Pro)	准确率	0-shot	40.5	50.6

伦理与安全

伦理和安全评估方法及结果。

评估方法

我们的评估方法包括结构化评估和相关内容政策的内部红队测试。红队测试由多个不同的团队进行，每个团队有不同的目标和人工评估指标。这些模型针对与伦理和安全相关的多个不同类别进行了评估，包括：

儿童安全: 评估涵盖儿童安全政策的文本到文本和图像到文本提示，包括儿童性虐待和剥削。
内容安全: 评估涵盖安全政策（包括骚扰、暴力和血腥以及仇恨言论）的文本到文本和图像到文本提示。
代表性伤害: 评估涵盖安全政策（包括偏见、刻板印象和有害关联或不准确）的文本到文本和图像到文本提示。

除了开发级别的评估外，我们还进行"保证评估"，这是我们用于责任治理决策的"独立" 内部评估。它们与模型开发团队分开进行，为发布决策提供信息。高层发现会反馈给模型团队，但提示集被保留以防止过拟合，并保持结果为决策提供信息的能力。显著的保证评估结果会作为发布审查的一部分报告给我们的责任与安全委员会。

评估结果

在所有安全测试领域，与之前的 Gemma 模型相比，我们在儿童安全、内容安全和代表性伤害类别中看到了安全水平的性能。所有测试均在没有安全过滤器的情况下进行，以评估模型的能力和行为。对于文本到文本、图像到文本和音频到文本，以及所有模型尺寸，模型产生的策略违规极少，并且在高严重性违规方面显示出比之前 Gemma 模型性能的显著改进。我们评估的一个局限性是它们主要包括英语提示。

使用和限制

这些模型有一些用户应该意识到的限制。

预期用途

开源生成模型在各个行业和领域有广泛的应用。以下潜在用途列表并不全面。此列表的目的是提供关于模型创建者在模型训练和开发过程中考虑的可能用例的背景信息。

内容创作和通信
- 文本生成: 生成创意文本格式，如诗歌、脚本、代码、营销文案和电子邮件草稿。
- 聊天机器人和对话式 AI: 为客户服务、虚拟助手或交互式应用程序提供对话界面支持。
- 文本摘要: 为文本语料库、研究论文或报告生成简洁摘要。
- 图像数据提取: 为文本通信提取、解释和总结视觉数据。
- 音频数据提取: 转录口语、将语音翻译成其他语言的文本以及分析基于声音的数据。
研究和教育
- 自然语言处理 (NLP) 和生成模型研究: 这些模型可以作为研究人员实验生成模型和 NLP 技术、开发算法并为该领域的进步做出贡献的基础。
- 语言学习工具: 支持交互式语言学习体验，帮助语法纠正或提供写作练习。
- 知识探索: 通过生成摘要或回答有关特定主题的问题，帮助研究人员探索大量数据。

限制

训练数据
- 训练数据的质量和多样性显著影响模型的能力。训练数据中的偏差或差距可能导致模型响应的限制。
- 训练数据集的范围决定了模型可以有效处理的主题领域。
上下文和任务复杂性
- 模型更擅长可以用清晰提示和指令来框架的任务。开放式或高度复杂的任务可能具有挑战性。
- 模型的性能可能受到提供的上下文量的影响（更长的上下文通常会导致更好的输出，在一定程度上）。
语言歧义和细微差别
- 自然语言本质上很复杂。模型可能难以理解微妙的细微差别、讽刺或比喻性语言。
事实准确性
- 模型根据从训练数据集中学到的信息生成响应，但它们不是知识库。它们可能生成不正确或过时的事实陈述。
常识
- 模型依赖语言中的统计模式。它们可能缺乏在某些情况下应用常识推理的能力。

伦理考量和风险

生成模型的开发引发了一些伦理问题。在创建开源模型时，我们仔细考虑了以下几点：

偏见和公平性
- 在大规模、真实世界文本和图像数据上训练的生成模型可能反映训练材料中嵌入的社会文化偏见。这些模型经过了仔细审查、本卡片中描述的输入数据预处理和报告的后评估。
虚假信息和滥用
- 生成模型可能被滥用于生成虚假、误导性或有害的文本。
- 提供了负责任使用模型的指南，请参阅负责任的生成式 AI 工具包。
透明度和问责制:
- 此模型卡片总结了有关模型架构、能力、限制和评估过程的详细信息。
- 负责任开发的开源模型提供了通过使生成模型技术可供 AI 生态系统中的开发人员和研究人员使用来分享创新的机会。

已识别的风险和缓解措施：

偏见的延续: 鼓励在模型训练、微调和其他用例中进行持续监控（使用评估指标、人工审查）和探索去偏技术。
有害内容的生成: 内容安全机制和指南至关重要。鼓励开发人员谨慎行事，并根据其特定的产品策略和应用用例实施适当的内容安全保障措施。
恶意目的的滥用: 技术限制以及开发人员和最终用户教育可以帮助缓解生成模型的恶意应用。为用户提供了用于标记滥用的教育资源和报告机制。Gemma 模型的禁止用途在 Gemma 禁止使用政策中概述。
隐私侵犯: 模型在经过过滤以删除某些个人信息和其他敏感数据的数据上进行训练。鼓励开发人员遵守隐私法规并使用隐私保护技术。

收益

在发布时，与类似尺寸的模型相比，此模型系列提供了从头开始为负责任的 AI 开发设计的高性能开源生成模型实现。

使用本文档中描述的基准评估指标，这些模型已显示出比其他类似尺寸的开源模型替代品更优越的性能。

bh4/ge2b

作者 bh4

image-text-to-text transformers.js

↓ 1 ♥ 0

创建时间: 2025-08-04 01:56:55+00:00

更新时间: 2025-08-04 15:03:14+00:00

在 Hugging Face 上查看

文件 (16)

.gitattributes

README.md

chat_template.jinja

config.json

configo.json

generation_config.json

onnx/audio_encoder_quantized.onnx ONNX

onnx/decoder_model_merged_quantized.onnx ONNX

onnx/embed_tokens_quantized.onnx ONNX

onnx/vision_encoder_quantized.onnx ONNX

preprocessor_config.json

processor_config.json

quant.py

special_tokens_map.json

tokenizer.json

tokenizer_config.json