# AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
<p align="center">
<img src="fig/main_fig.png" alt="main_fig" style="width: 1000px; max-width: 100%;" />
</p>
We're thrilled to introduce AceReason-Nemotron-1.1-7B, a math and code reasoning model built upon the Qwen2.5-Math-7B base. The model is first trained with supervised fine-tuning (SFT) on math and code tasks, then further enhanced through reinforcement learning (RL) using the same recipe as AceReason-Nemotron-1.0-7B. We initiate RL training from several SFT models and find that stronger SFT models continue to yield consistently better results after large-scale RL, although the performance gap narrows over the course of RL training. Thanks to its stronger SFT backbone, AceReason-Nemotron-1.1-7B significantly outperforms its predecessor and sets record-high performance among Qwen2.5-7B-based reasoning models on challenging math and code reasoning benchmarks. For more details, please check our technical report.
## Results
We evaluate our model against competitive reasoning models of comparable size on AIME 2024, AIME 2025, and LiveCodeBench (LCB) v5 (2024/08/01 - 2025/02/01) and v6 (2025/02/01 - 2025/05/01). For AceReason-Nemotron-1.0-7B, the RL training recipe improves its starting SFT model, DeepSeek-R1-Distill-Qwen-7B, by 13.5% on AIME24, 14.6% on AIME25, 14.2% on LCB v5, and 10.0% on LCB v6. In comparison, AceReason-Nemotron-1.1-7B, built on a stronger SFT model, benefits just as substantially from the same RL recipe, achieving absolute improvements of 10.6% on AIME24, 16.4% on AIME25, 8.4% on LCB v5, and 8.3% on LCB v6.
| 模型 | AIME 2024<br>(avg@64) | AIME 2025<br>(avg@64) | LCB v5<br>(avg@8) | LCB v6<br>(avg@8) |
|---|---|---|---|---|
| <small>Skywork-OR1-7B</small> | 70.2 | 54.6 | 47.6 | 42.7 |
| <small>MiMo-7B-RL</small> | 68.2 | 55.4 | 57.8 | 49.3 |
| <small>o3-mini (low)</small> | 60.0 | 48.3 | 60.9 | - |
| <small>OpenMath-Nemotron-7B</small> | 74.8 | 61.2 | - | - |
| <small>OpenCodeReasoning-Nemotron-7B</small> | - | - | 51.3 | 46.1 |
| <small>Magistral Small (24B)</small> | 70.7 | 62.8 | 55.8 | 47.4 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 39.0 | 37.6 | 34.1 |
| AceReason-Nemotron-1.0-7B | 69.0 | 53.6 | 51.8 | 44.1 |
| Our SFT-7B (starting point of RL) | 62.0 | 48.4 | 48.8 | 43.8 |
| AceReason-Nemotron-1.1-7B 🤗 | 72.6 | 64.8 | 57.2 | 52.1 |
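As a quick sanity check, the absolute RL gains quoted above can be recomputed directly from the table rows for our SFT-7B starting point and AceReason-Nemotron-1.1-7B:

```python
# Scores copied from the table above (AIME 2024, AIME 2025, LCB v5, LCB v6).
sft = {"AIME24": 62.0, "AIME25": 48.4, "LCBv5": 48.8, "LCBv6": 43.8}
rl = {"AIME24": 72.6, "AIME25": 64.8, "LCBv5": 57.2, "LCBv6": 52.1}

# Absolute improvement of AceReason-Nemotron-1.1-7B over its SFT starting point.
gains = {k: round(rl[k] - sft[k], 1) for k in sft}
print(gains)  # {'AIME24': 10.6, 'AIME25': 16.4, 'LCBv5': 8.4, 'LCBv6': 8.3}
```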
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'nvidia/AceReason-Nemotron-1.1-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Jen enters a lottery by picking $4$ distinct numbers from $S=\\{1,2,3,\\cdots,9,10\\}.$ $4$ numbers are randomly chosen from $S.$ She wins a prize if at least two of her numbers were $2$ of the randomly chosen numbers, and wins the grand prize if all four of her numbers were the randomly chosen numbers. The probability of her winning the grand prize given that she won a prize is $\\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.6,
    top_p=0.95
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
## Usage Recommendations
- We recommend using the following system prompt: "You are a helpful and harmless assistant. You should think step-by-step."
- We recommend the following instruction for math questions:

```python
math_question = "MATH_QUESTION"
math_instruction = "Please place your final answer inside \\boxed{}."
system_instruction = "You are a helpful and harmless assistant. You should think step-by-step."

final_prompt = "<|im_start|>system\n" + system_instruction + "<|im_end|>\n<|im_start|>user\n" + math_question + "\n\n" + math_instruction + "<|im_end|>\n<|im_start|>assistant\n<think>\n"
```
- We recommend the following instruction for code questions:

```python
code_question = "CODE_QUESTION"
starter_code = "STARTER_CODE"  # function header of the starter code; set to an empty string ("") if there is none

code_instruction_nostartercode = """Write Python code to solve the problem. Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
code_instruction_hasstartercode = """Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""

if starter_code != "":
    code_question += "\n\n" + "Solve the problem starting with the provided function header.\n\nFunction header:\n" + "```\n" + starter_code + "\n```"
    code_question += "\n\n" + code_instruction_hasstartercode
else:
    code_question += "\n\n" + code_instruction_nostartercode

final_prompt = "<|im_start|>system\n" + system_instruction + "<|im_end|>\n<|im_start|>user\n" + code_question + "<|im_end|>\n<|im_start|>assistant\n<think>\n"
```
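To sanity-check the branching logic above, the snippet below wraps it in a small helper and runs it end to end. The example question and function header are made-up placeholders, not items from any evaluation set:

```python
system_instruction = "You are a helpful and harmless assistant. You should think step-by-step."

def build_code_prompt(code_question: str, starter_code: str = "") -> str:
    """Assemble the recommended code prompt, mirroring the branching above."""
    fmt = "Please place the solution code in the following format:\n```python\n# Your solution code here\n```"
    if starter_code != "":
        code_question += ("\n\nSolve the problem starting with the provided function header."
                          "\n\nFunction header:\n```\n" + starter_code + "\n```"
                          "\n\n" + fmt)
    else:
        # Without a starter code, the instruction also asks explicitly for Python.
        code_question += "\n\nWrite Python code to solve the problem. " + fmt
    return ("<|im_start|>system\n" + system_instruction + "<|im_end|>\n"
            "<|im_start|>user\n" + code_question + "<|im_end|>\n"
            "<|im_start|>assistant\n")

# Made-up placeholder question and function header:
prompt = build_code_prompt("Return the sum of the even integers in a list.",
                           "def sum_even(nums):")
print("Function header:" in prompt)  # True
```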
- The inference engine we use for evaluation is vLLM==0.7.3, with top-p=0.95, temperature=0.6, max_tokens=32768.
## Evaluation Toolkit
Please check the evaluation code and scripts at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/README_EVALUATION.md. For model inference, modify the prompts following the guidance in the Usage Recommendations section.
## Contact
Zihan Liu (zihanl@nvidia.com), Zhuolin Yang (zhuoliny@nvidia.com), Yang Chen (yachen@nvidia.com), Chankyu Lee (chankyul@nvidia.com), Wei Ping (wping@nvidia.com)
## License

Your use of this model is governed by the NVIDIA Open Model License.
## Release Date

June 16, 2025
## Citation
```bibtex
@article{liu2025acereason,
  title={AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy},
  author={Liu, Zihan and Yang, Zhuolin and Chen, Yang and Lee, Chankyu and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint arXiv:2506.13284},
  year={2025}
}
```
Prince-1/AceReason-Nemotron-1.1-7B-Onnx
Author: Prince-1
Created: 2025-06-22 18:02:27+00:00
Updated: 2025-06-30 07:49:37+00:00