# AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
<p align="center">
  <img src="fig/main_fig.png" alt="main_fig" style="width: 1000px; max-width: 100%;" />
</p>
We're excited to introduce AceReason-Nemotron-1.1-7B, a math and code reasoning model built on the Qwen2.5-Math-7B base. The model is first trained with supervised fine-tuning (SFT) on math and code tasks, then further strengthened with reinforcement learning (RL) using the same recipe as AceReason-Nemotron-1.0-7B. We initiate RL training from several SFT models and find that a stronger SFT model continues to yield consistently better results after large-scale RL, although the performance gap narrows over the course of RL training. Thanks to its stronger SFT backbone, AceReason-Nemotron-1.1-7B significantly outperforms its predecessor and sets record-high performance among Qwen2.5-7B-based reasoning models on challenging math and code reasoning benchmarks. Please see our technical report for more details.
## Results
We evaluate our model against competitive reasoning models of comparable size on AIME 2024, AIME 2025, and LiveCodeBench (LCB) v5 (2024/08/01 - 2025/02/01) and v6 (2025/02/01 - 2025/05/01). For AceReason-Nemotron-1.0-7B, the RL training recipe improves its starting SFT model, DeepSeek-R1-Distill-Qwen-7B, by 13.5% on AIME24, 14.6% on AIME25, 14.2% on LCB v5, and 10.0% on LCB v6. In comparison, AceReason-Nemotron-1.1-7B, built on a stronger SFT model, benefits substantially from the same RL recipe as well, achieving absolute gains of 10.6% on AIME24, 16.4% on AIME25, 8.4% on LCB v5, and 8.3% on LCB v6.
| Model | AIME 2024<br>(avg@64) | AIME 2025<br>(avg@64) | LCB v5<br>(avg@8) | LCB v6<br>(avg@8) |
|---|---|---|---|---|
| <small>Skywork-OR1-7B</small> | 70.2 | 54.6 | 47.6 | 42.7 |
| <small>MiMo-7B-RL</small> | 68.2 | 55.4 | 57.8 | 49.3 |
| <small>o3-mini (low)</small> | 60.0 | 48.3 | 60.9 | - |
| <small>OpenMath-Nemotron-7B</small> | 74.8 | 61.2 | - | - |
| <small>OpenCodeReasoning-Nemotron-7B</small> | - | - | 51.3 | 46.1 |
| <small>Magistral Small (24B)</small> | 70.7 | 62.8 | 55.8 | 47.4 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 39.0 | 37.6 | 34.1 |
| AceReason-Nemotron-1.0-7B | 69.0 | 53.6 | 51.8 | 44.1 |
| Our SFT-7B (starting point of RL) | 62.0 | 48.4 | 48.8 | 43.8 |
| AceReason-Nemotron-1.1-7B 🤗 | 72.6 | 64.8 | 57.2 | 52.1 |
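The avg@k numbers in the table are the mean pass rate over k sampled generations per problem, averaged across the benchmark. A minimal sketch of that aggregation (the function name is illustrative, not taken from the official evaluation code):

```python
def avg_at_k(correct_samples):
    """Compute avg@k as a percentage.

    correct_samples: list with one entry per problem; each entry is a list of
    k booleans, one per sampled generation (True = correct answer).
    """
    # Per-problem pass rate over the k samples, then mean across problems.
    per_problem = [sum(samples) / len(samples) for samples in correct_samples]
    return 100.0 * sum(per_problem) / len(per_problem)

# Two problems, 4 samples each: pass rates 3/4 and 1/4 -> 50.0
print(avg_at_k([[True, True, True, False], [True, False, False, False]]))
```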
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'nvidia/AceReason-Nemotron-1.1-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Jen enters a lottery by picking $4$ distinct numbers from $S=\{1,2,3,\cdots,9,10\}.$ $4$ numbers are randomly chosen from $S.$ She wins a prize if at least two of her numbers were $2$ of the randomly chosen numbers, and wins the grand prize if all four of her numbers were the randomly chosen numbers. The probability of her winning the grand prize given that she won a prize is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
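Since the recommended math instruction asks the model to place its final answer inside `\boxed{}`, the answer can be pulled out of `response` afterwards. A small regex sketch (the helper name is illustrative; it handles one level of nested braces, which covers answers like `\boxed{\frac{1}{2}}`):

```python
import re

def extract_boxed_answer(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", response)
    return matches[-1] if matches else None

print(extract_boxed_answer("Thus the final answer is \\boxed{116}."))  # -> 116
```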
## Usage Recommendations
- We recommend using the following system prompt: "You are a helpful and harmless assistant. You should think step-by-step."
- We recommend the following instruction for math questions:
```python
math_question = "MATH_QUESTION"
math_instruction = "Please place your final answer inside \\boxed{}."
system_instruction = "You are a helpful and harmless assistant. You should think step-by-step."

final_prompt = "<|im_start|>system\n" + system_instruction + "<|im_end|>\n<|im_start|>user\n" + math_question + "\n\n" + math_instruction + "<|im_end|>\n<|im_start|>assistant\n\n"
```
- We recommend the following instruction for code questions:
```python
code_question = "CODE_QUESTION"
starter_code = "STARTER_CODE"  # starter code function header; set to empty string ("") if there is no starter code

code_instruction_nostartercode = """Write Python code to solve the problem. Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""
code_instruction_hasstartercode = """Please place the solution code in the following format:\n```python\n# Your solution code here\n```"""

if starter_code != "":
    code_question += "\n\n" + "Solve the problem starting with the provided function header.\n\nFunction header:\n" + "```\n" + starter_code + "\n```"
    code_question += "\n\n" + code_instruction_hasstartercode
else:
    code_question += "\n\n" + code_instruction_nostartercode

final_prompt = "<|im_start|>system\n" + system_instruction + "<|im_end|>\n<|im_start|>user\n" + code_question + "<|im_end|>\n<|im_start|>assistant\n\n"
```
- Our inference engine for evaluation is vLLM==0.7.3, with top-p=0.95, temperature=0.6, and max_tokens=32768.
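The math and code templates above can be folded into a single prompt builder. A minimal sketch (the function name is illustrative and not part of the official evaluation code; the template strings follow the recommendations above):

```python
SYSTEM = "You are a helpful and harmless assistant. You should think step-by-step."

def build_prompt(question, task="math", starter_code=""):
    """Assemble the recommended chat-format prompt for a math or code question."""
    if task == "math":
        user = question + "\n\nPlease place your final answer inside \\boxed{}."
    else:  # code question
        fmt = "Please place the solution code in the following format:\n```python\n# Your solution code here\n```"
        if starter_code:
            user = (question
                    + "\n\nSolve the problem starting with the provided function header.\n\n"
                    + "Function header:\n```\n" + starter_code + "\n```"
                    + "\n\n" + fmt)
        else:
            user = question + "\n\nWrite Python code to solve the problem. " + fmt
    return ("<|im_start|>system\n" + SYSTEM + "<|im_end|>\n"
            + "<|im_start|>user\n" + user + "<|im_end|>\n"
            + "<|im_start|>assistant\n\n")
```

For example, `build_prompt("Reverse a string.", task="code", starter_code="def solve(s):")` produces a prompt that embeds the function header and the code-format instruction.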
## Evaluation Toolkit
Please check the evaluation code and scripts at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/README_EVALUATION.md. For model inference, modify the prompts following the guidance in the Usage Recommendations section.
## Contact
Zihan Liu (zihanl@nvidia.com), Zhuolin Yang (zhuoliny@nvidia.com), Yang Chen (yachen@nvidia.com), Chankyu Lee (chankyul@nvidia.com), Wei Ping (wping@nvidia.com)
## License
Your use of this model is governed by the NVIDIA Open Model License.
## Release Date
June 16, 2025
## Citation
```bibtex
@article{liu2025acereason,
  title={AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy},
  author={Liu, Zihan and Yang, Zhuolin and Chen, Yang and Lee, Chankyu and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint arXiv:2506.13284},
  year={2025}
}
```