Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or by running RL with a 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Will they continue to scale as the test-time budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective lets us view the long output stream from the LLM as consisting of several episodes run at test time and leads us to a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms best trade off exploration and exploitation over training, minimizing cumulative regret also provides the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with outcome-reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
Example (easy problem): 2+2=?

<think>
Okay, so I need to figure out what 2 plus 2 equals.
Let me count them out. One, two, and then three, four...
</think>
Step-by-Step Explanation:
Answer: 4

Example (hard problem): Show that the inequality ... holds for all real numbers ...

<think>
Okay, so I need to show that for any real numbers ...
Alternatively, perhaps there's a way to pair terms or use symmetry?
Alternatively, could we relate this inequality to some function property?
...
Time is up
</think>
Step-by-Step Explanation:

(Responses generated by DeepSeek-R1; traces abridged.)
Desideratum 1: Don’t spend too many tokens on easy questions!
On easy problems, we expect the model to quickly arrive at the correct answer without spending too many tokens. However, recent models such as DeepSeek-R1, when asked a simple question like '2+2=?', eventually arrive at the correct answer, but only after an unnecessarily involved process: they first perform the direct calculation, then second-guess themselves and re-evaluate the problem from multiple angles, as in the trace above.
Desideratum 2: Keep making progress on hard questions!
On hard problems, even when the given token budget is not sufficient to solve the problem, we expect the model to make progress towards the correct answer. However, recent models such as DeepSeek-R1, when asked a hard question like the inequality above, frequently switch logical direction, stringing together phrases like "Alternatively, ..." rather than committing to and deepening a single line of attack.
In this study, we wanted to understand how effectively state-of-the-art models use their "thinking time" when solving complex problems. We designed an experiment comparing three distinct approaches.
We tested these approaches on challenging mathematics problems from AIME 2024 and the OmniMATH dataset, measuring performance at different reasoning depths to see which method produced the best results.
We've reformulated the optimization of test-time compute through the lens of meta-reinforcement learning. By viewing an LLM's output as a sequence of meaningful episodes (attempts, verifications, etc.), we can optimize how the model uses test-time compute in a budget-agnostic way.
Budget-Agnostic
The model learns to be effective regardless of the specific token budget it's given at deployment time. Rather than being optimized for a single fixed token limit, our approach enables LLMs to adapt their reasoning strategy based on available compute, making steady progress with each episode. This flexibility allows the same model to perform well across different compute constraints without needing separate training for each potential budget.
We propose to use cumulative regret to measure how effectively a model makes progress toward solving problems.
Cumulative Regret: Our Key Metric
Given k episodes z generated from π(·|x), another LLM μ that computes an estimate of the correct response given episodes so far, and the optimal comparator policy given a j-episode budget as π*j, we define cumulative regret parameterized by μ as:
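A reconstruction of this definition, consistent with the term descriptions below, is given next; this is our rendering, and the paper's exact form (e.g., any normalization by k) may differ:

$$\Delta^{\mu}_{k}(x;\pi) \;=\; \sum_{j=0}^{k-1}\Big[\, J_r\big(\pi^{*}_{j}\big) \;-\; \mathbb{E}_{z_{0:j-1}\sim\pi(\cdot\mid x)}\, J_r\big(\mu(\cdot\mid x,\, z_{0:j-1})\big)\Big]$$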
Here Jr denotes the expected 0/1 outcome reward attained by LLM μ when conditioning on prior episodes z0:j-1 produced by π, and Jr(π*) denotes the reward attained by the best possible budget-agnostic comparator π* within a j-episode test-time budget.
While cumulative regret is an ideal metric for our objective, we cannot optimize it directly during training. The fundamental challenge is that computing cumulative regret requires access to an optimal comparator policy (π*) that achieves the highest possible reward within each episode budget. However, this optimal policy is unknown and unavailable—if we had access to it, we wouldn't need to train our model in the first place.
To overcome this limitation, we introduce progress as a practical, measurable surrogate:
Progress: A Practical Alternative
Given prior context c and episode zj ~ π(·|c), and another meta-prover LLM μ that computes an estimate of the correct response, we define progress made by zj as:
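Consistent with the abstract's description of progress as the change in the likelihood of eventual success, this can be rendered as (our notation; the paper's exact form may differ):

$$r^{\mu}_{\mathrm{prg}}(z_j;\, c) \;=\; J_r\big(\mu(\cdot\mid c,\, z_j)\big) \;-\; J_r\big(\mu(\cdot\mid c)\big)$$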
Progress measures how much each additional episode contributes to improving the model's performance. Intuitively, by maximizing progress across episodes, we're encouraging the model to continually improve its understanding and approach to the problem—which naturally minimizes cumulative regret over time.
Our MRT approach incorporates progress as a dense reward bonus during training. The standard fine-tuning loss based on the expected final reward is:
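In standard notation (our rendering: x is a prompt from the training distribution D, y a full sampled response, and r the 0/1 outcome reward):

$$\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_{\theta}(\cdot\mid x)}\big[\, r(x, y)\,\big]$$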
We extend this to incorporate progress, giving rise to the MRT training objective:
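A sketch of the extended objective, assuming a weighting coefficient α on the per-episode progress bonus (again our rendering; the paper's exact form may differ):

$$\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; z_{0:k}\sim\pi_{\theta}(\cdot\mid x)}\Big[\, r(x,\, z_{0:k}) \;+\; \alpha \sum_{j=1}^{k} r^{\mu}_{\mathrm{prg}}\big(z_j;\, x,\, z_{0:j-1}\big)\Big]$$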
Unlike traditional reinforcement learning, which only rewards final outcomes, MRT also rewards the progress made by each intermediate episode, providing a dense learning signal throughout the output stream.
By training models to minimize cumulative regret through our progress-based reward mechanism, we create LLMs that make efficient use of their test-time compute resources across both simple and complex problems.
We implement the meta-reinforcement learning paradigm on top of online reinforcement learning methods such as GRPO. Here's how it works:
Meta-Prover Policy
We define a meta-prover policy μ that evaluates how much each episode contributes to solving the problem. Given the prompt and the episodes produced so far, μ computes an estimate of the correct response, and its expected 0/1 reward serves as an estimate of the current likelihood of success.
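As a concrete illustration, here is a minimal sketch of how such a success estimate could be computed, assuming μ is queried by appending the episodes produced so far, closing the think block to force an immediate answer, and sampling a few best-guess solutions. The helpers `generate_answer` and `is_correct` are hypothetical placeholders, not part of any released codebase.

```python
# Minimal sketch (not the authors' implementation): estimating the meta-prover's
# success probability J_r given the episodes generated so far.

from typing import Callable, List

def estimate_success(
    prompt: str,
    episodes: List[str],
    generate_answer: Callable[[str], str],  # hypothetical: sample one answer given a context
    is_correct: Callable[[str], bool],      # hypothetical: grade an answer against ground truth
    num_samples: int = 4,
) -> float:
    """Monte Carlo estimate of J_r(mu(. | prompt, episodes))."""
    # Assumption: appending "</think>" forces the model to stop thinking and answer now.
    context = prompt + "".join(episodes) + "\n</think>\n"
    hits = [is_correct(generate_answer(context)) for _ in range(num_samples)]
    return sum(hits) / num_samples
```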
For each episode in the reasoning process, we estimate the progress it makes, i.e., the change in μ's estimated likelihood of success before and after that episode.
During training, we optimize the MRT objective introduced above, which combines the standard outcome reward with our progress-based dense reward bonus.
While this procedure can be implemented with episode-specific reward bonuses or a single progress-adjusted reward, we opt for the latter approach due to its plug-and-play nature in current outcome-reward RL implementations.
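Below is a minimal sketch (not the authors' code) of what a single progress-adjusted reward could look like, using the success estimate sketched earlier bound to a fixed prompt (e.g., via functools.partial) and an assumed weight `alpha` on the dense bonus; the returned scalar can then replace the plain 0/1 reward in an outcome-reward RL loop such as GRPO.

```python
# Minimal sketch (assumptions flagged inline): collapsing per-episode progress
# into a single progress-adjusted reward for one rollout.

from typing import Callable, List

def progress_adjusted_reward(
    episodes: List[str],                              # rollout split into episodes z_1, ..., z_k
    outcome_reward: float,                            # 0/1 reward of the final answer
    estimate_success: Callable[[List[str]], float],   # J_r of the meta-prover given a prefix of episodes
    alpha: float = 0.2,                               # assumed weight on the progress bonus
) -> float:
    """Return outcome reward + alpha * sum_j (J_r after episode j - J_r before episode j)."""
    bonus = 0.0
    prev = estimate_success([])                       # success estimate before any episode
    for j in range(len(episodes)):
        cur = estimate_success(episodes[: j + 1])
        bonus += cur - prev                           # progress made by episode j+1
        prev = cur
    return outcome_reward + alpha * bonus
```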
| Model | AIME 2024 | AIME 2025 | AMC 2023 | Minerva MATH | MATH500 | Avg. |
|---|---|---|---|---|---|---|
| DeepScaleR-1.5B-Preview | 42.8 | 36.7 | 83.0 | 24.6 | 85.2 | 54.5 |
| + outcome-reward RL | 44.5 (+1.7) | 39.3 (+2.6) | 81.5 (-1.5) | 24.7 | 84.9 | 55.0 (+0.5) |
| + length penalty | 40.3 (-2.5) | 30.3 (-6.4) | 77.3 (-5.7) | 23.0 | 83.2 | 50.8 (-3.7) |
| + MRT (Ours) | 47.2 (+4.4) | 39.7 (+3.0) | 83.1 (+0.1) | 24.2 | 85.1 | 55.9 (+1.4) |
| R1-Distill-Qwen-1.5B | 28.7 | 26.0 | 69.9 | 19.8 | 80.1 | 44.9 |
| + outcome-reward RL | 29.8 (+1.1) | 27.3 (+1.3) | 70.5 (+0.6) | 22.1 | 80.3 | 46.0 (+1.1) |
| + MRT (Ours) | 30.3 (+1.6) | 29.3 (+3.3) | 72.9 (+3.0) | 22.5 | 80.3 | 47.1 (+2.2) |
Key Findings: MRT attains the highest average accuracy for both base models, a roughly 2-3x larger relative gain over the base model than outcome-reward RL, while adding an explicit length penalty degrades accuracy.
Beyond improving accuracy, Meta Reinforcement Fine-tuning (MRT) significantly enhances token efficiency when solving complex problems. We evaluated this by comparing models trained with MRT against baseline models using standard outcome-reward RL.
Our evaluation methodology involved training the model with a 16K-token output budget and computing maj@k over multiple reasoning and solution traces. By plotting the tradeoff between maj@k and the total number of tokens used across the k generations, we obtain a robust estimate of the model's performance per token generated during reasoning.
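For concreteness, here is a minimal sketch of the two quantities on this tradeoff curve (our assumed setup, not the authors' evaluation code):

```python
# Minimal sketch: maj@k accuracy and total tokens spent across the first k samples,
# which together give one point on the efficiency tradeoff curve.

from collections import Counter
from typing import List

def maj_at_k(answers: List[str], correct: str, k: int) -> float:
    """1.0 if the majority vote over the first k sampled answers matches the reference."""
    majority, _ = Counter(answers[:k]).most_common(1)[0]
    return float(majority == correct)

def tokens_at_k(token_counts: List[int], k: int) -> int:
    """Total tokens generated across the first k samples."""
    return sum(token_counts[:k])

# Averaging maj_at_k over problems and plotting it against the average tokens_at_k
# for k = 1, 2, ..., 10 reproduces a curve like the one in the figure below.
```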
MRT results on DeepSeek-R1-Distill-Qwen-1.5B. We plot maj@k performance of models for k = 1, 2, ..., 10 on AIME 2024 (left) and MATH500 (right). The orange lines correspond to MRT and the green lines correspond to outcome-reward training.
Key Findings: At matched token budgets, MRT reaches higher maj@k accuracy than outcome-reward RL on AIME 2024 and MATH500, amounting to roughly a 1.5x gain in token efficiency.
We measure regret against a theoretical optimal policy π* that achieves perfect accuracy in a single episode. To compare different fine-tuning algorithms fairly, we reparameterize regret as a function of token budget C0 rather than the number of episodes.
For any given token budget C0, we measure regret by computing:
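One plausible rendering, assuming π* attains reward 1 (perfect accuracy in a single episode, as stated above) and a grid of intermediate budgets c_1 < ... < c_m = C0 at which the model is forced to produce its current best answer (this is our reconstruction, not the paper's exact formula):

$$\Delta_{C_0}(x;\pi)\;\approx\;\frac{1}{m}\sum_{i=1}^{m}\Big[\,1 \;-\; J_r\big(\mu(\cdot\mid x,\, z_{:c_i})\big)\Big]$$

where z_{:c_i} denotes the model's output truncated to the first c_i tokens.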
Normalized regret of different algorithms at different deployment @token budgets C0. The first four points are at budgets 4096, 8192, 12288, and 16384. The next four points in dashed lines are extrapolations to C0 = 20480, 24576, 28672, and 32768, which correspond to 2, 4, 6, and 8 extensions of the output trace, following the budget forcing technique in s1. We conduct this study on AIME 2025. Observe that MRT leads to the smallest normalized regret, both when evaluating within the maximal budget and when extrapolating to larger budgets.
Key Findings: MRT attains the smallest normalized regret of the compared fine-tuning methods, both when evaluated within the maximal training budget and when extrapolated to larger budgets via budget forcing.
We studied the relationship between progress (the dense reward signal used in MRT) and response length, which has been considered a crucial factor in recent advances from DeepSeek and others. We investigated two key questions:
Question 1: How does length evolve during training with MRT and outcome-reward RL over a fixed prompt distribution?
Evolution of length during RL training. Length largely oscillates around similar values for most of training, after an initial increase from the initialization length.
Key Findings: For both MRT and outcome-reward RL, response length largely oscillates around similar values for most of training after an initial increase from the initialization length.
Question 2: Can the benefits of increasing the output token budget be explained by enhanced progress?
Despite the supposed gains from running RL training with a large output budget from the start, several studies have found that training at higher budgets results in inefficient use of compute. Concurrent work finds that a more effective approach is to start RL training with a smaller output token budget and progressively increase it over the course of training (e.g., DeepScaleR's 8K-to-16K schedule).
This raises the question: What benefits does such a "curriculum" over output token budget provide?
Left: DeepScaleR's average response length and training rewards as training progresses. Right: Regret for the 8K and 16K DeepScaleR checkpoints at different budgets. For budgets beyond 8192, we calculate the normalized regret of the 8K checkpoint by extrapolating it with budget forcing. At nearly all budgets, the 8K checkpoint shows lower normalized regret, indicating better progress.
Our analysis of this token-budget curriculum reveals the following.
Key Findings: At nearly all budgets, the 8K checkpoint attains lower normalized regret than the 16K checkpoint, suggesting that the benefits of increasing the output token budget over training are largely explained by enhanced progress.
@misc{qu2025optimizing,
  title={Optimizing Test-Time Compute via Meta Reinforcement Finetuning},
  author={Yuxiao Qu* and Matthew Y. R. Yang* and Amrith Setlur and Lewis Tunstall and Edward Emanuel Beeching and Ruslan Salakhutdinov and Aviral Kumar},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={},
}