Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or by running RL with a 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Will they continue to scale as the test-time budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective lets us view the long output stream from the LLM as consisting of several episodes run at test time and leads us to a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms best trade off exploration and exploitation over training, minimizing cumulative regret also provides the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with outcome-reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
Example (easy problem): 2+2=?

<think>
Okay, so I need to figure out what 2 plus 2 equals.
Let me count them out. One, two, and then three, four...
</think>
Step-by-Step Explanation:
Answer: 4

Example (hard problem): Show that the inequality ... holds for all real numbers ...

<think>
Okay, so I need to show that for any real numbers ...
Alternatively, perhaps there's a way to pair terms or use symmetry?
Alternatively, could we relate this inequality to some function property?
...
Time is up
</think>
Step-by-Step Explanation:

(Responses generated by DeepSeek-R1; traces abridged.)
Desideratum 1: Don’t spend too many tokens on easy questions!
On easy problems, we expect the model to quickly arrive at the correct answer without spending too many tokens. However, recent models such as DeepSeek-R1, when asked a simple question like '2+2=?', eventually arrive at the correct answer, but only after an unnecessarily involved process: they first perform the direct calculation, then second-guess themselves and re-evaluate the problem from multiple angles, as in the trace above.
Desideratum 2: Keep making progress on hard questions!
On hard problems, even when the given token budget is not sufficient to solve the problem, we expect the model to make progress towards the correct answer. However, recent models such as DeepSeek-R1, when asked a hard question like the inequality above, frequently switch logical direction, stringing together phrases like "Alternatively, ..." rather than committing to and deepening a single line of attack.
In this study, we wanted to understand how effectively state-of-the-art models use their "thinking time" when solving complex problems. We designed an experiment comparing three distinct approaches.
We tested these approaches on challenging mathematics problems from AIME 2024 and the OmniMATH dataset, measuring performance at different reasoning depths to see which method produced the best results.
We've reformulated the optimization of test-time compute through the lens of meta-reinforcement learning. By viewing an LLM's output as a sequence of meaningful episodes (attempts, verifications, etc.), we can optimize how the model uses test-time compute in a budget-agnostic way.
Budget-Agnostic
The model learns to be effective regardless of the specific token budget it's given at deployment time. Rather than being optimized for a single fixed token limit, our approach enables LLMs to adapt their reasoning strategy based on available compute, making steady progress with each episode. This flexibility allows the same model to perform well across different compute constraints without needing separate training for each potential budget.
We propose to use cumulative regret to measure how effectively a model makes progress toward solving problems.
Cumulative Regret: Our Key Metric
Given k episodes z generated from π(·|x), another LLM μ that computes an estimate of the correct response given episodes so far, and the optimal comparator policy given a j-episode budget as π*j, we define cumulative regret parameterized by μ as:
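A reconstruction of this definition, consistent with the term descriptions below, is given next; this is our rendering, and the paper's exact form (e.g., any normalization by k) may differ:

$$\Delta^{\mu}_{k}(x;\pi) \;=\; \sum_{j=0}^{k-1}\Big[\, J_r\big(\pi^{*}_{j}\big) \;-\; \mathbb{E}_{z_{0:j-1}\sim\pi(\cdot\mid x)}\, J_r\big(\mu(\cdot\mid x,\, z_{0:j-1})\big)\Big]$$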
Here Jr denotes the expected 0/1 outcome reward attained by LLM μ when conditioning on prior episodes z0:j-1 produced by π, and Jr(π*) denotes the reward attained by the best possible budget-agnostic comparator π* within a j-episode test-time budget.
While cumulative regret is an ideal metric for our objective, we cannot optimize it directly during training. The fundamental challenge is that computing cumulative regret requires access to an optimal comparator policy (π*) that achieves the highest possible reward within each episode budget. However, this optimal policy is unknown and unavailable—if we had access to it, we wouldn't need to train our model in the first place.
To overcome this limitation, we introduce progress as a practical, measurable surrogate:
Progress: A Practical Alternative
Given prior context c and episode zj ~ π(·|c), and another meta-prover LLM μ that computes an estimate of the correct response, we define progress made by zj as:
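Consistent with the abstract's description of progress as the change in the likelihood of eventual success, this can be rendered as (our notation; the paper's exact form may differ):

$$r^{\mu}_{\mathrm{prg}}(z_j;\, c) \;=\; J_r\big(\mu(\cdot\mid c,\, z_j)\big) \;-\; J_r\big(\mu(\cdot\mid c)\big)$$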
Progress measures how much each additional episode contributes to improving the model's performance. Intuitively, by maximizing progress across episodes, we're encouraging the model to continually improve its understanding and approach to the problem—which naturally minimizes cumulative regret over time.
Our MRT approach incorporates progress as a dense reward bonus during training. The standard fine-tuning loss based on the expected final reward is:
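In standard notation (our rendering: x is a prompt from the training distribution D, y a full sampled response, and r the 0/1 outcome reward):

$$\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_{\theta}(\cdot\mid x)}\big[\, r(x, y)\,\big]$$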
We extend this to incorporate progress, giving rise to the MRT training objective:
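A sketch of the extended objective, assuming a weighting coefficient α on the per-episode progress bonus (again our rendering; the paper's exact form may differ):

$$\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; z_{0:k}\sim\pi_{\theta}(\cdot\mid x)}\Big[\, r(x,\, z_{0:k}) \;+\; \alpha \sum_{j=1}^{k} r^{\mu}_{\mathrm{prg}}\big(z_j;\, x,\, z_{0:j-1}\big)\Big]$$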
Unlike traditional reinforcement learning, which only rewards final outcomes, MRT also rewards the progress made by each intermediate episode, providing a dense learning signal throughout the output stream.
By training models to minimize cumulative regret through our progress-based reward mechanism, we create LLMs that make efficient use of their test-time compute resources across both simple and complex problems.
We implement the meta-reinforcement learning paradigm on top of online reinforcement learning methods such as GRPO. Here's how it works:
Meta-Prover Policy
We define a meta-prover policy μ that evaluates how much each episode contributes to solving the problem. Given the prompt and the episodes produced so far, μ computes an estimate of the correct response, and its expected 0/1 reward serves as an estimate of the current likelihood of success.
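As a concrete illustration, here is a minimal sketch of how such a success estimate could be computed, assuming μ is queried by appending the episodes produced so far, closing the think block to force an immediate answer, and sampling a few best-guess solutions. The helpers `generate_answer` and `is_correct` are hypothetical placeholders, not part of any released codebase.

```python
# Minimal sketch (not the authors' implementation): estimating the meta-prover's
# success probability J_r given the episodes generated so far.

from typing import Callable, List

def estimate_success(
    prompt: str,
    episodes: List[str],
    generate_answer: Callable[[str], str],  # hypothetical: sample one answer given a context
    is_correct: Callable[[str], bool],      # hypothetical: grade an answer against ground truth
    num_samples: int = 4,
) -> float:
    """Monte Carlo estimate of J_r(mu(. | prompt, episodes))."""
    # Assumption: appending "</think>" forces the model to stop thinking and answer now.
    context = prompt + "".join(episodes) + "\n</think>\n"
    hits = [is_correct(generate_answer(context)) for _ in range(num_samples)]
    return sum(hits) / num_samples
```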
For each episode in the reasoning process, we estimate the progress it makes, i.e., the change in μ's estimated likelihood of success before and after that episode.
During training, we optimize the MRT objective introduced above, which combines the standard outcome reward with our progress-based dense reward bonus.
While this procedure can be implemented with episode-specific reward bonuses or a single progress-adjusted reward, we opt for the latter approach due to its plug-and-play nature in current outcome-reward RL implementations.
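Below is a minimal sketch (not the authors' code) of what a single progress-adjusted reward could look like, using the success estimate sketched earlier bound to a fixed prompt (e.g., via functools.partial) and an assumed weight `alpha` on the dense bonus; the returned scalar can then replace the plain 0/1 reward in an outcome-reward RL loop such as GRPO.

```python
# Minimal sketch (assumptions flagged inline): collapsing per-episode progress
# into a single progress-adjusted reward for one rollout.

from typing import Callable, List

def progress_adjusted_reward(
    episodes: List[str],                              # rollout split into episodes z_1, ..., z_k
    outcome_reward: float,                            # 0/1 reward of the final answer
    estimate_success: Callable[[List[str]], float],   # J_r of the meta-prover given a prefix of episodes
    alpha: float = 0.2,                               # assumed weight on the progress bonus
) -> float:
    """Return outcome reward + alpha * sum_j (J_r after episode j - J_r before episode j)."""
    bonus = 0.0
    prev = estimate_success([])                       # success estimate before any episode
    for j in range(len(episodes)):
        cur = estimate_success(episodes[: j + 1])
        bonus += cur - prev                           # progress made by episode j+1
        prev = cur
    return outcome_reward + alpha * bonus
```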
| Model | AIME 2024 | AIME 2025 | AMC 2023 | Minerva MATH | MATH500 | Avg. |
|---|---|---|---|---|---|---|
| DeepScaleR-1.5B-Preview | 42.8 | 36.7 | 83.0 | 24.6 | 85.2 | 54.5 |
| + outcome-reward RL | 44.5 (+1.7) | 39.3 (+2.6) | 81.5 (-1.5) | 24.7 | 84.9 | 55.0 (+0.5) |
| + length penalty | 40.3 (-2.5) | 30.3 (-6.4) | 77.3 (-5.7) | 23.0 | 83.2 | 50.8 (-3.7) |
| + MRT (Ours) | 47.2 (+4.4) | 39.7 (+3.0) | 83.1 (+0.1) | 24.2 | 85.1 | 55.9 (+1.4) |
| R1-Distill-Qwen-1.5B | 28.7 | 26.0 | 69.9 | 19.8 | 80.1 | 44.9 |
| + outcome-reward RL | 29.8 (+1.1) | 27.3 (+1.3) | 70.5 (+0.6) | 22.1 | 80.3 | 46.0 (+1.1) |
| + MRT (Ours) | 30.3 (+1.6) | 29.3 (+3.3) | 72.9 (+3.0) | 22.5 | 80.3 | 47.1 (+2.2) |
Key Findings: MRT attains the highest average accuracy for both base models, a roughly 2-3x larger relative gain over the base model than outcome-reward RL, while adding an explicit length penalty degrades accuracy.
Beyond improving accuracy, Meta Reinforcement Fine-tuning (MRT) significantly enhances token efficiency when solving complex problems. We evaluated this by comparing models trained with MRT against baseline models using standard outcome-reward RL.
Our evaluation methodology involved training the model with a 16K-token output budget and computing maj@k over multiple reasoning and solution traces. By plotting the tradeoff between maj@k and the total number of tokens used across the k generations, we obtain a robust estimate of the model's performance per token generated during reasoning.
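For concreteness, here is a minimal sketch of the two quantities on this tradeoff curve (our assumed setup, not the authors' evaluation code):

```python
# Minimal sketch: maj@k accuracy and total tokens spent across the first k samples,
# which together give one point on the efficiency tradeoff curve.

from collections import Counter
from typing import List

def maj_at_k(answers: List[str], correct: str, k: int) -> float:
    """1.0 if the majority vote over the first k sampled answers matches the reference."""
    majority, _ = Counter(answers[:k]).most_common(1)[0]
    return float(majority == correct)

def tokens_at_k(token_counts: List[int], k: int) -> int:
    """Total tokens generated across the first k samples."""
    return sum(token_counts[:k])

# Averaging maj_at_k over problems and plotting it against the average tokens_at_k
# for k = 1, 2, ..., 10 reproduces a curve like the one in the figure below.
```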
MRT results on DeepSeek-R1-Distill-Qwen-1.5B. We plot maj@k performance of models for k = 1, 2, ..., 10 on AIME 2024 (left) and MATH500 (right). The orange lines correspond to MRT and the green lines correspond to outcome-reward training.
Key Findings: At matched token budgets, MRT reaches higher maj@k accuracy than outcome-reward RL on AIME 2024 and MATH500, amounting to roughly a 1.5x gain in token efficiency.
We measure regret against a theoretical optimal policy π* that achieves perfect accuracy in a single episode. To compare different fine-tuning algorithms fairly, we reparameterize regret as a function of token budget C0 rather than the number of episodes.
For any given token budget C0, we measure regret by computing:
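One plausible rendering, assuming π* attains reward 1 (perfect accuracy in a single episode, as stated above) and a grid of intermediate budgets c_1 < ... < c_m = C0 at which the model is forced to produce its current best answer (this is our reconstruction, not the paper's exact formula):

$$\Delta_{C_0}(x;\pi)\;\approx\;\frac{1}{m}\sum_{i=1}^{m}\Big[\,1 \;-\; J_r\big(\mu(\cdot\mid x,\, z_{:c_i})\big)\Big]$$

where z_{:c_i} denotes the model's output truncated to the first c_i tokens.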
Normalized regret of different algorithms at different deployment @token budgets C0. The first four points are at budgets 4096, 8192, 12288, and 16384. The next four points in dashed lines are extrapolations to C0 = 20480, 24576, 28672, and 32768, which correspond to 2, 4, 6, and 8 extensions of the output trace, following the budget forcing technique in s1. We conduct this study on AIME 2025. Observe that MRT leads to the smallest normalized regret, both when evaluating within the maximal budget and when extrapolating to larger budgets.
Key Findings: MRT attains the smallest normalized regret of the compared fine-tuning methods, both when evaluated within the maximal training budget and when extrapolated to larger budgets via budget forcing.
We studied the relationship between progress (the dense reward signal used in MRT) and response length, which has been considered a crucial factor in recent advances from DeepSeek and others. We investigated two key questions:
Question 1: How does length evolve during training with MRT and outcome-reward RL over a fixed prompt distribution?
Evolution of length during RL training. Length largely oscillates around similar values for most of training, after an initial increase from the initialization length.
Key Findings: For both MRT and outcome-reward RL, response length largely oscillates around similar values for most of training after an initial increase from the initialization length.
Question 2: Can the benefits of increasing the output token budget be explained by enhanced progress?
Despite the supposed gains from running RL training with a large output budget from the start, several studies have found that training at higher budgets results in inefficient use of compute. Concurrent work finds that a more effective approach is to start RL training with a smaller output token budget and progressively increase it over the course of training (e.g., DeepScaleR's 8K-to-16K schedule).
This raises the question: What benefits does such a "curriculum" over output token budget provide?
Left: DeepScaleR's average response length and training rewards as training progresses. Right: Regret for the 8K and 16K DeepScaleR checkpoints at different budgets. For budgets beyond 8192, we calculate the normalized regret of the 8K checkpoint by extrapolating it with budget forcing. At nearly all budgets, the 8K checkpoint shows lower normalized regret, indicating better progress.
Our analysis of this token-budget curriculum reveals the following.
Key Findings: At nearly all budgets, the 8K checkpoint attains lower normalized regret than the 16K checkpoint, suggesting that the benefits of increasing the output token budget over training are largely explained by enhanced progress.
@misc{qu2025optimizing,
  title={Optimizing Test-Time Compute via Meta Reinforcement Finetuning},
  author={Yuxiao Qu* and Matthew Y. R. Yang* and Amrith Setlur and Lewis Tunstall and Edward Emanuel Beeching and Ruslan Salakhutdinov and Aviral Kumar},
  year={2025},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={},
}