Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement "algorithmic procedures" that can be used to deduce answers to hard problems. Doing so requires reusing primitives, intermediate results, or procedures across multiple problems. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, the depth-first and "brute-force" nature of reasoning traces learned by these models suggests that this promise remains far from fulfilled. To encourage more effective reasoning, we introduce reasoning abstractions: concise natural-language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to propose several useful abstractions given a problem, followed by RL training that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and an abstraction-conditioned solution generator. This setup effectively enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that spending more test-time compute on generating abstractions is more beneficial for performance than generating more solutions at large inference-time budgets, illustrating the role of abstractions in guiding global exploration.
Solving hard reasoning problems requires more than lengthening chains of thought — it requires reusable insights.
Reasoning abstractions are short natural-language descriptions that capture procedural and factual knowledge useful for solving a class of problems.
These abstractions summarize what works and what fails across multiple solution attempts. When provided to LLMs, they guide reasoning toward successful strategies and enable more structured exploration.
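Concretely, conditioning a solver on an abstraction amounts to placing the abstraction in the prompt alongside the problem. The sketch below illustrates one plausible prompt format; the exact template and the `build_conditioned_prompt` helper are our illustrative assumptions, not the paper's actual prompt.

```python
# Toy sketch of abstraction-conditioned prompting (format is our assumption).

def build_conditioned_prompt(problem: str, abstraction: str) -> str:
    """Prepend a reasoning abstraction to the problem statement."""
    return (
        "Useful abstraction: " + abstraction.strip() + "\n\n"
        "Problem: " + problem.strip() + "\n"
        "Use the abstraction above where helpful, then give the final answer."
    )

prompt = build_conditioned_prompt(
    "Find the number of integers n with 1 <= n <= 100 divisible by 6.",
    "To count multiples of m up to N, compute floor(N / m).",
)
print(prompt)
```

The same problem can be paired with several candidate abstractions, yielding the multiple abstraction-conditioned samples evaluated in the results below.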
Empirically, conditioning on abstractions boosts accuracy and pass@k across math reasoning, ARC program synthesis, and even non-math domains like legal reasoning and healthcare.
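The pass@k numbers reported here can be computed with the standard unbiased estimator: given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 16 samples and 4 correct, pass@1 reduces to c/n.
print(pass_at_k(16, 4, 1))  # 0.25
```

At k = n this reduces to "any sample correct", matching the pass@16 column in the table below.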
Figure: Examples of good reasoning abstractions in non-math domains. Adding the abstraction to the prompt of GPT-4o-mini consistently improves performance on unseen instances.
RLAD jointly trains two players: an abstraction generator, which proposes several candidate abstractions for a given problem, and an abstraction-conditioned solution generator, which is rewarded for solving the problem while using the information provided by those abstractions.
Training proceeds in two phases: first, a model is trained to propose several useful abstractions given a problem; second, RL training incentivizes the solution generator to build solutions using the information provided by these abstractions.
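The two-player reward structure can be illustrated with a toy simulation: the abstraction generator earns reward proportional to the accuracy lift its abstraction gives the solver over an unconditioned baseline. All names, the synthetic solver, and the reward shaping below are our illustrative assumptions, not the paper's exact objective.

```python
import random

def solver_accuracy(problem_difficulty: float, abstraction_quality: float,
                    n_samples: int = 16, seed: int = 0) -> float:
    """Toy solver: success probability rises with abstraction quality."""
    rng = random.Random(seed)
    p_correct = min(1.0, max(0.0, 1.0 - problem_difficulty + abstraction_quality))
    return sum(rng.random() < p_correct for _ in range(n_samples)) / n_samples

def abstraction_reward(problem_difficulty: float, abstraction_quality: float) -> float:
    """Abstraction generator's reward: accuracy lift over the unconditioned solver."""
    base = solver_accuracy(problem_difficulty, 0.0)
    conditioned = solver_accuracy(problem_difficulty, abstraction_quality)
    return conditioned - base

# A helpful abstraction on a hard problem earns positive reward;
# a useless one earns zero, decoupling the two players' learning signals.
r = abstraction_reward(problem_difficulty=0.8, abstraction_quality=0.4)
```

In the real method the solver is itself trained with RL on abstraction-conditioned rollouts; this sketch only conveys why the abstraction generator's signal is a relative, not absolute, accuracy measure.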
Main Performance Results on Math Reasoning Benchmarks
| Approach | AIME 2025 w/o abs (avg) | AIME 2025 w/ abs (avg) | AIME 2025 w/ abs (best) | DeepScaleR [Hard] w/o abs (avg) | DeepScaleR [Hard] w/ abs (avg) | DeepScaleR [Hard] w/ abs (best) | AMC 2023 w/o abs (avg) | AMC 2023 w/ abs (avg) | AMC 2023 w/ abs (best) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-3-1.7B | 33.75 | 36.25 | 40.00 | 20.21 | 22.14 | 32.50 | 86.41 | 78.01 | 84.53 |
| + DAPO | 37.92 | 34.90 | 39.79 | 21.67 | 21.88 | 33.54 | 86.41 | 81.99 | 88.44 |
| + RLAD | 38.04 | 42.45 | 48.33 | 23.54 | 24.84 | 35.54 | 87.25 | 88.35 | 91.72 |
Table: Accuracy on math reasoning benchmarks. RLAD achieves consistent gains in both abstraction-conditioned and w/o abstraction settings across AIME 2025, DeepScaleR Hard, and AMC 2023. We report performance without abstractions, with abstractions (pass@1 with 16 samples), and the best abstraction (pass@16).
A typical example of a reasoning abstraction proposed by our abstraction generator.
Figure: The solution explicitly references the abstraction, and keywords from the abstraction are used meaningfully throughout the solution generator's reasoning trace.