Training LLMs to Discover Abstractions for Solving Reasoning Problems

1 Carnegie Mellon University · 2 Stanford University · Equal Contribution
Reinforcement Learning through Abstraction Discovery

Figure: Standard reasoning vs. reasoning abstractions. We depict the solution space as a graph of intermediate steps leading to correct or incorrect answers. (1) Standard reasoning explores this space along one sequential chain. (2) We generate textual abstractions by summarizing which intermediate steps led to which outcomes. (3) Such abstractions can be reused to guide reasoning more efficiently.

Abstract

Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement "algorithmic procedures" that can be used to deduce answers to hard problems. Doing so requires reusing primitives, intermediate results, or procedures across multiple problems. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, the depth-first and "brute-force" nature of reasoning traces learned by these models suggests that this is far from a fulfilled promise. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to propose several useful abstractions given a problem, followed by RL training that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and an abstraction-conditioned solution generator. This setup effectively enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that spending more test-time compute on generating abstractions is more beneficial for performance than generating more solutions at large inference-time budgets, illustrating the role of abstractions in guiding global exploration.

Reasoning Abstractions and Why They Are Useful

Solving hard reasoning problems requires more than lengthening chains of thought — it requires reusable insights.

Reasoning abstractions are short natural-language descriptions that capture:

  • Procedural knowledge (e.g., "apply the quadratic formula in modular arithmetic").
  • Factual knowledge (e.g., "a number x has an inverse mod m only if gcd(x, m) = 1").
  • Cautionary patterns (e.g., "avoid assuming a denominator is invertible without checking").

These abstractions summarize what works and what fails across multiple solution attempts. When provided to LLMs:

  • They act like exam hints, guiding the model toward more promising strategies.
  • They improve exploration by broadening the search space beyond sequential brute force.
  • They can generalize across problems — helping models recognize shared substructures or common pitfalls.

Empirically, conditioning on abstractions boosts accuracy and pass@k across math reasoning, ARC program synthesis, and even non-math domains like legal reasoning and healthcare.
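As a minimal sketch of what "conditioning on abstractions" means in practice, the snippet below prepends abstractions to the problem prompt as exam-style hints. The function name `build_prompt` and the prompt wording are illustrative assumptions, not the paper's exact template.

```python
# Hypothetical sketch: conditioning a solver LLM on reasoning abstractions
# by prepending them to the problem prompt. The prompt template here is
# an illustrative assumption, not the paper's exact format.

def build_prompt(problem: str, abstractions: list[str]) -> str:
    """Prepend abstractions to the problem, mimicking exam-style hints."""
    hint_block = "\n".join(f"- {a}" for a in abstractions)
    return (
        "You may find the following hints useful:\n"
        f"{hint_block}\n\n"
        f"Problem: {problem}\n"
        "Solve step by step."
    )

abstractions = [
    "Apply the quadratic formula in modular arithmetic.",
    "A number x has an inverse mod m only if gcd(x, m) = 1.",
]
prompt = build_prompt("Solve x^2 + 3x + 2 = 0 (mod 7).", abstractions)
```

The resulting prompt would then be passed to the solution generator's sampling call in place of the bare problem statement.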


Figure: Examples of good reasoning abstractions in non-math domains. Adding the abstraction to the prompt of GPT-4o-mini consistently improves performance on unseen instances.

RLAD Framework


RLAD jointly trains:

  • Abstraction Generator – proposes problem-specific abstractions.
  • Solution Generator – learns to solve problems by leveraging abstractions.

Training proceeds in two phases:

  • Warm-start with supervised fine-tuning on abstraction–solution pairs from stronger models.
  • Reinforcement learning where abstractions are rewarded if they improve the success rate of solution generation.
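The reward in the second phase can be sketched as the lift in solver success rate attributable to an abstraction. This is a simplified illustration under assumptions of ours (the clipping at zero and the function names are not from the paper; the actual reward shaping may differ):

```python
# Hypothetical sketch of the abstraction reward in RLAD's RL phase:
# an abstraction is rewarded by how much it improves the solution
# generator's success rate relative to solving without it.
# Clipping at 0 (so unhelpful abstractions earn nothing) is our
# simplifying assumption, not necessarily the paper's exact scheme.

def success_rate(outcomes: list[bool]) -> float:
    """Fraction of sampled solutions that are correct."""
    return sum(outcomes) / len(outcomes)

def abstraction_reward(rate_with: float, rate_without: float) -> float:
    """Reward = improvement in solver success rate from conditioning
    on the abstraction, clipped below at zero."""
    return max(0.0, rate_with - rate_without)

# Example: 16 sampled solutions per condition
with_abs = [True] * 10 + [False] * 6      # 10/16 correct with the abstraction
without_abs = [True] * 6 + [False] * 10   # 6/16 correct without it
r = abstraction_reward(success_rate(with_abs), success_rate(without_abs))
# r = 0.625 - 0.375 = 0.25
```

Rewarding abstractions by solver improvement, rather than by any intrinsic property of the text, is what decouples the two players' learning signals.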

Experimental Results

Main Performance Results on Math Reasoning Benchmarks

Approach      |      AIME 2025      |  DeepScaleR [Hard]  |      AMC 2023
              | w/o    avg    best  | w/o    avg    best  | w/o    avg    best
Qwen-3-1.7B   | 33.75  36.25  40.00 | 20.21  22.14  32.50 | 86.41  78.01  84.53
+ DAPO        | 37.92  34.90  39.79 | 21.67  21.88  33.54 | 86.41  81.99  88.44
+ RLAD        | 38.04  42.45  48.33 | 23.54  24.84  35.54 | 87.25  88.35  91.72

Table: Accuracy on math reasoning benchmarks. RLAD achieves consistent gains in both abstraction-conditioned and w/o abstraction settings across AIME 2025, DeepScaleR Hard, and AMC 2023. We report performance without abstractions, with abstractions (pass@1 with 16 samples), and the best abstraction (pass@16).
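For reference, the pass@k numbers in the table can be computed with the standard unbiased estimator (introduced with HumanEval); the code below is a generic sketch of that estimator, not taken from the paper's evaluation code:

```python
# Standard unbiased pass@k estimator: given n sampled solutions of which
# c are correct, estimate the probability that at least one of k randomly
# drawn samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with 16 samples, 6 of them correct:
p1 = pass_at_k(16, 6, 1)    # = 6/16 = 0.375
p16 = pass_at_k(16, 6, 16)  # = 1.0 (all 16 drawn, at least one correct)
```

Under this metric, "w/ abs (avg)" in the table corresponds to pass@1 averaged over 16 samples, and "w/ abs (best)" to pass@16.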

A typical example of a reasoning abstraction proposed by our abstraction generator.


Figure: In the solution, we see references to the abstraction and keywords from the abstraction being used meaningfully in the reasoning trace of the solution generator model.