Accepted at ICML 2026

FormulaCode

Evaluating Agentic Optimization on Large Codebases

(1) The University of Texas at Austin · (2) California Institute of Technology · (3) Cornell University · (*) Equal contribution

FormulaCode is a continually updating benchmark for evaluating the holistic ability of LLM agents to optimize codebases. The FormulaCode dataset currently consists of 957 tasks scraped from 245,477 pull requests in 70+ compliant repositories.

Try out a FormulaCode task with:

$ uv tool install fc-eval
$ fc-eval run --dataset formulacode --task-id shapely_shapely_2283 --config [your-config.json]

Abstract

Read the paper ↗

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior.

We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics.

FormulaCode is a live benchmark comprising 957 performance bottlenecks mined from scientific Python repositories on GitHub. Each bottleneck is paired with an expert-authored patch and, on average, 264.6 community-maintained performance workloads per task. This enables evaluation of the full optimization lifecycle (triage, diagnosis, and resolution) under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.

How does FormulaCode find code optimization tasks?

Each task starts as a merged GitHub pull request and earns its way through four phases — discover, judge, build, verify — surviving only if it ships a measurable speedup.

Datasmith Documentation ↗
Phase 1 · Discover

01 Scan GitHub for repos with performance benchmarks

datasmith/runners/scrape_repos.py

Find repositories that (1) are ASV-compatible (contain asv.conf.json), (2) are somewhat popular (have >100 stars), and (3) have mergeable PRs. This is done using a CommonSQL BigQuery script or the GitHub Search API.

gh search
$ gh search repos --filename=asv.conf.json --stars=">=100" --language=python
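The three criteria above can be read as a simple predicate over candidate repositories. The following is a minimal sketch under stated assumptions: the `RepoCandidate` record, its field names, and the use of a merged-PR count as a proxy for "has mergeable PRs" are hypothetical and are not taken from datasmith.

```python
from dataclasses import dataclass


@dataclass
class RepoCandidate:
    # Hypothetical record; field names are illustrative, not datasmith's schema.
    full_name: str
    stars: int
    has_asv_conf: bool    # an asv.conf.json file was found in the repository
    merged_pr_count: int  # assumed proxy for "has mergeable PRs"


MIN_STARS = 100  # the text asks for more than 100 stars


def is_compliant(repo: RepoCandidate) -> bool:
    """Return True when a repository meets all three discovery criteria."""
    return (
        repo.has_asv_conf
        and repo.stars > MIN_STARS
        and repo.merged_pr_count > 0
    )


# Made-up candidates used only to exercise the predicate.
candidates = [
    RepoCandidate("example/fast-lib", 2500, True, 340),
    RepoCandidate("example/tiny-lib", 40, True, 12),
    RepoCandidate("example/no-bench", 9000, False, 800),
]

compliant = [repo.full_name for repo in candidates if is_compliant(repo)]
print(compliant)
```

Keeping the criteria in one predicate makes it easy to tighten a threshold later without touching the scraping code.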

Dataset Statistics

FormulaCode is updated monthly. Last refreshed on May 12, 2026. For the latest statistics, visit data.formulacode.org.

Data explorer ↗
Total PRs: 245,477
Performance PRs: 13,008
Tasks: 1,584
Repos: 154
PR → Task Rate: 0.65%
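The PR → Task Rate can be reproduced from the counts reported in this section. The short check below uses only those published numbers. The two additional ratios are derived here for illustration and are not quoted from the page.

```python
# Counts reported in the Dataset Statistics section.
total_prs = 245_477
performance_prs = 13_008
tasks = 1_584


def pct(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


pr_to_task_rate = pct(tasks, total_prs)          # share of all PRs that become tasks
perf_pr_share = pct(performance_prs, total_prs)  # share of PRs flagged as performance PRs
perf_to_task_rate = pct(tasks, performance_prs)  # share of performance PRs that survive

print(f"PR -> Task rate: {pr_to_task_rate:.2f}%")  # prints 0.65%
print(f"Performance PR share: {perf_pr_share:.2f}%")
print(f"Performance PR -> Task rate: {perf_to_task_rate:.2f}%")
```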
[Chart: Tasks merged per month; x-axis ticks from Jan 17 to May 24]
[Chart: Repository distribution; x-axis: repository rank by star count (descending), #1 to #145. Top 7 of 145 repos → 50% of stars · Top 22 → 80%]
[Chart: Problems by repository; 1,584 tasks across 133 repos]

Key Findings

Each finding maps to a specific figure or table in the FormulaCode paper. Charts and tables are rendered live from the analysis pipeline (or marked “data pending” until the export lands).


Agents improve runtime but underperform experts

Every configuration is faster than the original code (geomean speedup > 1x), yet all finish behind the human expert on advantage. The two metrics also disagree — a few easy tasks lift raw speedup, while advantage normalises against the matching expert patch and gives a more honest read.

Global leaderboard on FormulaCode-V. Negative advantage = trails human expert; positive = beats expert.
| Agent | Model | Advantage | Speedup (geomean) |
|---|---|---|---|
| OpenHands | GPT-5 | -0.0186 | 1.0848x |
| OpenHands | Qwen 3 Coder | -0.0299 | 1.0348x |
| Terminus 2 | GPT-5 | -0.0490 | 1.0586x |
| OpenHands | Claude 4.0 Sonnet | -0.0096 | 1.0553x |
| Terminus 2 | Qwen 3 Coder | -0.0454 | 1.0677x |
| Terminus 2 | Gemini 2.5 Pro | -0.0847 | 1.0549x |
| Terminus 2 | Claude 4.0 Sonnet | -0.0422 | 1.0975x |
| Human Expert (oracle) | | +0.0000 | 1.1193x |
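To make the speedup column concrete, here is a minimal sketch of a geometric-mean speedup, with made-up per-task values showing how a single easy task can lift the aggregate. The advantage formula is not stated on this page, so it is deliberately not sketched here.

```python
import math


def geomean(values: list[float]) -> float:
    """Geometric mean of strictly positive values, computed in log space."""
    if not values:
        raise ValueError("geomean requires at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geomean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-task speedups: three unchanged tasks and one easy 4x win.
speedups = [1.0, 1.0, 1.0, 4.0]

aggregate = geomean(speedups)
print(f"geomean speedup: {aggregate:.4f}x")  # 4 ** 0.25, about 1.4142x
```

Even though three of the four tasks saw no improvement, the aggregate sits well above 1x, which is the effect the paragraph above attributes to raw speedup.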

Local vs global optimization

Each agent–model configuration has a characteristic profile across module → class → function edits. Most are strongest at fine-grained, function-level changes; OpenHands + Claude 4.0 Sonnet inverts the pattern, leading at the module level while ceding ground on smaller-scale edits.

Stratified advantage at each aggregation level for every agent–model configuration. Each line shows whether a configuration favors coarse module-level changes or fine-grained function-level edits.
[Line chart: stratified advantage (y-axis, -0.15 to +0.35) at the Module, Class, and Function levels (x-axis). Models: Claude 4.0 Sonnet, GPT-5, Qwen 3 Coder, Gemini 2.5 Pro. Agent harness: Terminus 2 (solid), OpenHands (dashed).]

Optimization strategy strengths

Advantage broken out by the strategy the expert used. Agents match or beat experts when the win comes from parallelisation or batching, but fall behind whenever the human reached for a lower-level library (numpy, pandas, etc.) or a vectorised primitive. These are the categories where capability, not effort, is the bottleneck.

Per-tag advantage for each agent-model configuration. Cells report human-relative advantage restricted to workloads whose expert patches use the labeled optimization strategy. Red = trails expert, blue = beats expert; scaled per column.
[Heatmap: rows are agent-model configurations (OpenHands + Claude 4.0 Sonnet, OpenHands + GPT-5, OpenHands + Qwen 3 Coder, Terminus 2 + Claude 4.0 Sonnet, Terminus 2 + GPT-5, Terminus 2 + Gemini 2.5 Pro, Terminus 2 + Qwen 3 Coder); columns are optimization strategies (Parallelization, Batching, Caching, Lower-level implementations, Algorithmic rewrites, Data-structure changes, Reduce work, Higher-level, Micro-optimizations). Annotated regions: "Agents lead" and "Agents struggle". Color scale runs from "Expert wins" to "Agent wins".]

Long-tail repository performance

Agents are weak on Q1 — the least-popular repos, where experts still extract sizeable wins, hinting at both distribution shift and untouched headroom. They close the gap on mid-popularity Q2–Q3, then dip again on Q4, where even the expert struggles to find anything left to optimise.

Advantage across repository popularity quintiles by GitHub stars, from least popular (Q1) to most popular (Q5). Red = trails expert, blue = beats expert; scaled across all cells.
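[Heatmap: rows are agent-model configurations (OpenHands + Claude 4.0 Sonnet, OpenHands + GPT-5, OpenHands + Qwen 3 Coder, Terminus 2 + Claude 4.0 Sonnet, Terminus 2 + GPT-5, Terminus 2 + Gemini 2.5 Pro, Terminus 2 + Qwen 3 Coder); columns are popularity quintiles Q1 to Q5. Annotated regions: "Agents competitive", "Agents struggle", and "Headroom-limited dip". Color scale runs from "Expert wins" to "Agent wins".]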
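The quintile split by GitHub stars can be sketched as a rank-based binning. This is an illustration only: equal-count bins and the tie-breaking by repository name are assumptions, and the repository names and star counts below are made up.

```python
def assign_quintiles(stars_by_repo: dict[str, int]) -> dict[str, str]:
    """Map each repository to Q1 (least popular) .. Q5 (most popular).

    Assumes equal-count bins over the star ranking; ties are broken by name.
    """
    ranked = sorted(stars_by_repo, key=lambda name: (stars_by_repo[name], name))
    n = len(ranked)
    # Rank i of n falls into bin floor(5 * i / n), which ranges over 0..4.
    return {name: f"Q{(i * 5) // n + 1}" for i, name in enumerate(ranked)}


# Hypothetical repositories with 10, 20, ..., 100 stars.
stars = {f"repo{i}": (i + 1) * 10 for i in range(10)}
quintiles = assign_quintiles(stars)

for name in sorted(stars, key=stars.get):
    print(name, stars[name], quintiles[name])
```

Per-quintile advantage would then be an aggregate over the tasks whose repository falls in each bin.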

Cost efficiency

The Pareto frontier is dominated by the priciest model (Claude 4.0 Sonnet) — cheaper models burn more tokens inside the agent loop, eroding their per-token savings, and may simply lack the capability to reason about performance edits.

Per-task cost (x) vs. mean advantage over the expert (y) for each agent-model configuration.
[Scatter plot: mean cost in USD per task (x-axis, $0.00 to $8.00, lower is better) vs. mean advantage (y-axis, -0.100 to +0.000, higher is better). Models: Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro, Qwen 3 Coder. Agent harness: OpenHands, Terminus 2.]
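For readers unfamiliar with the term, a configuration is on the Pareto frontier when no other configuration is both cheaper and higher in advantage. The sketch below computes that frontier for a handful of hypothetical (cost, advantage) points. The labels and numbers are placeholders, not the values plotted above.

```python
import math


def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return labels of non-dominated points.

    Each point is (label, cost, advantage); lower cost and higher
    advantage are both better.
    """
    frontier = []
    best_advantage = -math.inf
    # Walk from cheapest to priciest; on equal cost, try higher advantage first.
    for label, cost, advantage in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point survives only if it beats every cheaper (or equal-cost) point.
        if advantage > best_advantage:
            frontier.append(label)
            best_advantage = advantage
    return frontier


# Hypothetical configurations: (label, USD per task, mean advantage).
configs = [
    ("A", 1.0, -0.08),
    ("B", 2.0, -0.05),
    ("C", 3.0, -0.06),  # dominated by B: costs more and scores lower
    ("D", 6.0, -0.01),
]

frontier = pareto_frontier(configs)
print(frontier)  # ['A', 'B', 'D']
```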

Multi-workload tradeoffs

Experts sit clear of every agent — they accept larger localised regressions as the price of bigger global wins, a tradeoff agents are reluctant to make.

Global speedup (x) vs. worst per-workload speedup (y) for each agent-model configuration.
[Scatter plot: global speedup (x-axis, 1.040x to 1.100x, higher is better) vs. worst-workload speedup (y-axis, 0.860x to 1.000x). Models: Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro, Qwen 3 Coder. Agent harness: OpenHands, Terminus 2, Human Expert.]
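The two axes can be illustrated with a small calculation over per-workload timings. This is a sketch under assumptions: speedup is taken as baseline time divided by patched time, and global speedup is taken as the geometric mean over workloads. The workload names and timings are made up.

```python
import math


def workload_speedups(
    baseline: dict[str, float], patched: dict[str, float]
) -> dict[str, float]:
    """Per-workload speedup, assumed to be baseline time / patched time."""
    return {name: baseline[name] / patched[name] for name in baseline}


def summarize(speedups: dict[str, float]) -> tuple[float, float]:
    """Return (global speedup as a geometric mean, worst per-workload speedup)."""
    values = list(speedups.values())
    global_speedup = math.exp(sum(math.log(v) for v in values) / len(values))
    return global_speedup, min(values)


# Hypothetical timings in seconds: two workloads get 2x faster, one regresses.
baseline_times = {"bench_a": 10.0, "bench_b": 10.0, "bench_c": 10.0}
patched_times = {"bench_a": 5.0, "bench_b": 5.0, "bench_c": 12.5}

global_speedup, worst_speedup = summarize(
    workload_speedups(baseline_times, patched_times)
)
print(f"global: {global_speedup:.4f}x, worst workload: {worst_speedup:.2f}x")
```

In this made-up case the patch accepts a localised regression (0.80x on one workload) in exchange for a larger global win, which is the kind of tradeoff the paragraph above describes.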

Temporal generalization

Performance does not consistently dip on tasks created after each model's knowledge cutoff, so the gap to the expert looks capability-bound rather than the result of seeing the answer at training time.

Mean geomean speedup per model in three-month bins relative to its knowledge cutoff (columns left of center = before cutoff, right = after). Cells share a single color scale; darker blue = higher speedup.
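[Heatmap: rows are models (Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro); columns are bins relative to the knowledge cutoff (6+ mo before, 3–6 mo before, 0–3 mo before, 0–3 mo after, 3–6 mo after, 6+ mo after), with the knowledge cutoff marked at the center. Color scale runs from lower speedup to higher speedup.]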
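The three-month binning relative to a knowledge cutoff can be sketched as follows. The dates are hypothetical, and treating a month as 30.4375 days is an assumption made for this illustration. The analysis pipeline may bin differently.

```python
from datetime import date

DAYS_PER_MONTH = 30.4375  # assumed average month length (365.25 / 12)


def cutoff_bin(task_date: date, cutoff: date) -> str:
    """Label a task by its distance from a model's knowledge cutoff."""
    months = (task_date - cutoff).days / DAYS_PER_MONTH
    side = "after" if months >= 0 else "before"
    magnitude = abs(months)
    if magnitude < 3:
        span = "0-3 mo"
    elif magnitude < 6:
        span = "3-6 mo"
    else:
        span = "6+ mo"
    return f"{span} {side}"


# Hypothetical cutoff and task dates.
cutoff = date(2025, 1, 1)
labels = [
    cutoff_bin(date(2025, 2, 15), cutoff),  # 45 days after the cutoff
    cutoff_bin(date(2024, 8, 1), cutoff),   # 153 days before the cutoff
    cutoff_bin(date(2023, 1, 1), cutoff),   # two years before the cutoff
]
print(labels)
```

Averaging the geomean speedup within each label, per model, would give one row of the heatmap described above.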

Leaderboard at a glance

The findings above roll up into the per-agent advantage scores below. No agent we evaluated beats the human experts overall: every overall bar trails. Positive bars beat the human expert on a given level; negative bars trail.

Full leaderboard ↗
OpenHands + Claude 4.0 Sonnet: -0.0096
OpenHands + GPT-5: -0.0186
OpenHands + Qwen 3 Coder: -0.0299
Terminus 2 + Claude 4.0 Sonnet: -0.0422
Terminus 2 + Qwen 3 Coder: -0.0454
Terminus 2 + GPT-5: -0.0490
Terminus 2 + Gemini 2.5 Pro: -0.0847

Contribute

FormulaCode is a living code optimization benchmark. Help us cover the long tail by opening a request for a particular data source or a particular model.

Join the FormulaCode Discord ↗

Have an interesting problem?

Paste a GitHub issue or PR that describes a real performance bottleneck. We'll evaluate it and, if it qualifies, add it to the benchmark.

Have an interesting model?

Tell us about a model or agent framework you'd like to see on the leaderboard. We'll prioritize accordingly when running the next sweep.