FormulaCode

Name: FormulaCode
Creator: FormulaCode
License: https://opensource.org/licenses/MIT

Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal ¹,* James Hou ²,* Akanksha Sarkar ³ Ishaan Mantripragada ²

Swarat Chaudhuri ¹ Jennifer J. Sun ³ Yisong Yue ²

¹The University of Texas at Austin ²California Institute of Technology ³Cornell University *Equal contribution

FormulaCode is a continually updating benchmark for evaluating the holistic ability of LLM agents to optimize codebases. The FormulaCode dataset currently consists of 957 tasks scraped from 245,477 pull requests in 70+ compliant repositories.

Try out a FormulaCode task with:

$ uv tool install fc-eval

$ fc-eval run --dataset formulacode --task-id shapely_shapely_2283 --config [your-config.json]

Abstract

Read the paper ↗

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior.

We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics.

FormulaCode is a live benchmark comprising 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task. This enables evaluation of the full optimization lifecycle (triage, diagnosis, and resolution) under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.

How does FormulaCode find code optimization tasks?

Each task starts as a merged GitHub pull request and earns its way through four phases — discover, judge, build, verify — surviving only if it ships a measurable speedup.

Datasmith Documentation ↗

Phase 1 · Discover

01 Scan GitHub for repos with performance benchmarks

datasmith/runners/scrape_repos.py

Find repositories that (1) are ASV-compatible (contain asv.conf.json), (2) are somewhat popular (more than 100 stars), and (3) have mergeable PRs. This is done using a CommonSQL BigQuery script or the GitHub Search API.

gh search

$ gh search repos --filename=asv.conf.json --stars=">=100" --language=python
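The three Discover criteria can be read as a simple predicate over repository metadata. The sketch below is illustrative only: `RepoRecord`, its fields, and the sample repositories are hypothetical and are not taken from datasmith, and "has mergeable PRs" is approximated here as "has at least one merged PR".

```python
from dataclasses import dataclass


@dataclass
class RepoRecord:
    """Hypothetical summary of one candidate repository."""
    full_name: str
    stars: int
    has_asv_config: bool   # asv.conf.json was found in the repository
    merged_pr_count: int   # number of merged pull requests


def is_candidate(repo: RepoRecord, min_stars: int = 100) -> bool:
    """Apply the three Discover criteria to a single repository."""
    return (
        repo.has_asv_config                # (1) ASV-compatible
        and repo.stars > min_stars         # (2) somewhat popular
        and repo.merged_pr_count > 0       # (3) has PRs to mine (assumption)
    )


# Hypothetical sample data, for illustration only.
repos = [
    RepoRecord("example/fast-lib", stars=2500, has_asv_config=True, merged_pr_count=340),
    RepoRecord("example/no-bench", stars=9000, has_asv_config=False, merged_pr_count=1200),
    RepoRecord("example/tiny", stars=12, has_asv_config=True, merged_pr_count=5),
]

candidates = [r.full_name for r in repos if is_candidate(r)]
print(candidates)
```

Only the first repository passes all three checks; the other two fail on the ASV config and the star threshold respectively.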

Dataset Statistics

FormulaCode is updated monthly. Last refreshed on May 12, 2026. For the latest statistics, visit data.formulacode.org.

Data explorer ↗

Total PRs: 245,477
Performance PRs: 13,008
Tasks: 1,584
Repos: 154
PR → Task Rate: 0.65%
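The reported PR → Task Rate is consistent with dividing the task count by the total PR count. A quick check, assuming that is how the rate is defined (the two derived ratios at the end are not reported on the page and are shown only for context):

```python
# Counts copied from the statistics above.
total_prs = 245_477
performance_prs = 13_008
tasks = 1_584

# Fraction of all scraped PRs that end up as benchmark tasks.
pr_to_task_rate = tasks / total_prs
print(f"PR -> Task rate: {pr_to_task_rate:.2%}")

# Derived ratios (not reported in the source; shown for context only).
performance_share = performance_prs / total_prs
task_share_of_performance = tasks / performance_prs
print(f"Performance PRs / total PRs: {performance_share:.2%}")
print(f"Tasks / performance PRs: {task_share_of_performance:.2%}")
```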

Tasks merged per month (chart; x-axis spans Jan 2017 to May 2024)

Repository distribution: Top 7 of 145 repos → 50% of stars · Top 22 → 80% (chart; x-axis: repository rank by star count, descending, #1 to #145)

Problems by repository: 1,584 tasks across 133 repos (chart)
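A "top k repos hold X% of stars" summary like the one in the repository distribution can be computed from a list of star counts by ranking and accumulating. This is a minimal sketch; the function name and the star counts are hypothetical and do not reproduce the real distribution.

```python
from itertools import accumulate


def repos_needed_for_share(star_counts: list[int], share: float) -> int:
    """Smallest k such that the k most-starred repos hold at least
    `share` (a fraction between 0 and 1) of all stars."""
    ranked = sorted(star_counts, reverse=True)
    total = sum(ranked)
    for k, running in enumerate(accumulate(ranked), start=1):
        if running >= share * total:
            return k
    return len(ranked)


# Hypothetical star counts, for illustration only.
stars = [5000, 3000, 1000, 500, 300, 200]

top_for_half = repos_needed_for_share(stars, 0.5)
top_for_eighty = repos_needed_for_share(stars, 0.8)
print(top_for_half, top_for_eighty)
```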

Key Findings

Each finding maps to a specific figure or table in the FormulaCode paper. Charts and tables are rendered live from the analysis pipeline (or marked “data pending” until the export lands).

Agents improve runtime but underperform experts

Every configuration is faster than the original code (geomean speedup > 1x), yet all finish behind the human expert on advantage. The two metrics also disagree — a few easy tasks lift raw speedup, while advantage normalises against the matching expert patch and gives a more honest read.
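To see why a few easy tasks can lift a geomean speedup, consider a minimal sketch. The per-task speedups below are hypothetical, and speedup is assumed to mean baseline time divided by patched time; the benchmark's exact aggregation may differ.

```python
import math


def geomean(values: list[float]) -> float:
    """Geometric mean via the mean of logs (all values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-task speedups (baseline time / patched time).
mostly_flat = [1.0, 1.0, 1.0, 1.0]     # no task improved
two_easy_wins = [1.0, 1.0, 2.0, 2.0]   # two tasks doubled in speed

flat_score = geomean(mostly_flat)
easy_score = geomean(two_easy_wins)
print(round(flat_score, 4), round(easy_score, 4))
```

Half of the tasks are untouched in the second case, yet the aggregate rises to the square root of 2. A metric normalised against the expert patch on the same task would not reward those wins if the expert achieved them too.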

Global leaderboard on FormulaCode-V. Negative advantage = trails human expert; positive = beats expert.
Agent	Model	Advantage	Speedup (geomean)
OpenHands	GPT-5	-0.0186	1.0848x
OpenHands	Qwen 3 Coder	-0.0299	1.0348x
Terminus 2	GPT-5	-0.0490	1.0586x
OpenHands	Claude 4.0 Sonnet	-0.0096	1.0553x
Terminus 2	Qwen 3 Coder	-0.0454	1.0677x
Terminus 2	Gemini 2.5 Pro	-0.0847	1.0549x
Terminus 2	Claude 4.0 Sonnet	-0.0422	1.0975x
Human Expert	(oracle)	+0.0000	1.1193x

Table 1 (Global leaderboard) ↗
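The disagreement between the two metrics is visible in Table 1 itself. The sketch below copies the agent rows from the table and sorts them once by advantage and once by geomean speedup; only the values are from the source, the variable names are illustrative.

```python
# (agent, model, advantage, geomean speedup), copied from Table 1.
rows = [
    ("OpenHands", "GPT-5", -0.0186, 1.0848),
    ("OpenHands", "Qwen 3 Coder", -0.0299, 1.0348),
    ("Terminus 2", "GPT-5", -0.0490, 1.0586),
    ("OpenHands", "Claude 4.0 Sonnet", -0.0096, 1.0553),
    ("Terminus 2", "Qwen 3 Coder", -0.0454, 1.0677),
    ("Terminus 2", "Gemini 2.5 Pro", -0.0847, 1.0549),
    ("Terminus 2", "Claude 4.0 Sonnet", -0.0422, 1.0975),
]

# Higher is better for both metrics.
by_advantage = sorted(rows, key=lambda r: r[2], reverse=True)
by_speedup = sorted(rows, key=lambda r: r[3], reverse=True)

print("rank | by advantage | by speedup")
for rank, (a, s) in enumerate(zip(by_advantage, by_speedup), start=1):
    print(f"{rank} | {a[0]} + {a[1]} | {s[0]} + {s[1]}")
```

OpenHands + Claude 4.0 Sonnet ranks first by advantage but fifth by speedup, while Terminus 2 + Claude 4.0 Sonnet ranks first by speedup and fourth by advantage.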

Local vs global optimization

Each agent–model has a characteristic profile across module → class → function edits. Most are strongest at fine-grained, function-level changes; OpenHands + Claude 4.0 Sonnet inverts the pattern, leading at the module level while ceding ground on smaller-scale edits.

Stratified advantage at each aggregation level for every agent–model configuration. Each line shows whether a configuration favors coarse module-level changes or fine-grained function-level edits.

Legend. Model: Claude 4.0 Sonnet, GPT-5, Qwen 3 Coder, Gemini 2.5 Pro. Agent harness: Terminus 2 (solid), OpenHands (dashed).

Figure 3 (Stratified advantage) ↗

Optimization strategy strengths

Advantage broken out by the strategy the expert used. Agents match or beat experts when the win is parallelisation or batching, but fall behind whenever the human reached for a lower-level library (NumPy, pandas, etc.) or a vectorised primitive. These are the categories where capability, not effort, is the bottleneck.

Per-tag advantage for each agent-model configuration. Cells report human-relative advantage restricted to workloads whose expert patches use the labeled optimization strategy. Red = trails expert, blue = beats expert; scaled per column.

Table 2 (Per-tag advantage) ↗

Long-tail repository performance

Agents are weak on Q1 — the least-popular repos, where experts still extract sizeable wins, hinting at both distribution shift and untouched headroom. They close the gap on mid-popularity Q2–Q3, then dip again on Q4, where even the expert struggles to find anything left to optimise.

Advantage across repository popularity quintiles by GitHub stars, from least popular (Q1) to most popular (Q5). Red = trails expert, blue = beats expert; scaled across all cells.
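One way to form popularity quintiles is to rank repositories by stars and split the ranking into five equal-count bins. This is a sketch under that assumption; the paper may bin differently, and the repository names and star counts are hypothetical.

```python
def popularity_quintiles(stars_by_repo: dict[str, int]) -> dict[str, str]:
    """Label each repo Q1 (least popular) to Q5 (most popular) by star rank,
    using equal-count bins."""
    ranked = sorted(stars_by_repo, key=stars_by_repo.get)  # ascending stars
    n = len(ranked)
    return {repo: f"Q{(i * 5) // n + 1}" for i, repo in enumerate(ranked)}


# Hypothetical repositories with 10, 20, ..., 100 stars.
stars = {f"repo-{i}": 10 * (i + 1) for i in range(10)}
labels = popularity_quintiles(stars)
print(labels)
```

Per-quintile advantage would then be an average over the tasks whose repository carries each label.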

Table 3 (Repository popularity quintiles) ↗

Cost efficiency

The Pareto frontier is dominated by the priciest model (Claude 4.0 Sonnet) — cheaper models burn more tokens inside the agent loop, eroding their per-token savings, and may simply lack the capability to reason about performance edits.

Per-task cost (x) vs. mean advantage over the expert (y) for each agent-model configuration.
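A Pareto frontier over (cost, advantage) keeps every configuration that no other configuration beats on both axes at once. The sketch below shows the standard dominance check; the configuration names and numbers are hypothetical and are not the values from Figure 4 or Table 10.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of non-dominated configurations.

    Each value is (cost, advantage). A point is dominated when some other
    point costs no more AND has advantage no lower, and is strictly better
    on at least one of the two.
    """
    frontier = []
    for name, (cost, adv) in points.items():
        dominated = any(
            (c <= cost and a >= adv) and (c < cost or a > adv)
            for other, (c, a) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (per-task cost, mean advantage) pairs.
points = {
    "config-A": (1.0, -0.05),
    "config-B": (2.0, -0.02),
    "config-C": (2.5, -0.04),  # costs more than B and has lower advantage
    "config-D": (4.0, -0.01),
}

frontier = pareto_frontier(points)
print(frontier)
```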

Legend. Model: Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro, Qwen 3 Coder. Agent harness: OpenHands, Terminus 2.

Figure 4 / Table 10 (Cost-Performance Pareto) ↗

Multi-workload tradeoffs

Experts sit clear of every agent — they accept larger localised regressions as the price of bigger global wins, a tradeoff agents are reluctant to make.

Global speedup (x) vs. worst per-workload speedup (y) for each agent-model configuration.
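The two axes can be illustrated with a small sketch. It assumes the global speedup is a geometric mean over per-workload speedups (the aggregation is not spelled out here) and takes the worst workload as the minimum; both patch profiles are hypothetical.

```python
import math


def global_and_worst(speedups: list[float]) -> tuple[float, float]:
    """Return (geomean speedup, worst per-workload speedup)."""
    geo = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    return geo, min(speedups)


# Hypothetical per-workload speedups for two patches on the same task.
cautious = [1.05, 1.05, 1.05, 1.05]   # small gains, no regressions
bold = [0.80, 1.50, 1.50, 1.50]       # one regression, larger gains elsewhere

cautious_global, cautious_worst = global_and_worst(cautious)
bold_global, bold_worst = global_and_worst(bold)
print(f"cautious: global={cautious_global:.3f} worst={cautious_worst:.2f}")
print(f"bold:     global={bold_global:.3f} worst={bold_worst:.2f}")
```

The bold patch accepts a slowdown on one workload and still ends up with the higher global speedup, which is the kind of tradeoff the text attributes to experts.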

Legend. Model: Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro, Qwen 3 Coder. Agent harness: OpenHands, Terminus 2, Human Expert.

Figure 5 (Multi-workload tradeoff) ↗

Temporal generalization

Performance does not consistently dip on tasks created after each model's knowledge cutoff, so the gap to the expert looks capability-bound rather than the result of seeing the answer at training time.

Mean geomean speedup per model in three-month bins relative to its knowledge cutoff (columns left of center = before cutoff, right = after). Cells share a single color scale; darker blue = higher speedup.
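Binning tasks in three-month windows relative to a knowledge cutoff reduces to a signed month difference and a floor division. A minimal sketch follows; the cutoff date, task dates, and speedups are hypothetical, and day-of-month is ignored for simplicity.

```python
from collections import defaultdict
from datetime import date


def cutoff_bin(task_date: date, cutoff: date, months_per_bin: int = 3) -> int:
    """Signed bin index relative to the cutoff month.

    Negative bins fall before the cutoff; bin 0 is the first window at or
    after it. Floor division keeps negative bins the same width.
    """
    months = (task_date.year - cutoff.year) * 12 + (task_date.month - cutoff.month)
    return months // months_per_bin


# Hypothetical knowledge cutoff and (task date, speedup) pairs.
cutoff = date(2024, 6, 1)
tasks = [
    (date(2024, 2, 10), 1.10),
    (date(2024, 5, 3), 1.04),
    (date(2024, 7, 21), 1.08),
    (date(2024, 9, 1), 1.02),
]

speedups_by_bin: dict[int, list[float]] = defaultdict(list)
for task_date, speedup in tasks:
    speedups_by_bin[cutoff_bin(task_date, cutoff)].append(speedup)

print(dict(sorted(speedups_by_bin.items())))
```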

Table 4 (Temporal generalization) ↗

Leaderboard at a glance

The findings above roll up into the per-agent advantage scores below. No agent configuration we evaluated beats the human experts overall: every overall bar trails. Positive bars beat the human expert on a given level; negative bars trail.

Full leaderboard ↗

Agent	Model	Advantage
OpenHands	Claude 4.0 Sonnet	-0.0096
OpenHands	GPT-5	-0.0186
OpenHands	Qwen 3 Coder	-0.0299
Terminus 2	Claude 4.0 Sonnet	-0.0422
Terminus 2	Qwen 3 Coder	-0.0454
Terminus 2	GPT-5	-0.0490
Terminus 2	Gemini 2.5 Pro	-0.0847

Contribute

FormulaCode is a living code optimization benchmark. Help us cover the long tail by opening a request for a particular data source or a particular model.

Join the FormulaCode Discord ↗

Have an interesting problem?

Paste a GitHub issue or PR that describes a real performance bottleneck. We'll evaluate it and, if it qualifies, add it to the benchmark.

Have an interesting model?

Tell us about a model or agent framework you'd like to see on the leaderboard. We'll prioritize accordingly when running the next sweep.