
fc-eval
Run frontier LLM agents against the FormulaCode benchmark. Spins up reproducible Docker environments, verifies correctness against the unit-test suite, and computes per-workload speedup, advantage, and stratified scores.
FormulaCode consists of two parts: a pipeline to construct performance optimization tasks, and an execution harness that connects a language model to our terminal sandbox.


The pipeline for curating FormulaCode's tasks from real GitHub repositories. The code for scraping, filtering, building, and verifying high-quality performance PRs is maintained here.
Two subdomains expose the live task and run database. Uptime is not guaranteed: these are research endpoints and are sometimes rebuilt mid-week. For reproducible evaluation, prefer the static CSV that ships with this site.
Read-only Supabase REST API. Tables: repositories, pull_requests, candidate_containers, harbor_runs. An anonymous key is required (see the fc-eval docs).
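As a rough illustration of reading one of these tables, the sketch below builds a GET request in the style Supabase REST endpoints usually accept. It rests on several assumptions: the base URL is a placeholder (the real subdomain is not given here), the `/rest/v1/` path and the `apikey` / `Authorization` headers follow the common Supabase convention and may differ for this deployment, and `build_request` and `fetch_rows` are hypothetical helper names, not part of fc-eval.

```python
import json
import urllib.parse
import urllib.request

# Table names as listed in the text above.
TABLES = ("repositories", "pull_requests", "candidate_containers", "harbor_runs")


def build_request(base_url, table, anon_key, limit=5):
    """Build a read-only GET request for the first `limit` rows of `table`."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    query = urllib.parse.urlencode({"select": "*", "limit": limit})
    # Assumption: the endpoint uses the usual Supabase /rest/v1/ path layout.
    url = f"{base_url.rstrip('/')}/rest/v1/{table}?{query}"
    headers = {
        # Assumption: the anonymous key is sent in both headers, as is
        # conventional for Supabase projects.
        "apikey": anon_key,
        "Authorization": f"Bearer {anon_key}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_rows(base_url, table, anon_key, limit=5):
    """Send the request and decode the JSON response body."""
    req = build_request(base_url, table, anon_key, limit)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder values: substitute the real subdomain and anonymous key.
    req = build_request("https://api.example.invalid", "harbor_runs", "ANON_KEY", limit=3)
    print(req.full_url)
```

Splitting request construction from the network call lets the URL and headers be inspected without touching the endpoint, which is convenient given that uptime is not guaranteed.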
Browseable Supabase Studio with the live task and run tables. Useful for ad-hoc inspection and SQL.