Agents improve runtime but underperform experts
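Every agent configuration is faster than the original code (geomean speedup > 1x), yet all of them finish behind the human expert on both metrics: each advantage is negative, and no speedup reaches the expert's 1.1193x. The two metrics also rank the agents differently. A few easy tasks can lift raw speedup, while advantage normalises against the matching expert patch and gives a more honest read. For example, Terminus 2 with Claude 4.0 Sonnet has the highest agent speedup (1.0975x) but only the fourth-best advantage (-0.0422), whereas OpenHands with Claude 4.0 Sonnet has the best advantage (-0.0096) but only the fifth-best speedup (1.0553x).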
| Agent | Model | Advantage | Speedup (geomean) |
|---|---|---|---|
| OpenHands | GPT-5 | -0.0186 | 1.0848x |
| OpenHands | Qwen 3 Coder | -0.0299 | 1.0348x |
| Terminus 2 | GPT-5 | -0.0490 | 1.0586x |
| OpenHands | Claude 4.0 Sonnet | -0.0096 | 1.0553x |
| Terminus 2 | Qwen 3 Coder | -0.0454 | 1.0677x |
| Terminus 2 | Gemini 2.5 Pro | -0.0847 | 1.0549x |
| Terminus 2 | Claude 4.0 Sonnet | -0.0422 | 1.0975x |
| Human Expert | (oracle) | +0.0000 | 1.1193x |
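
To make the disagreement between the two metrics concrete, the following minimal sketch ranks the seven agent rows once by advantage and once by geomean speedup. The numbers are copied from the table. The names `ROWS` and `ranking` are illustrative helpers introduced here, not part of any benchmark tooling, and the sketch only reorders reported values; it does not recompute either metric.

```python
# Rows copied from the table above: (agent, model, advantage, geomean speedup).
# The Human Expert row is left out because it is the reference point.
ROWS = [
    ("OpenHands", "GPT-5", -0.0186, 1.0848),
    ("OpenHands", "Qwen 3 Coder", -0.0299, 1.0348),
    ("Terminus 2", "GPT-5", -0.0490, 1.0586),
    ("OpenHands", "Claude 4.0 Sonnet", -0.0096, 1.0553),
    ("Terminus 2", "Qwen 3 Coder", -0.0454, 1.0677),
    ("Terminus 2", "Gemini 2.5 Pro", -0.0847, 1.0549),
    ("Terminus 2", "Claude 4.0 Sonnet", -0.0422, 1.0975),
]


def ranking(rows, column):
    """Return 'agent + model' labels sorted best-first by the given column index."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [f"{agent} + {model}" for agent, model, *_ in ordered]


# Higher is better for both metrics, so both rankings sort in descending order.
by_advantage = ranking(ROWS, 2)
by_speedup = ranking(ROWS, 3)

# Print each configuration with its 1-based rank under both metrics.
for label in by_advantage:
    adv_rank = by_advantage.index(label) + 1
    spd_rank = by_speedup.index(label) + 1
    print(f"{label:32s} advantage rank {adv_rank}, speedup rank {spd_rank}")
```

Only one configuration (OpenHands + GPT-5, second on both) keeps the same position under the two orderings, which is why the choice of metric matters when comparing agents.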