Benchmark Results
Final (or in-progress) score for a benchmark run — headline percentage, per-task pass/fail with verdict + duration, the underperformers, and an inline tracking log. Everything for one run, by id.
GET
The rich, single-call view of a benchmark run. Unlike status (a lightweight progress poll), this returns the aggregated metric as a percentage plus a per-task breakdown with traces. It works mid-run (partial, reflecting progress) and after completion.
Authentication
All requests require a BLACKBOX API key as a Bearer token. A run is only readable by the user that created it. See Authentication.
Headers
API Key of the form Bearer <api_key>.
Path Parameters
The id returned by Run Benchmarks.
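As a rough illustration of the authentication rule above, the sketch below builds the GET request with Python's standard library. The endpoint path is not shown on this page, so the caller passes the full results URL for the run; the URL in the usage note is a placeholder, not a real address. The header name `Authorization` is an assumption based on the usual Bearer-token convention, and `build_results_request` and `fetch_results` are hypothetical helper names. Parsing the body as JSON is likewise an assumption.

```python
import json
import urllib.request


def build_results_request(endpoint_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one benchmark run.

    endpoint_url: the full results URL for the run, including the run id.
                  The path is not given on this page, so it is not guessed here.
    api_key:      the BLACKBOX API key, sent as a Bearer token.
    """
    req = urllib.request.Request(endpoint_url, method="GET")
    # Assumption: the key travels in the standard Authorization header.
    req.add_header("Authorization", f"Bearer {api_key}")
    return req


def fetch_results(endpoint_url: str, api_key: str) -> dict:
    """Send the request and decode the body, assuming a JSON response."""
    req = build_results_request(endpoint_url, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would look like `fetch_results("https://api.example.invalid/<results-path>/<run-id>", api_key)`, with the placeholder replaced by the real endpoint. Since a run is only readable by the user that created it, a key belonging to a different user should be expected to fail.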
Response
The headline score: resolved / total × 100 (2 decimal places). Prefers the recorded final rate; falls back to live counters mid-run.
Fraction 0..1 recorded on completion (null until then).
What the percentage measures, e.g. accuracy (QA) or resolved_rate (SWE).
The runtime that actually ran: "claude", "codex", or "grok".
{ resolved, total, completed, failed } counters.
Per-status tally: { resolved, unresolved, errored, pending, running }.
Latency rollup: { totalDurationMs, avgTaskMs, slowestTaskMs, fastestTaskMs }, your “where did it lag” signal.
One object per instance, with the trace fields:
The failed/errored tasks only, slowest first, each with instanceId, status, durationMs, error, and grade. The “what did it miss / where did it lag” view.
Inline log for tracking, so you can follow a run without a second call:
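The headline-score rule described above (prefer the recorded final rate, fall back to live counters mid-run) can be sketched as follows. This is only a client-side illustration of the arithmetic, not the service's implementation. The function and parameter names are hypothetical, and returning 0.0 when no tasks exist yet is an assumption.

```python
from typing import Optional


def headline_score(recorded_rate: Optional[float], resolved: int, total: int) -> float:
    """Return the score as a percentage rounded to 2 decimal places.

    recorded_rate: fraction in 0..1 recorded on completion, or None mid-run.
    resolved/total: live counters used as the fallback while the run is in progress.
    """
    if recorded_rate is not None:
        # Completed run: the recorded final rate takes precedence.
        return round(recorded_rate * 100, 2)
    if total == 0:
        # Assumption: nothing scheduled yet is reported as 0.0.
        return 0.0
    # Mid-run: resolved / total x 100 from the live counters.
    return round(resolved / total * 100, 2)
```

For example, a run with 7 of 12 tasks resolved and no recorded rate yet gives `headline_score(None, 7, 12)`, which evaluates to 58.33. Because the mid-run value is computed from live counters, it can differ from the final recorded rate.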
What you can answer by id
| Question | Field |
|---|---|
| Which tests failed? | tasks[].status / reward, underperformed[] |
| Where did it lag? | tasks[].durationMs, timing |
| What did it answer vs gold? | tasks[].grade |
| Which sandbox ran it? | tasks[].sandboxId |
| How do I track progress live? | tracking.recentLogs (+ streamUrl) |
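
To make the table concrete, here is a minimal sketch that answers "which tests failed" and "where did it lag" from a list of task objects. It uses only the field names that appear on this page (`instanceId`, `status`, `durationMs`) and recomputes values the response already provides in `underperformed[]` and `timing`, purely to show how they relate to `tasks[]`. Treating `unresolved` and `errored` as the failing statuses is an assumption, and the helper names are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: these two statuses are what "failed/errored" refers to.
FAILED_STATUSES = {"unresolved", "errored"}


def underperformers(tasks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Failed or errored tasks only, slowest first."""
    failed = [t for t in tasks if t.get("status") in FAILED_STATUSES]
    return sorted(failed, key=lambda t: t.get("durationMs") or 0, reverse=True)


def timing_rollup(tasks: List[Dict[str, Any]]) -> Dict[str, float]:
    """Latency rollup over tasks that have a recorded duration."""
    durations = [t["durationMs"] for t in tasks if t.get("durationMs") is not None]
    if not durations:
        return {"totalDurationMs": 0, "avgTaskMs": 0,
                "slowestTaskMs": 0, "fastestTaskMs": 0}
    return {
        "totalDurationMs": sum(durations),
        "avgTaskMs": sum(durations) / len(durations),
        "slowestTaskMs": max(durations),
        "fastestTaskMs": min(durations),
    }
```

In practice, reading `underperformed[]` and `timing` directly from the response is simpler. The sketch is mainly useful for applying a different filter, such as only errored tasks or only tasks above a duration threshold.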