Run Benchmarks
Launch a benchmark evaluation (SWE-bench, GAIA, AIME, GPQA, …) driven by the agent of your choice. Fire-and-forget: returns a benchmarkRunId immediately.
POST
This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The run is fire-and-forget: it returns a benchmarkRunId immediately and executes in the background on the managed sandbox. Like tasks, a benchmark can run on either the Claude or Codex agent runtime, selected from model or forced with agent.
Authentication
All requests require a BLACKBOX API key as a Bearer token (Pro plan required for POST). See Authentication.
Headers
Authorization: API key of the form Bearer <api_key>.
Content-Type: Must be application/json.
Request Body
The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.
Model that drives the agent. The id also selects the agent runtime: Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK. See Agent Runtimes.
Explicit agent-runtime override: "claude" or "codex". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
Optional extra instruction appended to every task’s auto-generated prompt, acting as a global steer across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it’s added after it. Omit for the standard benchmark instruction.
Maximum concurrent tasks within the run. Range 1–16.
Number of dataset instances to evaluate. Range 1–500.
Per-task agent timeout in seconds. Defaults to the benchmark’s own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).
Execution backend: "docker-in-parent" or "sandbox-per-task".
Response
Unique id for the run (the benchmarkRunId). Use it to poll status, list tasks, or stream logs.
Initial status: "queued".
Resolved canonical benchmark name.
The model driving the agent (or null for the default).
The resolved runtime, "claude" or "codex", after applying the override or model-based inference.
The extra instruction appended to each task (or null if none was provided).
Resolved concurrency.
Resolved instance count.
Resolved per-task timeout (seconds).
Resolved execution backend.
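The resolved values in the response follow from the request-body rules described above. As a rough client-side sketch of those rules (not the service's actual implementation), the Python below checks the documented ranges and mirrors the runtime selection. The function and parameter names are hypothetical, and the substring match used to recognize OpenAI/Codex model ids is an assumption; the authoritative mapping is on the Agent Runtimes page.

```python
from typing import List, Optional

# Values copied from the request-body descriptions on this page.
SUPPORTED_BENCHMARKS = frozenset({
    "swe-bench", "swe-bench-lite", "swe-bench-multilingual", "swe-gym",
    "swe-bench-multimodal", "hle", "gaia", "aime", "gpqa", "gpqa-main",
    "omni-math", "mmlu-pro", "simpleqa",
})
SUPPORTED_AGENTS = ("claude", "codex")
SUPPORTED_BACKENDS = ("docker-in-parent", "sandbox-per-task")


def resolve_runtime(model: Optional[str] = None, agent: Optional[str] = None) -> str:
    """Mirror the documented order: explicit agent wins, then model, then claude."""
    if agent is not None:
        if agent not in SUPPORTED_AGENTS:
            raise ValueError(f"agent must be one of {SUPPORTED_AGENTS}")
        return agent
    # ASSUMPTION: OpenAI/Codex ids are recognized by substring; the real
    # service may use a lookup table instead.
    if model is not None and ("openai" in model.lower() or "codex" in model.lower()):
        return "codex"
    return "claude"  # documented default


def check_run_options(
    benchmark: str,
    concurrency: Optional[int] = None,  # hypothetical name for "maximum concurrent tasks"
    instances: Optional[int] = None,    # hypothetical name for "number of dataset instances"
    backend: Optional[str] = None,      # hypothetical name for "execution backend"
) -> List[str]:
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if benchmark not in SUPPORTED_BENCHMARKS:
        problems.append(f"unknown benchmark: {benchmark!r}")
    if concurrency is not None and not 1 <= concurrency <= 16:
        problems.append("concurrency must be in the range 1-16")
    if instances is not None and not 1 <= instances <= 500:
        problems.append("instance count must be in the range 1-500")
    if backend is not None and backend not in SUPPORTED_BACKENDS:
        problems.append(f"backend must be one of {SUPPORTED_BACKENDS}")
    return problems
```

Checking locally like this only saves a round trip; the server remains the source of truth and still returns 400 for values it rejects.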
Listing runs
GET /api/v1/benchmarks/runs returns the authenticated user’s benchmark runs (optionally filtered by ?status=). Individual run sub-resources — status, tasks, logs, logs/stream, and cancel — live under /api/v1/benchmarks/runs/{benchmarkRunId}/….
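To illustrate the URL layout described above, here is a small Python sketch that builds the list path and the per-run sub-resource paths. It assumes each sub-resource is addressed by appending its name to the run's path; the helper names are hypothetical, and the host and the HTTP method for each sub-resource are not specified on this page, so neither appears here.

```python
from typing import Optional
from urllib.parse import quote

RUNS_PATH = "/api/v1/benchmarks/runs"
# Sub-resource names as listed in this section.
SUB_RESOURCES = ("status", "tasks", "logs", "logs/stream", "cancel")


def runs_list_path(status: Optional[str] = None) -> str:
    """Path for listing runs, optionally filtered by ?status=."""
    if status is None:
        return RUNS_PATH
    return f"{RUNS_PATH}?status={quote(status, safe='')}"


def run_subresource_path(benchmark_run_id: str, sub_resource: str) -> str:
    """Path for one sub-resource of a run (ASSUMPTION: name appended as-is)."""
    if sub_resource not in SUB_RESOURCES:
        raise ValueError(f"sub_resource must be one of {SUB_RESOURCES}")
    # Percent-encode the id so unexpected characters cannot alter the path.
    return f"{RUNS_PATH}/{quote(benchmark_run_id, safe='')}/{sub_resource}"
```

For example, `run_subresource_path("abc123", "logs/stream")` yields `/api/v1/benchmarks/runs/abc123/logs/stream`, where `abc123` stands in for a real benchmarkRunId.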
Error Codes
| Status | Description |
|---|---|
| 200 | Run queued |
| 400 | Missing/invalid benchmark, invalid agent, or bad JSON |
| 401 | Invalid or missing API key |
| 403 | Pro subscription required |
| 429 | Too many concurrent benchmark runs |
| 500 | Failed to launch the run |
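One way a client might act on the table above is to turn each non-200 status into an exception carrying the documented description. The Python sketch below does that; the exception class and function name are hypothetical, and treating only 429 as retryable is an assumption on my part rather than documented behavior.

```python
# Descriptions copied from the status table above.
ERROR_DESCRIPTIONS = {
    400: "Missing/invalid benchmark, invalid agent, or bad JSON",
    401: "Invalid or missing API key",
    403: "Pro subscription required",
    429: "Too many concurrent benchmark runs",
    500: "Failed to launch the run",
}

# ASSUMPTION: 429 clears once earlier runs finish, so a later retry may succeed.
RETRYABLE_STATUSES = frozenset({429})


class BenchmarkLaunchError(Exception):
    """Raised when the launch request returns a non-200 status."""

    def __init__(self, status: int, description: str) -> None:
        super().__init__(f"{status}: {description}")
        self.status = status
        self.description = description
        self.retryable = status in RETRYABLE_STATUSES


def check_launch_status(status: int) -> None:
    """Return silently on 200 (run queued); raise BenchmarkLaunchError otherwise."""
    if status == 200:
        return
    description = ERROR_DESCRIPTIONS.get(status, "Unexpected status")
    raise BenchmarkLaunchError(status, description)
```

A caller can inspect `retryable` on the exception to decide whether to wait and resubmit or to surface the error immediately.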
Agent Runtimes
How model / agent choose Claude vs Codex.
Models
Model ids and their runtime mapping.