POST /api/v1/benchmarks/runs
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/anthropic/claude-sonnet-4.5",
    "limit": 20,
    "nConcurrent": 4
  }'
{
  "benchmarkRunId": "b1c2d3e4-f5a6-7890-bcde-f12345678901",
  "status": "queued",
  "benchmark": "swe-bench-lite",
  "dataset": "princeton-nlp/SWE-bench_Lite",
  "model": "blackboxai/anthropic/claude-sonnet-4.5",
  "agent": "claude",
  "prompt": null,
  "nConcurrent": 4,
  "limit": 20,
  "timeout": 1800,
  "env": "sandbox-per-task"
}
This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The run is fire-and-forget — it returns a benchmarkRunId immediately and executes in the background on the managed sandbox. Like tasks, a benchmark can run on either the Claude or Codex agent runtime, selected from model or forced with agent.
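
As a minimal sketch of the fire-and-forget call, the Python below builds the same POST request as the curl example and reads benchmarkRunId from the response. The helper names build_run_request and start_run are hypothetical, and the sketch assumes the response body is the JSON object shown in the example, with no retries or error handling.

```python
import json
import urllib.request

API_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_run_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a benchmark run."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_run(api_key: str, payload: dict) -> str:
    """Send the request and return the benchmarkRunId from the JSON response."""
    # The call returns as soon as the run is queued; the run itself
    # continues in the background on the service side.
    with urllib.request.urlopen(build_run_request(api_key, payload)) as resp:
        return json.load(resp)["benchmarkRunId"]
```

Splitting request construction from sending keeps the payload and headers easy to inspect before anything goes over the network.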

Authentication

All requests require a BLACKBOX API key as a Bearer token (Pro plan required for POST). See Authentication.

Headers

Authorization
string
required
API Key of the form Bearer <api_key>.
Content-Type
string
required
Must be application/json.

Request Body

benchmark
string
required
The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.
model
string
Model that drives the agent. The id also selects the agent runtime — Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK. See Agent Runtimes.
agent
string
Explicit agent-runtime override — "claude" or "codex". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
prompt
string
Optional extra instruction appended to every task’s auto-generated prompt — a global steer applied across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it’s added after it. Omit for the standard benchmark instruction.
nConcurrent
number
default: 4
Maximum concurrent tasks within the run. Range 1 to 16.
limit
number
default: 10
Number of dataset instances to evaluate. Range 1 to 500.
timeout
number
Per-task agent timeout in seconds. Defaults to the benchmark’s own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).
env
string
Execution backend — "docker-in-parent" or "sandbox-per-task".
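
The constraints above can be checked client-side before sending, which avoids a round trip that would end in a 400. The sketch below is illustrative only: validate_run_payload is a hypothetical helper, the ranges are assumed to read as 1 to 16 for nConcurrent and 1 to 500 for limit, and the server remains the authority on what it accepts.

```python
SUPPORTED_BENCHMARKS = {
    "swe-bench", "swe-bench-lite", "swe-bench-multilingual", "swe-gym",
    "swe-bench-multimodal", "hle", "gaia", "aime", "gpqa", "gpqa-main",
    "omni-math", "mmlu-pro", "simpleqa",
}
SUPPORTED_AGENTS = {"claude", "codex"}
SUPPORTED_ENVS = {"docker-in-parent", "sandbox-per-task"}
# Assumed inclusive ranges for the numeric fields.
INT_RANGES = {"nConcurrent": (1, 16), "limit": (1, 500)}


def validate_run_payload(payload: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    # benchmark is the only required field.
    if payload.get("benchmark") not in SUPPORTED_BENCHMARKS:
        problems.append(f"benchmark must be one of {sorted(SUPPORTED_BENCHMARKS)}")
    agent = payload.get("agent")
    if agent is not None and agent not in SUPPORTED_AGENTS:
        problems.append(f"agent must be one of {sorted(SUPPORTED_AGENTS)}")
    env = payload.get("env")
    if env is not None and env not in SUPPORTED_ENVS:
        problems.append(f"env must be one of {sorted(SUPPORTED_ENVS)}")
    for field, (low, high) in INT_RANGES.items():
        value = payload.get(field)
        if value is None:
            continue  # optional; the server applies its default
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and low <= value <= high):
            problems.append(f"{field} must be an integer from {low} to {high}")
    return problems
```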

Response

benchmarkRunId
string
Unique id for the run. Use it to poll status, list tasks, or stream logs.
status
string
Initial status — "queued".
benchmark
string
Resolved canonical benchmark name.
model
string | null
The model driving the agent (or null when the default model is used).
agent
string
The resolved runtime — "claude" or "codex" — after applying the override or model-based inference.
prompt
string | null
The extra instruction appended to each task (or null if none was provided).
nConcurrent
number
Resolved concurrency.
limit
number
Resolved instance count.
timeout
number
Resolved per-task timeout (seconds).
env
string
Resolved execution backend.
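
The agent value in the response reflects a precedence order: an explicit agent override wins, otherwise the runtime is inferred from model, otherwise the default is claude. The sketch below illustrates only that ordering. The substring check used for inference is an assumption made for illustration; the service's actual mapping is described under Agent Runtimes, and resolve_runtime is a hypothetical name.

```python
from typing import Optional

SUPPORTED_AGENTS = ("claude", "codex")


def resolve_runtime(model: Optional[str] = None, agent: Optional[str] = None) -> str:
    """Illustrate override > model inference > default precedence."""
    # 1. An explicit override always wins (invalid values are rejected).
    if agent is not None:
        if agent not in SUPPORTED_AGENTS:
            raise ValueError(f"agent must be one of {SUPPORTED_AGENTS}")
        return agent
    # 2. ASSUMPTION: treat ids mentioning "openai" or "codex" as Codex ids.
    #    The real inference rule may differ.
    if model is not None and any(tag in model.lower() for tag in ("openai", "codex")):
        return "codex"
    # 3. Everything else falls back to the documented default.
    return "claude"
```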

Listing runs

GET /api/v1/benchmarks/runs returns the authenticated user’s benchmark runs (optionally filtered by ?status=). Individual run sub-resources — status, tasks, logs, logs/stream, and cancel — live under /api/v1/benchmarks/runs/{benchmarkRunId}/….
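
The URL layout described above can be sketched with two small helpers. They only build URLs: the host is assumed to be the one used in the curl example, the helper names are hypothetical, and the HTTP method and response shape of each sub-resource are not covered here.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"
SUB_RESOURCES = {"status", "tasks", "logs", "logs/stream", "cancel"}


def list_runs_url(status: Optional[str] = None) -> str:
    """URL for listing runs, optionally filtered by ?status=."""
    if status is None:
        return BASE_URL
    return f"{BASE_URL}?{urlencode({'status': status})}"


def run_resource_url(benchmark_run_id: str, resource: str) -> str:
    """URL for one sub-resource of a specific run."""
    if resource not in SUB_RESOURCES:
        raise ValueError(f"resource must be one of {sorted(SUB_RESOURCES)}")
    # Percent-encode the id so it is always a single path segment.
    return f"{BASE_URL}/{quote(benchmark_run_id, safe='')}/{resource}"
```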

Error Codes

Status  Description
200     Run queued
400     Missing/invalid benchmark, invalid agent, or bad JSON
401     Invalid or missing API key
403     Pro subscription required
429     Too many concurrent benchmark runs
500     Failed to launch the run
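
One way a client might react to these codes is sketched below. The status meanings come from the table; the suggested actions, in particular retrying after 429 or 500, are assumptions about reasonable client behavior rather than documented guidance, and the names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"                  # run was queued
    FIX_REQUEST = "fix_request"          # correct the body and resend
    FIX_CREDENTIALS = "fix_credentials"  # check the API key
    UPGRADE_PLAN = "upgrade_plan"        # Pro subscription required
    RETRY_LATER = "retry_later"          # assumed safe to retry after a delay
    UNKNOWN = "unknown"                  # status not listed in the table


STATUS_ACTIONS = {
    200: Action.PROCEED,
    400: Action.FIX_REQUEST,
    401: Action.FIX_CREDENTIALS,
    403: Action.UPGRADE_PLAN,
    429: Action.RETRY_LATER,  # wait for running benchmark runs to finish
    500: Action.RETRY_LATER,  # ASSUMPTION: launch failures may be transient
}


def action_for_status(status: int) -> Action:
    """Map an HTTP status from POST /api/v1/benchmarks/runs to a client action."""
    return STATUS_ACTIONS.get(status, Action.UNKNOWN)
```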

Agent Runtimes

How model / agent choose Claude vs Codex.

Models

Model ids and their runtime mapping.