Run Benchmarks

POST https://agent.blackbox.ai/api/v1/benchmarks/runs

Launch a benchmark evaluation (SWE-bench, GAIA, AIME, GPQA, …) driven by the agent of your choice. Fire-and-forget: returns a benchmarkRunId immediately.

This endpoint starts a benchmark run that evaluates an agent across a dataset of tasks. The call returns a benchmarkRunId immediately, and the run then executes in the background on the managed sandbox. Like tasks, a benchmark can run on the Claude, Codex, or Grok Build agent runtime, which is inferred from model or forced with agent.

After launching, read the score and per-task traces with Benchmark Results, and follow progress with Benchmark Logs & Status.
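The launch call can also be issued from code. The following is a minimal Python sketch using only the standard library; the helper names build_run_request and launch_run are hypothetical and not part of any BLACKBOX SDK. The URL, headers, and body fields are the ones documented on this page.

```python
import json
import urllib.request

RUNS_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_run_request(api_key: str, benchmark: str, **options) -> urllib.request.Request:
    """Build (but do not send) the POST request that launches a benchmark run.

    `options` carries the optional body fields described under Request Body,
    for example model, agent, limit, nConcurrent, timeout, or env.
    """
    payload = {"benchmark": benchmark, **options}
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def launch_run(api_key: str, benchmark: str, **options) -> dict:
    """Send the request and return the decoded JSON body (requires network access)."""
    request = build_run_request(api_key, benchmark, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling launch_run("YOUR_API_KEY", "swe-bench-lite", limit=20, nConcurrent=4) would send the same body as the first curl request example; the returned dictionary should contain benchmarkRunId.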

Authentication

All requests require a BLACKBOX API key as a Bearer token (Pro plan required for POST). See Authentication.

Headers

Authorization string required

API key in the form Bearer <api_key>.

Content-Type string required

Must be application/json.

Request Body

benchmark string required

The benchmark to run. One of: swe-bench, swe-bench-lite, swe-bench-multilingual, swe-gym, swe-bench-multimodal, hle, gaia, aime, gpqa, gpqa-main, omni-math, mmlu-pro, simpleqa. An unknown value returns 400 listing the supported names.

model string

Model that drives the agent. The id also selects the agent runtime — Anthropic/Claude ids run the Claude Agent SDK; OpenAI/Codex ids (e.g. blackboxai/openai/gpt-5.3-codex) run the Codex SDK; xAI Grok Build ids run the Grok CLI. See Agent Runtimes.

For the Grok Build runtime, pass the model's full router id — blackboxai/x-ai/grok-build-0.1 — not the bare x-ai/grok-build-0.1. The Grok CLI validates -m against the router's model list and rejects an unprefixed id with "unknown model id".
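The prefix rule can be applied client-side before the request is sent. This is a minimal Python sketch; with_router_prefix is a hypothetical helper, and the only fact it encodes is the blackboxai/ prefix shown above.

```python
ROUTER_PREFIX = "blackboxai/"


def with_router_prefix(model_id: str) -> str:
    """Return the full router id, adding the 'blackboxai/' prefix when it is missing."""
    if model_id.startswith(ROUTER_PREFIX):
        return model_id
    return ROUTER_PREFIX + model_id
```

The helper is only meant for ids served by the platform router. With a bring-your-own router, model is passed through verbatim, so the id should be left untouched.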

agent string

Explicit agent-runtime override — "claude", "codex", or "grok". Overrides the runtime inferred from model. Omit to auto-select (default: claude). An invalid value returns 400 listing the supported agents.
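The precedence described here (explicit override first, then the runtime inferred from model, then the claude default) can be sketched as follows. The function resolve_agent is hypothetical, and the inferred runtime is taken as an input because the exact model-to-runtime mapping is documented in Agent Runtimes, not here.

```python
from typing import Optional

SUPPORTED_AGENTS = ("claude", "codex", "grok")
DEFAULT_AGENT = "claude"


def resolve_agent(override: Optional[str], inferred: Optional[str]) -> str:
    """Pick the runtime: explicit override, else model-based inference, else the default."""
    if override is not None:
        if override not in SUPPORTED_AGENTS:
            # Mirrors the documented 400 for an invalid agent value.
            raise ValueError(
                f"Invalid agent {override!r}. Supported: {', '.join(SUPPORTED_AGENTS)}"
            )
        return override
    return inferred or DEFAULT_AGENT
```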

apiKey string

Bring-your-own router — your OpenAI-compatible router key (bearer token). Must be paired with baseUrl. When supplied, the agent runtime is pointed at your endpoint instead of the platform router, and model is passed through verbatim (no allowlist check). The key is used in-memory for the run only and is never persisted. See Bring-your-own router.

baseUrl string

Bring-your-own router — your router base URL (e.g. https://my-router.example.com). Must be an http(s) URL and paired with apiKey.

prompt string

Optional extra instruction appended to every task's auto-generated prompt — a global steer applied across the whole run (e.g. "Prefer minimal diffs; add a regression test"). It does not replace the dataset-generated task instruction; it's added after it. Omit for the standard benchmark instruction.

nConcurrent number default: 16

Maximum concurrent tasks within the run. Range 1–16 (defaults to 16). What this concurrency uses depends on env — see Concurrency & execution backends.

limit number default: 10

Number of dataset instances to evaluate. Minimum 1; there is no fixed maximum — it's clamped only to the benchmark's own dataset size (totalInstances, e.g. 12,032 for MMLU-Pro, 60 for AIME). Larger runs take proportionally longer and cost more.
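The clamping rule amounts to taking the smaller of the requested limit and the dataset size. This is a minimal Python sketch; resolve_limit is a hypothetical helper, and rejecting values below 1 is an assumption, since the page does not say how the server treats them.

```python
from typing import Optional

DEFAULT_LIMIT = 10


def resolve_limit(requested: Optional[int], total_instances: int) -> int:
    """Return the instance count a run would evaluate for a given dataset size."""
    limit = DEFAULT_LIMIT if requested is None else requested
    if limit < 1:
        # Assumption: values below the documented minimum are treated as invalid.
        raise ValueError("limit must be at least 1")
    # No fixed maximum: the only ceiling is the benchmark's own dataset size.
    return min(limit, total_instances)
```

For example, with the AIME size quoted above (60 instances), a requested limit of 90 would resolve to 60.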

timeout number

Per-task agent timeout in seconds. Defaults to the benchmark's own default (e.g. 1800 for SWE-bench, 900 for AIME/MMLU-Pro/SimpleQA).

env string default: sandbox-per-task

Execution backend — how each task's environment is provisioned. See Concurrency & execution backends.

"sandbox-per-task" (default) — each concurrent task runs in its own isolated sandbox (restored from the prepared snapshot). Strong isolation; concurrency is bounded by your account's concurrent-sandbox quota.
"docker-in-parent" — all tasks run as concurrent Docker containers inside a single sandbox. No per-task sandboxes, so concurrency is bounded by that one VM's CPU / RAM instead of the sandbox quota. Cheaper for small runs.

Response

benchmarkRunId string

Unique id for the run. Use it to poll status, list tasks, or stream logs.

status string

Initial status — "queued".

benchmark string

Resolved canonical benchmark name.

model string

The model driving the agent (or null for the default).

agent string

The resolved runtime that will actually run — "claude", "codex", or "grok" — after applying the override or model-based inference. This is also stored on the run, so results report the true agent.

byo object | null

Echo of bring-your-own router usage (or null). The apiKey is masked: { "baseUrl": "...", "apiKey": "***", "model": "..." }.

prompt string | null

The extra instruction appended to each task (or null if none was provided).

nConcurrent number

Resolved concurrency.

limit number

Resolved instance count.

timeout number

Resolved per-task timeout (seconds).

env string

Resolved execution backend.

Request Example

# Claude runtime, selected from the Anthropic model id
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/anthropic/claude-sonnet-4.5",
    "limit": 20,
    "nConcurrent": 4
  }'

# Codex runtime, selected from the OpenAI Codex model id
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/openai/gpt-5.3-codex",
    "limit": 20
  }'

# Force the Codex runtime with agent
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "gpqa",
    "model": "blackboxai/openai/gpt-5.5",
    "agent": "codex",
    "limit": 50
  }'

# Append an extra instruction to every task with prompt
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "swe-bench-lite",
    "model": "blackboxai/openai/gpt-5.3-codex",
    "agent": "codex",
    "prompt": "Prefer minimal diffs; add a regression test.",
    "limit": 20
  }'

# Grok Build runtime; model uses the full router id
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "aime",
    "model": "blackboxai/x-ai/grok-build-0.1",
    "agent": "grok",
    "limit": 10,
    "nConcurrent": 5
  }'

# Bring-your-own router: apiKey and baseUrl are supplied as a pair
curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "aime",
    "limit": 10,
    "baseUrl": "https://my-router.example.com",
    "apiKey": "sk-my-router-key",
    "model": "my-org/my-model"
  }'

Response Example

{
  "benchmarkRunId": "b1c2d3e4-f5a6-7890-bcde-f12345678901",
  "status": "queued",
  "benchmark": "swe-bench-lite",
  "dataset": "princeton-nlp/SWE-bench_Lite",
  "model": "blackboxai/openai/gpt-5.3-codex",
  "agent": "codex",
  "prompt": "Prefer minimal diffs; add a regression test.",
  "nConcurrent": 4,
  "limit": 20,
  "timeout": 1800,
  "env": "sandbox-per-task"
}

{
  "error": "Unknown benchmark \"swebenchx\". Supported: swe-bench, swe-bench-lite, ..."
}
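The two bodies above (a queued run and an error) can be told apart on the client by the HTTP status and the presence of an error field. This is a minimal Python sketch; parse_launch_response and BenchmarkLaunchError are hypothetical names, and the assumption that every failure body carries an error string is based only on the example above.

```python
import json


class BenchmarkLaunchError(Exception):
    """Raised when the launch call does not return a queued run."""


def parse_launch_response(status_code: int, body: str) -> dict:
    """Return the decoded run description, or raise with the server's error message."""
    data = json.loads(body)
    if status_code != 200 or "error" in data:
        # Assumption: failure bodies look like {"error": "..."} as in the example above.
        message = data.get("error", "unknown error")
        raise BenchmarkLaunchError(f"HTTP {status_code}: {message}")
    return data
```

On success, the returned dictionary's benchmarkRunId is the value to keep for polling status or reading results.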

Bring-your-own router

Supply apiKey + baseUrl (always as a pair) to run the benchmark against your own OpenAI-compatible endpoint and key instead of the platform router. When BYO creds are present:

the agent runtime (claude/codex/grok) is pointed at your baseUrl with your apiKey;
model is passed through verbatim — no allowlist check — so you can evaluate self-hosted or third-party models;
the key is used in-memory for the run only and is never stored;
the response echoes a masked byo block (apiKey shown as ***).

The same apiKey / baseUrl fields are also accepted by Create Task to drive the interactive agent against your endpoint.
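The pairing and URL rules for BYO credentials can be checked before sending a request. This is a minimal Python sketch; validate_byo is a hypothetical helper that mirrors the documented constraints (apiKey and baseUrl always together, baseUrl an http(s) URL) and does not reproduce the server's own validation.

```python
from typing import Optional
from urllib.parse import urlparse


def validate_byo(api_key: Optional[str], base_url: Optional[str]) -> bool:
    """Return True when BYO router credentials are in use, False when neither is set."""
    if api_key is None and base_url is None:
        return False
    if api_key is None or base_url is None:
        raise ValueError("apiKey and baseUrl must be supplied together")
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("baseUrl must be an http(s) URL")
    return True
```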

Concurrency & execution backends

nConcurrent sets how many tasks run at once; env decides what that concurrency consumes.

	`sandbox-per-task` (default)	`docker-in-parent`
Concurrency unit	One isolated sandbox per task (restored from the prepared snapshot)	One Docker container per task, all inside a single sandbox
Bounded by	Your account's concurrent-sandbox quota (+ model rate limits)	That one VM's CPU / RAM / Docker (+ model rate limits)
Isolation	Strong — separate VMs	Shared VM
Sandbox count	`nConcurrent` sandboxes	1 sandbox total
Best for	Heavy/long tasks (e.g. SWE-bench), strong isolation	Small/cheap runs, or high concurrency without using sandbox quota

To run tasks concurrently inside a single sandbox (rather than one sandbox each), use env: "docker-in-parent":

curl -X POST 'https://agent.blackbox.ai/api/v1/benchmarks/runs' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "benchmark": "aime",
    "model": "blackboxai/x-ai/grok-build-0.1",
    "agent": "grok",
    "limit": 90,
    "nConcurrent": 16,
    "env": "docker-in-parent"
  }'

nConcurrent is capped at 16. In sandbox-per-task the practical limit is your Vercel concurrent-sandbox quota; in docker-in-parent it is the single VM's resources, because too many parallel containers and agent processes will saturate CPU and RAM. Raise the cap only alongside the matching infra headroom.
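The sandbox-count row of the table above reduces to a small rule, which can help when budgeting against a sandbox quota. This is a minimal Python sketch; sandboxes_used is a hypothetical helper that encodes only the table and the 1 to 16 range.

```python
MAX_CONCURRENT = 16


def sandboxes_used(env: str, n_concurrent: int = MAX_CONCURRENT) -> int:
    """Return how many sandboxes a run occupies at peak for the given backend."""
    if not 1 <= n_concurrent <= MAX_CONCURRENT:
        raise ValueError(f"nConcurrent must be between 1 and {MAX_CONCURRENT}")
    if env == "sandbox-per-task":
        # One isolated sandbox per concurrent task.
        return n_concurrent
    if env == "docker-in-parent":
        # All tasks run as containers inside a single sandbox.
        return 1
    raise ValueError(f"Unknown env {env!r}")
```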

Listing runs & sub-resources

GET /api/v1/benchmarks/runs returns the authenticated user's benchmark runs (optionally filtered by ?status=). Per-run sub-resources live under /api/v1/benchmarks/runs/{benchmarkRunId}/…:

Sub-resource	Purpose
`GET …/results`	Score + percentage + per-task traces + tracking log — see Benchmark Results
`GET …/status`	Lightweight progress poll
`GET …/tasks`	Per-instance task results
`GET …/logs`	Log snapshot (live or persisted)
`GET …/logs/stream`	SSE log stream (live, or DB replay after the run ends)
`POST …/cancel`	Cancel an in-flight run

See Benchmark Logs & Status for the status/logs/stream details.
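The sub-resource paths in the table can be assembled from a run id with a small lookup. This is a minimal Python sketch; sub_resource and the dictionary keys are hypothetical, and the host is assumed to be the same one used by the POST endpoint.

```python
from typing import Tuple
from urllib.parse import quote

# Assumption: sub-resources are served from the same host as the launch endpoint.
RUNS_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"

# name -> (HTTP method, path suffix), taken from the table above
SUB_RESOURCES = {
    "results": ("GET", "results"),
    "status": ("GET", "status"),
    "tasks": ("GET", "tasks"),
    "logs": ("GET", "logs"),
    "logs_stream": ("GET", "logs/stream"),
    "cancel": ("POST", "cancel"),
}


def sub_resource(benchmark_run_id: str, name: str) -> Tuple[str, str]:
    """Return (method, url) for one per-run sub-resource."""
    method, suffix = SUB_RESOURCES[name]
    return method, f"{RUNS_URL}/{quote(benchmark_run_id, safe='')}/{suffix}"
```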

Error Codes

Status	Description
200	Run queued
400	Missing/invalid `benchmark`, invalid `agent`, or bad JSON
401	Invalid or missing API key
403	Pro subscription required
429	Too many concurrent benchmark runs
500	Failed to launch the run
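A client can map these statuses to messages and decide whether to try again later. This is a minimal Python sketch; describe_status and is_retryable are hypothetical helpers, and treating 429 and 500 as retryable is an assumption, since the page does not define a retry policy.

```python
STATUS_MEANINGS = {
    200: "Run queued",
    400: "Missing/invalid benchmark, invalid agent, or bad JSON",
    401: "Invalid or missing API key",
    403: "Pro subscription required",
    429: "Too many concurrent benchmark runs",
    500: "Failed to launch the run",
}

# Assumption: only capacity and server-side failures are worth retrying.
RETRYABLE_STATUSES = frozenset({429, 500})


def describe_status(code: int) -> str:
    """Return the documented meaning of a status code, or a generic fallback."""
    return STATUS_MEANINGS.get(code, f"Unexpected status {code}")


def is_retryable(code: int) -> bool:
    """Return True when resending the same request later might succeed."""
    return code in RETRYABLE_STATUSES
```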
