GET /api/v1/benchmarks/runs/{benchmarkRunId}/logs
Benchmark Logs & Status
curl --request GET \
  --url https://agent.blackbox.ai/api/v1/benchmarks/runs/{benchmarkRunId}/logs
{
  "status": "<string>",
  "progress": 123,
  "completedTasks": 123,
  "totalTasks": 123,
  "resolvedTasks": 123,
  "resolvedRate": {},
  "source": "<string>",
  "lineCount": 123,
  "logs": [
    {}
  ]
}
Track a benchmark run end-to-end: a lightweight status poll, a buffered logs snapshot, and a live SSE stream. Logs are served live from memory while the run is alive, and from the database after it ends (or after an unexpected sandbox/process death), so they remain available post-completion.

Authentication

All endpoints require a BLACKBOX API key as a Bearer token and are owner-scoped. See Authentication.

Status — GET …/runs/{id}/status

A cheap progress poll.
status (string): queued | preparing | running | completed | failed | cancelled.
progress (number): 0–100.
completedTasks (number): Tasks finished so far.
totalTasks (number): Tasks in the run.
resolvedTasks (number): Tasks that passed.
resolvedRate (number | null): Fraction 0..1, set on completion.
status
{
  "benchmarkRunId": "b41ab8b2-…",
  "status": "running", "progress": 80,
  "completedTasks": 8, "totalTasks": 10, "resolvedTasks": 8,
  "resolvedRate": null, "error": null,
  "startedAt": "2026-06-13T22:47:00.782Z", "completedAt": null
}
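The status poll above lends itself to a simple wait loop. The following is a minimal sketch, not an official client: `fetch_status`, `wait_for_run`, the polling interval, and the poll limit are all hypothetical names and values, and `BASE_URL` assumes the elided `…` prefix matches the host and path shown in the curl examples. The terminal states come from the status enum listed above.

```python
import json
import time
import urllib.request
from typing import Callable

# Terminal states, taken from the documented status enum.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

# Assumption: the elided "…" prefix matches the host and path in the curl examples.
BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks"


def fetch_status(run_id: str, api_key: str, base_url: str = BASE_URL) -> dict:
    """GET the status document for one run, sending the API key as a Bearer token."""
    request = urllib.request.Request(
        f"{base_url}/runs/{run_id}/status",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_run(
    fetch: Callable[[], dict],
    interval_s: float = 5.0,   # hypothetical default, tune to taste
    max_polls: int = 720,      # hypothetical upper bound on polls
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run reaches a terminal state or max_polls is exhausted."""
    for attempt in range(max_polls):
        status = fetch()
        if status["status"] in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"run still not terminal after {max_polls} polls")
```

A call might look like `wait_for_run(lambda: fetch_status("RUN_ID", "YOUR_API_KEY"))`. Passing the fetcher and the sleep function in as arguments keeps the loop testable without network access.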

Logs snapshot — GET …/runs/{id}/logs

Returns the current orchestrator log lines.
source (string): live — served from the in-memory buffer (run still active). persisted — served from the durable DB snapshot (run ended and the buffer was cleared).
lineCount (number): Number of lines returned.
logs (array): The log lines (each timestamped).
logs
{
  "benchmarkRunId": "b41ab8b2-…",
  "status": "completed",
  "source": "persisted",
  "lineCount": 92,
  "logs": [
    "[2026-06-13T22:47:01Z] [prepare] creating builder sandbox...",
    "[2026-06-13T22:54:42Z] [task aime__65] graded numeric_match: answer=\"104\" gold=\"104\" → PASS",
    "[2026-06-13T22:54:46Z] [done] resolved 9/10 (resolved_rate=0.9000)"
  ]
}
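The sample lines above share the shape `[timestamp] [tag] message`, so a client can split them into fields. This sketch assumes every line follows that shape, which the source shows only by example; `LogLine` and `parse_log_line` are hypothetical names, and lines that do not match return `None` rather than raising.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: datetime
    tag: str
    message: str


# Matches "[<ISO timestamp>] [<tag>] <message>", the shape of the sample lines.
LOG_PATTERN = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+\[(?P<tag>[^\]]+)\]\s*(?P<msg>.*)$")


def parse_log_line(line: str) -> Optional[LogLine]:
    """Split one log line into timestamp, tag, and message; None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalise it to an offset.
        timestamp = datetime.fromisoformat(match["ts"].replace("Z", "+00:00"))
    except ValueError:
        return None
    return LogLine(timestamp, match["tag"], match["msg"])
```

Filtering on `tag` (for example, tags starting with `task`) is then a one-line comprehension over the parsed `logs` array.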

Stream — GET …/runs/{id}/logs/stream

A text/event-stream (SSE) of the logs.
  • While the run is alive: replays the buffered lines, then pushes new lines as they arrive; closes with event: end when the run reaches a terminal state.
  • After the run ends: replays the durable DB snapshot once, then sends event: end. (This is why a completed run’s stream is never empty.)
Each log line is sent as its own event: data: <log line>\n\n.
stream
curl -N 'https://agent.blackbox.ai/api/v1/benchmarks/runs/RUN_ID/logs/stream' \
  -H 'Authorization: Bearer YOUR_API_KEY'
SSE
data: [2026-06-13T22:47:01Z] [prepare] creating builder sandbox...

data: [2026-06-13T22:54:46Z] [done] resolved 9/10 (resolved_rate=0.9000)

event: end
data: {}
Persistence is incremental: logs are mirrored to the database on a short throttle, with a final flush when the run ends, so even if the sandbox or process dies mid-run, the DB holds a recent snapshot. “Live from memory, dead from DB.”
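A client consuming the stream needs to collect `data:` fields, dispatch on blank lines, and stop at `event: end`. The sketch below handles only the subset of SSE shown in the sample above; `iter_log_lines` is a hypothetical helper, not part of any BLACKBOX SDK, and it takes any iterable of text lines so it can be driven by whichever HTTP client is in use.

```python
from typing import Iterable, Iterator, List


def iter_log_lines(raw_lines: Iterable[str]) -> Iterator[str]:
    """Yield each log line from an SSE text stream, stopping at `event: end`."""
    event_name = "message"  # SSE default when no `event:` field is sent
    data_parts: List[str] = []
    for raw in raw_lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event.
            if event_name == "end":
                return
            if data_parts:
                yield "\n".join(data_parts)
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # Leniency (an assumption, not SSE-mandated): flush a pending data event
    # if the connection closes without a trailing blank line.
    if event_name != "end" and data_parts:
        yield "\n".join(data_parts)
```

With the standard library, the response object returned by `urllib.request.urlopen` can be iterated line by line as bytes, so something like `iter_log_lines(raw.decode("utf-8") for raw in response)` should feed it directly.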

Cancel — POST …/runs/{id}/cancel

Cancels an in-flight run; returns the new status (cancelled). Completed/failed runs are not affected.
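Since cancelling affects only in-flight runs, a client can skip the request when the run has already finished. This is a hedged sketch: `build_cancel_request` and `should_cancel` are hypothetical helpers, `BASE_URL` assumes the elided `…` prefix matches the curl examples, and treating every non-terminal status as "in-flight" is an assumption the source does not spell out.

```python
import urllib.request

# Assumption: every non-terminal status in the documented enum counts as in-flight.
IN_FLIGHT_STATES = {"queued", "preparing", "running"}

# Assumption: the elided "…" prefix matches the host and path in the curl examples.
BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks"


def should_cancel(status: str) -> bool:
    """Return True only for runs that a cancel call could still affect."""
    return status in IN_FLIGHT_STATES


def build_cancel_request(
    run_id: str, api_key: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) the authenticated POST for the cancel endpoint."""
    return urllib.request.Request(
        f"{base_url}/runs/{run_id}/cancel",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Sending it is then `urllib.request.urlopen(build_cancel_request("RUN_ID", "YOUR_API_KEY"))`, guarded by `should_cancel` on the latest status.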

Tasks — GET …/runs/{id}/tasks

Per-instance results (instanceId, status, reward, durationMs, error, timestamps). For the aggregated score and traces in one call, prefer Benchmark Results.

Where logs live

Benchmark runs, task results, and logs are stored in a dedicated benchmark database (configured separately from the main app database). This keeps high-volume evaluation data isolated; runs and logs remain queryable by id after completion.

Benchmark Results

Score, percentage, per-task traces, tracking log.

Run Benchmarks

Launch a run (Claude / Codex / Grok, or your own router).