Benchmark Logs & Status

GET https://agent.blackbox.ai/api/v1/benchmarks/runs/{benchmarkRunId}/logs

Track a benchmark run end to end with a lightweight status poll, a buffered logs snapshot, or a live SSE stream. Logs are served from memory while the run is alive and from the database after it ends (including after an unexpected sandbox or process death), so they remain available post-completion.

Authentication

All endpoints require a BLACKBOX API key as a Bearer token and are owner-scoped. See Authentication.

Status — `GET …/runs/{id}/status`

A cheap progress poll.

status (string)

progress (number)

0–100.

completedTasks (number)

Tasks finished so far.

totalTasks (number)

Tasks in the run.

resolvedTasks (number)

Tasks that passed.

resolvedRate (number | null)

Fraction 0..1, set on completion.

{
  "benchmarkRunId": "b41ab8b2-…",
  "status": "running", "progress": 80,
  "completedTasks": 8, "totalTasks": 10, "resolvedTasks": 8,
  "resolvedRate": null, "error": null,
  "startedAt": "2026-06-13T22:47:00.782Z", "completedAt": null
}
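The status poll above can be wrapped in a small loop. This is a minimal sketch, not an official client: `BASE_URL` reuses the prefix from the logs URL at the top of this page and is assumed to apply to `/status` as well, the set of terminal statuses is inferred from the values named on this page, and `fetch_status` and `poll_until_done` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: same prefix as the logs URL shown at the top of this page.
BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"
# Assumption: inferred from the statuses mentioned on this page.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def fetch_status(run_id: str, api_key: str) -> dict:
    """GET the status document for one run (performs a real HTTP request)."""
    req = urllib.request.Request(
        f"{BASE_URL}/{run_id}/status",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_until_done(
    get_status: Callable[[], dict],
    interval_s: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status() until the run is terminal or max_polls is reached."""
    status = get_status()
    for _ in range(max_polls - 1):
        if status.get("status") in TERMINAL_STATUSES:
            break
        sleep(interval_s)
        status = get_status()
    return status
```

A caller would typically pass `lambda: fetch_status(run_id, api_key)` as `get_status`; keeping the fetcher injectable makes the loop easy to exercise without network access.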

Logs snapshot — `GET …/runs/{id}/logs`

Returns the current orchestrator log lines.

source (string)

live: served from the in-memory buffer (run still active).
persisted: served from the durable DB snapshot (run ended and the buffer was cleared).

lineCount (number)

Number of lines returned.

logs (array)

The log lines (each timestamped).

{
  "benchmarkRunId": "b41ab8b2-…",
  "status": "completed",
  "source": "persisted",
  "lineCount": 92,
  "logs": [
    "[2026-06-13T22:47:01Z] [prepare] creating builder sandbox...",
    "[2026-06-13T22:54:42Z] [task aime__65] graded numeric_match: answer=\"104\" gold=\"104\" → PASS",
    "[2026-06-13T22:54:46Z] [done] resolved 9/10 (resolved_rate=0.9000)"
  ]
}
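Each entry in `logs` in the sample above follows a `[timestamp] [tag] message` shape. The sketch below splits a line into those parts; the pattern is inferred from the sample lines only, so real logs may contain lines it does not match (it returns `None` for those). `LogLine` and `parse_log_line` are hypothetical names.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the sample lines: "[<ISO timestamp>] [<tag>] <message>".
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<tag>[^\]]+)\] (?P<msg>.*)$", re.DOTALL)


class LogLine(NamedTuple):
    timestamp: datetime
    tag: str
    message: str


def parse_log_line(line: str) -> Optional[LogLine]:
    """Split one log line into timestamp, tag and message; None if it does not fit."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalise it to an offset.
        timestamp = datetime.fromisoformat(match["ts"].replace("Z", "+00:00"))
    except ValueError:
        return None
    return LogLine(timestamp, match["tag"], match["msg"])
```

Filtering a snapshot down to grading lines is then a matter of checking `parsed.tag.startswith("task ")` on each parsed entry.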

Stream — `GET …/runs/{id}/logs/stream`

A text/event-stream (SSE) of the logs.

While the run is alive: replays the buffered lines, then pushes new lines as they arrive, and closes with event: end when the run reaches a terminal state.
After the run ends: replays the durable DB snapshot in one shot, then sends event: end. (This is why a completed run's stream is never empty.)

Each line is data: <log line>\n\n.

curl -N 'https://agent.blackbox.ai/api/v1/benchmarks/runs/RUN_ID/logs/stream' \
  -H 'Authorization: Bearer YOUR_API_KEY'

data: [2026-06-13T22:47:01Z] [prepare] creating builder sandbox...

data: [2026-06-13T22:54:46Z] [done] resolved 9/10 (resolved_rate=0.9000)

event: end
data: {}
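
On the client side, the stream shown above can be consumed with a small line-based parser. This is a minimal sketch that handles only the `event:` and `data:` fields used on this page plus `:` comment lines; `iter_sse_events` is a hypothetical helper, and a full SSE client would also handle `id:` and `retry:`.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from decoded SSE lines; stop after event 'end'."""
    event = "message"  # default event name when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data or event != "message":
                yield event, "\n".join(data)
                if event == "end":
                    return
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
    # An event still pending at end of input (no closing blank line) is dropped.
```

With `urllib.request.urlopen`, the response object can be iterated line by line, so something like `iter_sse_events(raw.decode("utf-8") for raw in resp)` would feed it; that wiring is left out here so the parser stays testable offline.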

Persistence is incremental — logs are mirrored to the database on a short throttle (plus a final flush when the run ends), so even if the sandbox or process dies mid-run, the DB holds a current snapshot. "Live from memory, dead from DB."
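
To make the throttle-plus-final-flush idea concrete, here is an illustrative sketch of that pattern. It is not the service's actual implementation: the class name, the 2-second interval, and the `persist` callback (standing in for a database write) are all assumptions.

```python
import time
from typing import Callable, List, Optional


class ThrottledLogMirror:
    """Keep lines in memory and mirror a full snapshot to storage at most
    once per min_interval_s, with an explicit flush() for the end of the run."""

    def __init__(
        self,
        persist: Callable[[List[str]], None],  # hypothetical DB writer
        min_interval_s: float = 2.0,           # assumed throttle interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._persist = persist
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._lines: List[str] = []
        self._last_flush: Optional[float] = None
        self._dirty = False

    def append(self, line: str) -> None:
        """Buffer a line; persist a snapshot if the throttle window has elapsed."""
        self._lines.append(line)
        self._dirty = True
        now = self._clock()
        if self._last_flush is None or now - self._last_flush >= self._min_interval_s:
            self.flush()

    def flush(self) -> None:
        """Persist the current snapshot if anything changed since the last write."""
        if self._dirty:
            self._persist(list(self._lines))
            self._dirty = False
        self._last_flush = self._clock()
```

In this sketch a crash between flushes loses at most the lines appended during the last throttle window, which is the trade-off the paragraph above describes.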

Cancel — `POST …/runs/{id}/cancel`

Cancels an in-flight run; returns the new status (cancelled). Completed/failed runs are not affected.
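
A cancel call could be built as follows. This is a sketch under assumptions: the URL prefix is taken from the logs URL at the top of this page and is assumed to apply to `/cancel` too, no request body is sent because none is documented here, and `build_cancel_request` is a hypothetical helper.

```python
import urllib.request

# Assumption: same prefix as the logs URL shown at the top of this page.
BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_cancel_request(run_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that cancels a run."""
    return urllib.request.Request(
        f"{BASE_URL}/{run_id}/cancel",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Sending it is a separate step, for example `urllib.request.urlopen(build_cancel_request(run_id, api_key))`, after which the response body should carry the new status.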

Tasks — `GET …/runs/{id}/tasks`

Per-instance results (instanceId, status, reward, durationMs, error, timestamps). For the aggregated score + traces in one call, prefer Benchmark Results.
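
If only a quick roll-up of the per-instance rows is needed, something like the sketch below may be enough. It assumes the rows are available as a list of objects carrying the fields named above; the exact response envelope is not shown on this page, and `summarize_tasks` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Mapping


def summarize_tasks(tasks: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-instance rows into a few headline numbers."""
    rows = list(tasks)
    # Rows without a reward (for example, still running) are left out of the mean.
    rewards = [row["reward"] for row in rows if row.get("reward") is not None]
    return {
        "count": len(rows),
        "meanReward": sum(rewards) / len(rewards) if rewards else None,
        "totalDurationMs": sum(row.get("durationMs") or 0 for row in rows),
        "erroredInstanceIds": [row["instanceId"] for row in rows if row.get("error")],
    }
```

For the official aggregated score, the Benchmark Results page remains the better source; this only condenses whatever rows the tasks call returned.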

Where logs live

Benchmark runs, task results, and logs are stored in a dedicated benchmark database (configured separately from the main app database). This keeps high-volume evaluation data isolated; runs and logs remain queryable by id after completion.

Benchmark Results

Score, percentage, per-task traces, tracking log.

Run Benchmarks

Launch a run (Claude / Codex / Grok, or your own router).
