Benchmark Logs & Status
Poll a benchmark run’s progress, read its orchestrator logs, or stream them live over SSE. Logs come from memory while the run is alive and from durable storage after it ends.
Track a benchmark run end-to-end: a lightweight status poll, a buffered logs snapshot, and a live SSE stream. Logs are served live from memory while the run is alive, and from the database after it ends (or after an unexpected sandbox/process death), so they remain available post-completion.
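- Status — GET …/runs/{id}/status: A cheap progress poll.
- Logs snapshot — GET …/runs/{id}/logs: Returns the current orchestrator log lines.
- Stream — GET …/runs/{id}/logs/stream: A text/event-stream (SSE) of the logs.
- Cancel — POST …/runs/{id}/cancel: Cancels an in-flight run; returns the new status (cancelled).
- Tasks — GET …/runs/{id}/tasks: Per-instance results (instanceId, status, reward, durationMs, error, timestamps).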
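As a rough illustration of how the five routes relate, the sketch below maps each one to its HTTP method and path suffix. The path prefix is elided on this page, so `base_url` is a placeholder the caller must supply; the helper name `run_endpoint` and the short route names are hypothetical, not part of the API.

```python
from urllib.parse import quote

# Method and path suffix for each route listed on this page.
# The short keys ("status", "stream", ...) are illustrative names only.
ENDPOINTS = {
    "status": ("GET", "status"),
    "logs": ("GET", "logs"),
    "stream": ("GET", "logs/stream"),
    "cancel": ("POST", "cancel"),
    "tasks": ("GET", "tasks"),
}


def run_endpoint(base_url: str, run_id: str, name: str) -> tuple[str, str]:
    """Return (method, url) for one run-scoped route.

    base_url is an assumption: it stands in for the elided prefix
    in front of /runs/{id}/... and is not documented here.
    """
    method, suffix = ENDPOINTS[name]
    safe_id = quote(run_id, safe="")  # percent-encode anything unusual in the id
    return method, f"{base_url.rstrip('/')}/runs/{safe_id}/{suffix}"
```

Keeping the method next to the suffix makes it harder to send a GET to the cancel route by mistake.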
Authentication
All endpoints require a BLACKBOX API key as a Bearer token and are owner-scoped. See Authentication.
Status — GET …/runs/{id}/status
A cheap progress poll.
- Run state: queued | preparing | running | completed | failed | cancelled.
- Progress, 0–100.
- Tasks finished so far.
- Tasks in the run.
- Tasks that passed.
- Fraction in 0..1, set on completion.
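A status poll is typically wrapped in a loop that stops at a terminal state. The sketch below is one way to do that, under several assumptions: the response is JSON, the run state sits under a key named `status`, and completed, failed, and cancelled are the terminal states. The names `fetch_status` and `wait_for_run` are hypothetical, and the full status URL must be supplied by the caller because the path prefix is elided on this page.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: these three states are terminal; the others are still in flight.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_status(status_url: str, api_key: str) -> dict:
    """GET the status endpoint with the API key as a Bearer token."""
    request = urllib.request.Request(
        status_url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_run(
    fetch: Callable[[], dict],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the (assumed) "status" key is terminal."""
    for _ in range(max_polls):
        payload = fetch()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        sleep(interval_s)
    raise TimeoutError(f"run not finished after {max_polls} polls")
```

A caller might pass `lambda: fetch_status(url, key)` as `fetch`; injecting the fetcher and the sleep function keeps the loop easy to test without network access.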
Logs snapshot — GET …/runs/{id}/logs
Returns the current orchestrator log lines.
- live: served from the in-memory buffer (run still active).
- persisted: served from the durable DB snapshot (run ended and the buffer was cleared).
Number of lines returned.
The log lines (each timestamped).
Stream — GET …/runs/{id}/logs/stream
A text/event-stream (SSE) of the logs.
- While the run is alive: replays the buffered lines, then pushes new lines as they arrive; closes with event: end when the run reaches a terminal state.
- After the run ends: replays the durable DB snapshot one-shot, then event: end. (This is why a completed run’s stream is never empty.)
Each log line is sent as data: <log line>\n\n.
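The framing described above (one `data:` line per log line, a blank line between events, and a closing `event: end`) can be consumed with a small parser. The following is a minimal sketch, not the service's client: `iter_log_lines` is a hypothetical helper that takes already-decoded text lines from whatever HTTP client is in use and yields log lines until the end event arrives.

```python
from typing import Iterable, Iterator


def iter_log_lines(raw_lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines from SSE text lines; stop at `event: end`."""
    event = "message"  # SSE default event type
    data: list[str] = []
    for raw in raw_lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if event == "end":
                return
            if data:
                yield "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            data.append(value[1:] if value.startswith(" ") else value)
```

Because both the live and the post-completion streams finish with `event: end`, the same loop covers a run that is still active and one that ended earlier.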
Cancel — POST …/runs/{id}/cancel
Cancels an in-flight run; returns the new status (cancelled). Completed/failed runs are not affected.
Tasks — GET …/runs/{id}/tasks
Per-instance results (instanceId, status, reward, durationMs, error, timestamps). For the aggregated score + traces in one call, prefer Benchmark Results.
Where logs live
Benchmark runs, task results, and logs are stored in a dedicated benchmark database (configured separately from the main app database). This keeps high-volume evaluation data isolated; runs and logs remain queryable by id after completion.
Benchmark Results
Score, percentage, per-task traces, tracking log.
Run Benchmarks
Launch a run (Claude / Codex / Grok, or your own router).