Benchmark Results

GET https://agent.blackbox.ai/api/v1/benchmarks/runs/{benchmarkRunId}/results

Final (or in-progress) score for a benchmark run: headline percentage, per-task pass/fail with verdict and duration, the underperformers, and an inline tracking log. Everything for one run, by id.

The rich, single-call view of a benchmark run. Unlike status (a lightweight progress poll), this returns the aggregated metric as a percentage plus a per-task breakdown with traces. It works mid-run (partial, reflecting progress) and after completion.
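
Because the endpoint answers both mid-run and after completion, a client can poll it until the final rate is recorded. The following is a minimal sketch, not the service's own client: `wait_for_final` and `fetch_results` are hypothetical names, and it assumes that a non-null `resolvedRate` is an adequate completion signal, since that field is documented as null until completion. The fetcher is injected so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict


def wait_for_final(
    fetch_results: Callable[[], Dict[str, Any]],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the results endpoint until `resolvedRate` is recorded.

    `fetch_results` is a hypothetical callable that performs the GET and
    returns the decoded JSON body. Assumption: a non-null `resolvedRate`
    marks a finished run.
    """
    if max_polls < 1:
        raise ValueError("max_polls must be at least 1")
    for attempt in range(max_polls):
        payload = fetch_results()
        if payload.get("resolvedRate") is not None:
            return payload
        # Still partial: wait before the next poll (skip after the last one).
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"run not final after {max_polls} polls")
```

A run that never records a final rate ends in `TimeoutError` rather than looping forever, so the caller decides how to handle it.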

Authentication

All requests require a BLACKBOX API key as a Bearer token. A run is only readable by the user that created it. See Authentication.

Headers

Authorization string required

API Key of the form Bearer <api_key>.

Path Parameters

benchmarkRunId string required

The id returned by Run Benchmarks.

Response

percentage number

The headline score: resolved / total × 100, rounded to 2 decimal places. It uses the recorded final rate when available and falls back to the live counters mid-run.

resolvedRate number | null

Fraction 0..1 recorded on completion (null until then).

metric string

What the percentage measures — e.g. accuracy (QA) or resolved_rate (SWE).

agent string

The runtime that actually ran — "claude", "codex", or "grok".

score object

{ resolved, total, completed, failed } counters.

summary object

Per-status tally: { resolved, unresolved, errored, pending, running }.

timing object

Latency rollup: { totalDurationMs, avgTaskMs, slowestTaskMs, fastestTaskMs } — your "where did it lag" signal.

tasks array

One object per instance, with the trace fields:

task fields

instanceId string

Dataset instance id (e.g. aime__60).

status string

resolved | unresolved | errored | running | pending.

reward number | null

1 = passed, 0 = failed (null until verified).

durationMs number | null

Per-task latency in milliseconds.

sandboxId string | null

The isolated sandbox that ran it — for traceability.

grade string | null

The verifier's verdict line (e.g. answer="204" gold="204" → PASS).

error string | null

Error message, if any.

startedAt string | null

ISO 8601 timestamp of when the task started.

completedAt string | null

ISO 8601 timestamp of when the task completed.

underperformed array

The failed/errored tasks only, slowest first — each with instanceId, status, durationMs, error, and grade. The "what did it miss / where did it lag" view.

tracking object

Inline log for tracking, so you can follow a run without a second call:

tracking fields

logSource string

live (from the in-memory buffer while running) or persisted (from the DB after the run ends).

logLineCount number

Total lines available.

recentLogs array

The last ~25 log lines.

logsUrl string

Path to the full log snapshot.

streamUrl string

Path to the SSE log stream.
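
The `percentage` rule described above (prefer the recorded final rate, otherwise use the live counters) can be reproduced client-side as a sanity check. This is a sketch under assumptions: `headline_percentage` is a hypothetical helper, and the plain round-to-2-places step is a guess at the server's rounding, which the reference does not specify.

```python
from typing import Any, Dict, Optional


def headline_percentage(payload: Dict[str, Any]) -> Optional[float]:
    """Recompute the headline score from a results payload.

    Prefers the recorded `resolvedRate`; otherwise falls back to the live
    `score` counters. Returns None when no total is available yet.
    Assumption: ordinary rounding to 2 decimal places.
    """
    rate = payload.get("resolvedRate")
    if rate is None:
        score = payload.get("score") or {}
        total = score.get("total") or 0
        if total == 0:
            return None  # Nothing to divide by yet.
        rate = score.get("resolved", 0) / total
    return round(rate * 100, 2)
```

Comparing this value with the returned `percentage` is one way to confirm that a client reads `resolvedRate` and `score` the way the endpoint does.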

Request Example

curl 'https://agent.blackbox.ai/api/v1/benchmarks/runs/RUN_ID/results' \
  -H 'Authorization: Bearer YOUR_API_KEY'
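
For clients that prefer Python over curl, the same request can be built with the standard library. This is an illustrative sketch: `build_results_request` and `BASE_URL` are names introduced here, not part of any BLACKBOX SDK, and the function only constructs the request so it can be inspected without network access.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://agent.blackbox.ai/api/v1/benchmarks/runs"


def build_results_request(run_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a run's results."""
    # Percent-encode the id so it always stays a single path segment.
    safe_id = urllib.parse.quote(run_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/{safe_id}/results",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Sending it (requires network access and a real key):
#   import json
#   with urllib.request.urlopen(build_results_request(run_id, key)) as resp:
#       results = json.load(resp)
```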

Response Example

{
  "benchmarkRunId": "b41ab8b2-9363-489a-99a1-5af2f9649b92",
  "benchmark": "aime",
  "dataset": "aime-2024-2025",
  "model": "blackboxai/x-ai/grok-build-0.1",
  "agent": "grok",
  "status": "completed",
  "progress": 100,
  "metric": "accuracy",
  "percentage": 90,
  "resolvedRate": 0.9,
  "score": { "resolved": 9, "total": 10, "completed": 10, "failed": 0 },
  "summary": { "resolved": 9, "unresolved": 1, "errored": 0, "pending": 0, "running": 0 },
  "timing": { "totalDurationMs": 465278, "avgTaskMs": 158648, "slowestTaskMs": 365055, "fastestTaskMs": 47887 },
  "tasks": [
    {
      "instanceId": "aime__60",
      "status": "resolved",
      "reward": 1,
      "durationMs": 47887,
      "sandboxId": "bench-b41ab8b2-…-t0",
      "error": null,
      "grade": "[task aime__60] graded numeric_match: answer=\"204\" gold=\"204\" → PASS",
      "startedAt": "2026-06-13T22:48:15.013Z",
      "completedAt": "2026-06-13T22:49:02.900Z"
    }
  ],
  "underperformed": [
    {
      "instanceId": "aime__63",
      "status": "unresolved",
      "durationMs": 365055,
      "error": null,
      "grade": "[task aime__63] unresolved (365.1s)"
    }
  ],
  "tracking": {
    "logSource": "persisted",
    "logLineCount": 92,
    "recentLogs": [ "…", "[done] resolved 9/10 (resolved_rate=0.9000)" ],
    "logsUrl": "/api/v1/benchmarks/runs/b41ab8b2-…/logs",
    "streamUrl": "/api/v1/benchmarks/runs/b41ab8b2-…/logs/stream"
  },
  "error": null,
  "startedAt": "2026-06-13T22:47:00.782Z",
  "completedAt": "2026-06-13T22:54:46.060Z"
}
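
The `underperformed` array in the response is described as the failed or errored tasks, slowest first. A client holding only `tasks` could derive an equivalent view as sketched below. `underperformers` is a hypothetical helper, and placing tasks with a null `durationMs` last is an assumption, since the reference does not say how the server orders them.

```python
from typing import Any, Dict, List

FAILED_STATUSES = {"unresolved", "errored"}
SUMMARY_KEYS = ("instanceId", "status", "durationMs", "error", "grade")


def underperformers(tasks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return failed/errored tasks, slowest first, trimmed to summary fields."""
    failed = [t for t in tasks if t.get("status") in FAILED_STATUSES]
    # Assumption: tasks without a recorded duration sort last.
    failed.sort(key=lambda t: t.get("durationMs") or 0, reverse=True)
    return [{key: t.get(key) for key in SUMMARY_KEYS} for t in failed]
```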

What you can answer by id

| Question | Field |
| --- | --- |
| Which tests failed? | `tasks[].status` / `reward`, `underperformed[]` |
| Where did it lag? | `tasks[].durationMs`, `timing` |
| What did it answer vs gold? | `tasks[].grade` |
| Which sandbox ran it? | `tasks[].sandboxId` |
| How do I track progress live? | `tracking.recentLogs` (+ `streamUrl`) |
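
The lookups in the table can be collected into one small helper that works on a decoded results payload. This is a sketch only: `answer_by_id` and `API_ORIGIN` are hypothetical names, and resolving `streamUrl` against the API host is an assumption based on the field being documented as a path.

```python
from typing import Any, Dict
from urllib.parse import urljoin

API_ORIGIN = "https://agent.blackbox.ai"


def answer_by_id(results: Dict[str, Any]) -> Dict[str, Any]:
    """Answer the table's questions from one results payload."""
    tasks = results.get("tasks", [])
    timed = [t for t in tasks if t.get("durationMs") is not None]
    slowest = max(timed, key=lambda t: t["durationMs"], default=None)
    stream_path = results.get("tracking", {}).get("streamUrl")
    return {
        # Which tests failed?
        "failed": [t["instanceId"] for t in tasks
                   if t.get("status") in ("unresolved", "errored")],
        # Where did it lag?
        "slowest": slowest["instanceId"] if slowest else None,
        # What did it answer vs gold?
        "grades": {t["instanceId"]: t.get("grade") for t in tasks},
        # Which sandbox ran it?
        "sandboxes": {t["instanceId"]: t.get("sandboxId") for t in tasks},
        # How do I track progress live? (assumed to share the API host)
        "stream_url": urljoin(API_ORIGIN, stream_path) if stream_path else None,
    }
```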

Benchmark Logs & Status

Poll status and stream the full logs (live, then replayed from the DB).