GET /api/v1/benchmarks/runs/{benchmarkRunId}/results
curl 'https://agent.blackbox.ai/api/v1/benchmarks/runs/RUN_ID/results' \
  -H 'Authorization: Bearer YOUR_API_KEY'
{
  "benchmarkRunId": "b41ab8b2-9363-489a-99a1-5af2f9649b92",
  "benchmark": "aime",
  "dataset": "aime-2024-2025",
  "model": "blackboxai/x-ai/grok-build-0.1",
  "agent": "grok",
  "status": "completed",
  "progress": 100,
  "metric": "accuracy",
  "percentage": 90,
  "resolvedRate": 0.9,
  "score": { "resolved": 9, "total": 10, "completed": 10, "failed": 0 },
  "summary": { "resolved": 9, "unresolved": 1, "errored": 0, "pending": 0, "running": 0 },
  "timing": { "totalDurationMs": 465278, "avgTaskMs": 158648, "slowestTaskMs": 365055, "fastestTaskMs": 47887 },
  "tasks": [
    {
      "instanceId": "aime__60",
      "status": "resolved",
      "reward": 1,
      "durationMs": 47887,
      "sandboxId": "bench-b41ab8b2-…-t0",
      "error": null,
      "grade": "[task aime__60] graded numeric_match: answer=\"204\" gold=\"204\" → PASS",
      "startedAt": "2026-06-13T22:48:15.013Z",
      "completedAt": "2026-06-13T22:49:02.900Z"
    }
  ],
  "underperformed": [
    {
      "instanceId": "aime__63",
      "status": "unresolved",
      "durationMs": 365055,
      "error": null,
      "grade": "[task aime__63] unresolved (365.1s)"
    }
  ],
  "tracking": {
    "logSource": "persisted",
    "logLineCount": 92,
    "recentLogs": [ "…", "[done] resolved 9/10 (resolved_rate=0.9000)" ],
    "logsUrl": "/api/v1/benchmarks/runs/b41ab8b2-…/logs",
    "streamUrl": "/api/v1/benchmarks/runs/b41ab8b2-…/logs/stream"
  },
  "error": null,
  "startedAt": "2026-06-13T22:47:00.782Z",
  "completedAt": "2026-06-13T22:54:46.060Z"
}
Returns the rich, single-call view of a benchmark run. Unlike status (a lightweight progress poll), it returns the aggregated metric as a percentage plus a per-task breakdown with traces. It works both mid-run, when the response is partial and reflects progress so far, and after completion.
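Because the endpoint answers both mid-run and after completion, a client can poll it until the run finishes. The sketch below is one hedged way to do that. `wait_for_results` and its `fetch` callable are hypothetical names, not part of the API. Only the `"completed"` status appears in the example response, so the loop is also bounded by `max_polls` in case a run ends in some other state.

```python
import time
from typing import Callable


def wait_for_results(
    fetch: Callable[[], dict],
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a results payload until status is "completed" or polls run out.

    `fetch` is any zero-argument callable returning the parsed JSON body
    of GET /api/v1/benchmarks/runs/{benchmarkRunId}/results.
    """
    last: dict = {}
    for _ in range(max_polls):
        last = fetch()
        # "completed" is the only terminal status shown in the example
        # response; other terminal values are not documented here.
        if last.get("status") == "completed":
            return last
        sleep(interval_s)
    # Still not completed: hand back the latest partial view.
    return last
```

Injecting `fetch` and `sleep` keeps the loop independent of any HTTP library and easy to exercise with canned payloads.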

Authentication

All requests require a BLACKBOX API key as a Bearer token. A run is only readable by the user that created it. See Authentication.

Headers

Authorization
string
required
API Key of the form Bearer <api_key>.

Path Parameters

benchmarkRunId
string
required
The id returned by Run Benchmarks.
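As an illustration of the header and path parameter above, the following sketch builds the same request as the curl example using only the Python standard library. The helper name `build_results_request` is hypothetical, and the host is copied from the curl example.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://agent.blackbox.ai"  # host taken from the curl example


def build_results_request(benchmark_run_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a run's results."""
    # Percent-encode the id so unexpected characters cannot alter the path.
    run_id = quote(benchmark_run_id, safe="")
    path = f"/api/v1/benchmarks/runs/{run_id}/results"
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
```

To send it, pass the request to `urllib.request.urlopen` and parse the body with `json.load`.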

Response

percentage
number
The headline score — resolved / total × 100 (2 decimal places). Prefers the recorded final rate; falls back to live counters mid-run.
resolvedRate
number | null
Fraction 0..1 recorded on completion (null until then).
metric
string
What the percentage measures — e.g. accuracy (QA) or resolved_rate (SWE).
agent
string
The runtime that actually ran — "claude", "codex", or "grok".
score
object
{ resolved, total, completed, failed } counters.
summary
object
Per-status tally: { resolved, unresolved, errored, pending, running }.
timing
object
Latency rollup: { totalDurationMs, avgTaskMs, slowestTaskMs, fastestTaskMs } — your “where did it lag” signal.
tasks
array
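One object per instance, with the trace fields shown in the example response: instanceId, status, reward, durationMs, sandboxId, error, grade, startedAt, and completedAt.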
underperformed
array
The tasks that did not resolve (unresolved or errored; the example lists an unresolved task), slowest first. Each entry carries instanceId, status, durationMs, error, and grade. This is the “what did it miss / where did it lag” view.
tracking
object
Inline log tracking, so you can follow a run without a second call. The example response shows the fields logSource, logLineCount, recentLogs, logsUrl, and streamUrl.
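The percentage rule described above (resolved / total × 100 to 2 decimal places, preferring the recorded rate) can be sketched client-side as follows. This is an illustration of the documented behaviour, not the server's code. The function name and the zero-total handling are assumptions.

```python
from typing import Optional


def headline_percentage(resolved_rate: Optional[float], resolved: int, total: int) -> float:
    """Prefer the recorded final rate; otherwise fall back to live counters."""
    if resolved_rate is not None:
        # resolvedRate is a 0..1 fraction recorded on completion.
        return round(resolved_rate * 100, 2)
    if total == 0:
        # Assumption: a run with no tasks reports 0 instead of raising.
        return 0.0
    return round(resolved / total * 100, 2)
```

With the example response (`resolvedRate` 0.9, `score` 9 of 10) both branches agree on 90.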

What you can answer by id

| Question | Field |
| --- | --- |
| Which tests failed? | tasks[].status / reward, underperformed[] |
| Where did it lag? | tasks[].durationMs, timing |
| What did it answer vs gold? | tasks[].grade |
| Which sandbox ran it? | tasks[].sandboxId |
| How do I track progress live? | tracking.recentLogs (+ streamUrl) |
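The lookups in the table can be collected from one parsed response. The sketch below is a minimal illustration. `answer_by_id` is a hypothetical helper, and it assumes that tasks which have not finished may carry a null `durationMs`.

```python
def answer_by_id(results: dict) -> dict:
    """Pull the table's answers out of one parsed results payload."""
    tasks = results.get("tasks") or []
    # Assumption: unfinished tasks may have durationMs set to null.
    timed = [t for t in tasks if t.get("durationMs") is not None]
    slowest = max(timed, key=lambda t: t["durationMs"], default=None)
    return {
        # Which tests failed? -> underperformed[]
        "missed": [t["instanceId"] for t in results.get("underperformed") or []],
        # Where did it lag? -> tasks[].durationMs
        "slowest": slowest["instanceId"] if slowest else None,
        # What did it answer vs gold? -> tasks[].grade
        "grades": {t["instanceId"]: t.get("grade") for t in tasks},
        # Which sandbox ran it? -> tasks[].sandboxId
        "sandboxes": {t["instanceId"]: t.get("sandboxId") for t in tasks},
        # How do I track progress live? -> tracking.recentLogs
        "recent_logs": (results.get("tracking") or {}).get("recentLogs", []),
    }
```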

Benchmark Logs & Status

Poll status and stream the full logs (live, then replayed from the DB).
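The `logsUrl` and `streamUrl` values in `tracking` are relative paths in the example response. A hedged sketch for turning them into absolute URLs follows. It assumes they resolve against the same host as the results endpoint, and `tracking_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://agent.blackbox.ai"  # host taken from the curl example


def tracking_urls(results: dict, base_url: str = BASE_URL) -> dict:
    """Resolve tracking.logsUrl / tracking.streamUrl against the API host."""
    tracking = results.get("tracking") or {}
    return {
        key: urljoin(base_url, tracking[key])
        for key in ("logsUrl", "streamUrl")
        if tracking.get(key)  # skip keys that are missing or empty
    }
```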