
Next.js AI Model Performance Evaluations

Blackbox CLI achieves state-of-the-art results on the Next.js Evals benchmark:

Model               Success Rate   Tasks Completed
Claude Sonnet 4.5   52%            26/50
Claude Opus 4.5     60%            30/50
This performance places Blackbox CLI among the top performers on AI coding agent benchmarks, demonstrating its effectiveness on complex Next.js development tasks.
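The success rates above follow directly from the task counts in the table; a quick shell check of the arithmetic:

```shell
#!/bin/sh
# Sanity-check the published success rates from the task counts.
# rate <completed> <total> -> integer percentage
rate() {
  echo $(( $1 * 100 / $2 ))
}

echo "Claude Sonnet 4.5: $(rate 26 50)%"  # prints "Claude Sonnet 4.5: 52%"
echo "Claude Opus 4.5:   $(rate 30 50)%"  # prints "Claude Opus 4.5:   60%"
```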

About the benchmark

Next.js Evals provides comprehensive performance evaluations of AI models and agents on Next.js code generation and migration tasks. The evaluation framework measures several key metrics:
  • Success Rate: Percentage of successful code generation and migration tasks
  • Execution Time: Average duration to complete tasks
  • Token Usage: Total tokens consumed during evaluations
  • Quality Improvements: Assessment of code quality and best practices
The evaluations are run regularly against the latest Next.js version (currently 15.5.6 as of November 18, 2025) and test various AI models and coding agents on real-world Next.js development scenarios including component creation, API route development, and application migration tasks.
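As a rough sketch of how per-task results roll up into these metrics, the snippet below aggregates a small sample log with awk. The CSV layout here (task,status,seconds,tokens) is hypothetical; the actual output format of the eval harness may differ.

```shell
#!/bin/sh
# Hypothetical per-task results log (not the harness's real format).
cat > results.csv <<'EOF'
task,status,seconds,tokens
create-component,pass,41,12050
api-route,fail,73,18400
migrate-pages,pass,128,30210
EOF

# Aggregate success rate, average execution time, and total token usage.
awk -F, 'NR > 1 {
  total++; secs += $3; toks += $4
  if ($2 == "pass") passed++
} END {
  printf "Success Rate: %d/%d (%.0f%%)\n", passed, total, passed * 100 / total
  printf "Avg Execution Time: %.1fs\n", secs / total
  printf "Token Usage: %d\n", toks
}' results.csv
# -> Success Rate: 2/3 (67%)
#    Avg Execution Time: 80.7s
#    Token Usage: 60660
```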

Reproduce the results
```shell
# 1. Clone the repo
git clone https://github.com/blackboxai-team/next-evals-oss
cd next-evals-oss

# 2. Install bun if it's missing
curl -fsSL https://bun.sh/install | bash

# 3. Install dependencies
pnpm install

# 4. Set up environment variables (OpenRouter exposes an OpenAI-compatible API)
export OPENAI_API_KEY="your-openrouter-api-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_MODEL="anthropic/claude-sonnet-4.5" # or anthropic/claude-opus-4.5

# 5. Run the full eval
bun blackbox-cli.ts --all --model anthropic/claude-sonnet-4.5
```
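Since a full eval run is long and the CLI reads its credentials from the environment, a small pre-flight check can fail fast if any of the variables from step 4 are unset. This helper is not part of the repo; it is a minimal sketch you could run before kicking off the eval.

```shell
#!/bin/sh
# Print "ok" if every named environment variable is set and non-empty;
# otherwise report the first missing one and return non-zero.
check_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      return 1
    fi
  done
  echo "ok"
}

# Usage (before running the eval):
#   check_env OPENAI_API_KEY OPENAI_BASE_URL OPENAI_MODEL || exit 1
```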

Eval run logs

Detailed logs from our evaluation run are publicly available for developers and researchers to review.