Full benchmark suite complete — E2B, Daytona, CodeSandbox, Blaxel, Sprites.dev, Modal, VMVM (3 runs each, best-of-N).

Scores

Leaderboard

Detailed Comparison

Scoring Methodology

Each provider is scored from 0 to 100 based on weighted metrics. The benchmark measures the full lifecycle of an AI agent interacting with a sandbox: authenticate, create, execute code, read/write files, and destroy. When extended suites are run, a Capabilities weight is added and the other weights are adjusted to make room for it.
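As a rough illustration of how such a weighted score could be combined, here is a minimal sketch. It assumes each metric has already been normalized to 0-100, and it assumes that adding a Capabilities weight shrinks the other weights proportionally so they still sum to 1. The metric names, the weight values, and the rescaling rule are all hypothetical; the benchmark's actual weights are not stated here.

```python
from typing import Dict, Optional

# Hypothetical base weights, for illustration only.
# The benchmark's real metric names and weights are not given in this section.
BASE_WEIGHTS: Dict[str, float] = {
    "create": 0.40,
    "execute": 0.30,
    "files": 0.20,
    "destroy": 0.10,
}


def effective_weights(capabilities_weight: Optional[float] = None) -> Dict[str, float]:
    """Return weights that sum to 1.0.

    Assumption: when a Capabilities weight is added, the base weights
    are scaled down proportionally to make room for it.
    """
    if capabilities_weight is None:
        return dict(BASE_WEIGHTS)
    if not 0.0 <= capabilities_weight <= 1.0:
        raise ValueError("capabilities_weight must be between 0 and 1")
    scale = 1.0 - capabilities_weight
    weights = {name: w * scale for name, w in BASE_WEIGHTS.items()}
    weights["capabilities"] = capabilities_weight
    return weights


def weighted_score(
    metric_scores: Dict[str, float],
    capabilities_weight: Optional[float] = None,
) -> float:
    """Combine per-metric scores (each 0-100) into one 0-100 score."""
    weights = effective_weights(capabilities_weight)
    total = sum(weights[name] * metric_scores[name] for name in weights)
    # Clamp to guard against tiny floating-point overshoot.
    return max(0.0, min(100.0, total))
```

Calling `weighted_score(scores)` uses only the base weights, while `weighted_score(scores, capabilities_weight=0.2)` expects a `"capabilities"` entry in `scores` and rescales the rest.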

Grades: A (85-100) · B (70-84) · C (55-69) · D (40-54) · F (0-39)
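The grade bands above can be expressed as a small lookup. This is a sketch, not the benchmark's own code: the function name is hypothetical, and it assumes fractional scores fall into the band whose lower bound they meet or exceed (so 84.5 is a B).

```python
def grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Lower bound of each band, checked from highest to lowest.
    bands = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]
    for lower_bound, letter in bands:
        if score >= lower_bound:
            return letter
    return "F"
```

For example, `grade(72.5)` returns `"B"` and `grade(39)` returns `"F"`.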