Scores
Leaderboard
Detailed Comparison
Scoring Methodology
Each provider receives a score from 0 to 100, computed from weighted metrics. The benchmark measures the full lifecycle of an AI agent interacting with a sandbox: authenticate, create the sandbox, execute code, read and write files, and destroy the sandbox. When extended suites are run, a Capabilities weight is added and the other weights are adjusted accordingly.
Grades: A (85-100) · B (70-84) · C (55-69) · D (40-54) · F (0-39)
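As a rough illustration of how a weighted score and the grade bands above could fit together, here is a minimal Python sketch. The grade thresholds come from the list above; everything else is an assumption made for illustration: the metric names, the weight values, the choice to normalize weights by their sum (one possible reading of "other weights are adjusted"), and the treatment of fractional scores, which the methodology does not specify.

```python
from typing import Mapping

# Grade bands from the methodology text: (minimum score, letter).
GRADE_BANDS = [(85, "A"), (70, "B"), (55, "C"), (40, "D"), (0, "F")]


def weighted_score(metrics: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine per-metric scores (each assumed to be 0-100) into one 0-100 score.

    Assumption: weights are normalized by their sum, so adding an extra
    weight (for example a hypothetical "capabilities" entry) shrinks the
    relative share of the others. The real benchmark may adjust differently.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


def grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the bands above.

    Assumption: a fractional score such as 84.5 falls into the lower band,
    because it has not reached the next band's minimum.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for minimum, letter in GRADE_BANDS:
        if score >= minimum:
            return letter
    return "F"


# Hypothetical metric names and weights, for illustration only.
example_metrics = {"latency": 90.0, "reliability": 80.0}
example_weights = {"latency": 0.5, "reliability": 0.5}
example_score = weighted_score(example_metrics, example_weights)
example_grade = grade(example_score)
```

With these made-up inputs, `example_score` is the equal-weight average of the two metrics, and `grade` looks up the band that contains it. Swapping in a third weight shows how normalization changes each metric's share without touching the grade thresholds.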