modelregression

Python · ★ 30 · updated 11d ago

An automated daily benchmark that runs 30 coding tasks against AI models like Claude, GPT, and Grok, then flags when a model's quality silently drops after an update. Results and score history are published on a public dashboard.

Python · SQLite · Next.js

ModelRegression is an independent benchmarking project that tracks how well AI coding tools perform over time and automatically flags when a model's quality drops. AI providers update their models frequently, sometimes without announcing changes, so this project runs the same fixed set of tests against each model every day and publishes the results at modelregression.com.

The benchmark suite contains 30 tests spread across 10 categories: multi-step logical reasoning, coding tasks, bug fixing, feature implementation, edge case coverage, safe refactoring (changing code without introducing new bugs), security awareness (such as recognizing SQL injection or cross-site scripting risks), instruction following, code quality, and performance efficiency. The tests run automatically every day at 3 a.m., and the scores are stored in a local SQLite database. If a model's score drops more than 5% from its recent average, the system flags a regression; drops of more than 10% and more than 20% are escalated to successively higher severity levels.
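The threshold logic described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table name `daily_scores`, its columns, the severity labels, the seven-run window, and the reading of "drop" as a relative percentage of the recent average are all assumptions made for the example.

```python
import sqlite3
from statistics import mean

# Thresholds (5%, 10%, 20%) come from the description above.
# The severity labels are hypothetical names chosen for this sketch.
THRESHOLDS = [(20.0, "critical"), (10.0, "major"), (5.0, "minor")]


def classify_drop(recent_scores, todays_score):
    """Return a severity label, or None when no regression is flagged.

    Assumes the drop is measured relative to the mean of recent scores.
    """
    if not recent_scores:
        return None
    baseline = mean(recent_scores)
    if baseline <= 0:
        return None
    drop_pct = (baseline - todays_score) / baseline * 100.0
    # Check the largest threshold first so the highest severity wins.
    for threshold, label in THRESHOLDS:
        if drop_pct > threshold:
            return label
    return None


def recent_scores(conn, model, limit=7):
    """Fetch the most recent scores for one model, newest first.

    The schema (daily_scores: model, run_date, score) is assumed.
    """
    rows = conn.execute(
        "SELECT score FROM daily_scores "
        "WHERE model = ? ORDER BY run_date DESC LIMIT ?",
        (model, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

A daily job built on this sketch would call `recent_scores` before inserting today's result, then pass that history and the new score to `classify_drop`. Whether the real system uses a relative drop or a drop in absolute percentage points is not stated in the description.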

The models tested are Claude Opus 4.8 and Claude Sonnet 4.6 from Anthropic, GPT-5.5 from OpenAI, and Grok from xAI. Crucially, each model is tested through its official command-line tool rather than through a raw API call, so the benchmarks reflect the full experience a developer would actually have.

The website is built with Next.js and shows a dashboard with scores over time, per-model and per-category detail pages, a side-by-side comparison view, outage history for when models are unreachable, and a page with full evidence for each test run including the original prompts, the model's output, and the score it received. The benchmark engine itself is a Python orchestrator that runs tests in parallel, uses another AI model (Claude Sonnet) as a judge for tests where there is no single correct answer, and exports results to static JSON files that the website reads.
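As a rough sketch of the orchestrator pattern described above (run tests in parallel, then export static JSON for the site to read), something like the following would work. The function names, the result shape, and the JSON layout are hypothetical, and `run_test` is a placeholder: a real runner would invoke each model's command-line tool and score the output, details the description does not give.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_test(model, test_id):
    """Placeholder runner (hypothetical).

    A real implementation would call the model's CLI and score the output.
    """
    return {"model": model, "test": test_id, "score": 0.0}


def run_suite(models, test_ids, runner=run_test, max_workers=8):
    """Run every (model, test) pair in a thread pool.

    pool.map returns results in the same order as the submitted jobs.
    """
    jobs = [(model, test_id) for model in models for test_id in test_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: runner(*job), jobs))


def group_by_model(results):
    """Reshape flat results into {model: [{test, score}, ...]}."""
    grouped = {}
    for result in results:
        grouped.setdefault(result["model"], []).append(
            {"test": result["test"], "score": result["score"]}
        )
    return grouped


def export_json(results, path):
    """Write grouped results to a static JSON file (layout is assumed)."""
    Path(path).write_text(json.dumps(group_by_model(results), indent=2))
```

Threads are a reasonable fit here on the assumption that each test mostly waits on an external process or network call; a CPU-bound scorer would call for a different choice.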

The project is MIT-licensed and open to contributions for new test categories.

Where it fits

Track whether Claude, GPT, or Grok has silently degraded in coding quality after a model update.
Compare AI models side by side on tasks like bug fixing, security awareness, or safe code refactoring.
Get automatic regression alerts when a model's benchmark score drops more than 5% from its recent average.
Add new test categories to the open-source suite to cover gaps in the existing 10 benchmark areas.
