evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
A framework and benchmark library for testing how well AI language models perform, run existing tests or write your own to measure accuracy on tasks specific to your app.
OpenAI Evals is a framework for evaluating large language models (LLMs) — AI systems that generate text — and an open-source registry of benchmark tests for measuring their capabilities. An "eval" in this context is a structured test that runs a model against a set of inputs and measures how well its outputs match expected results.
The project serves two purposes. First, it provides an existing library of benchmarks that test different capabilities of language models. Second, it gives developers a framework to write their own custom evaluations for use cases specific to their application, including private evals that use proprietary data without exposing it publicly.
Custom evals can be built in two ways: model-graded evals, where another language model judges whether the output is correct (these are currently accepted as contributions), or evals with custom Python code (currently not accepted as community submissions). For basic evals, no coding is required — you provide data in JSON format and specify parameters in a YAML configuration file.
To run evals, you need an OpenAI API key and Python 3.9 or later. The eval registry data is stored using Git LFS (Large File Storage), a Git extension for tracking large binary files, which needs to be fetched separately after cloning the repository. Results can optionally be logged to a Snowflake database. An interactive dashboard version is also available directly in the OpenAI platform without needing this codebase.
Where it fits
- Run an existing benchmark from the registry to compare two AI model versions on a set of standardized tasks
- Write a custom eval using a YAML config and JSON test data to measure how well a model handles your app's specific use case
- Build a model-graded eval where one AI judges whether another AI's answers are correct, without writing any code
- Use private proprietary data to evaluate a language model without exposing that data publicly