lm-evaluation-harness

Python ★ 13k updated 19d ago

A framework for few-shot evaluation of language models.

The LM Evaluation Harness is a Python framework for benchmarking AI language models on 60+ standardized tasks in a reproducible way. It powers the Hugging Face Open LLM Leaderboard.

Python · vLLM · Hugging Face · setup: moderate · complexity 3/5

The Language Model Evaluation Harness is a Python framework for testing AI language models against a wide range of standardized benchmarks. It is maintained by EleutherAI, a research group focused on open AI research. This is the same framework that powers the Hugging Face Open LLM Leaderboard, which many people use to compare the capabilities of publicly available AI models.

The main purpose of the tool is to give researchers and developers a consistent, reproducible way to measure how well a language model performs on tasks like reading comprehension, common sense reasoning, math, coding, and many others. Over 60 standard academic benchmarks are included, covering hundreds of subtasks. Because everyone runs the same prompts in the same way, results from different teams or papers can be compared directly.

The framework supports several ways to load and run models. You can evaluate models from the Hugging Face model library, run models locally using vLLM for faster inference, or call commercial API providers like OpenAI or Anthropic. Quantized models (compressed to use less memory) are also supported through additional optional packages. The base installation is lightweight; you install the model backend separately depending on which kind of model you want to test.

Running an evaluation from the command line involves specifying the model, the tasks to run, and a batch size. Results are printed to the terminal and can be saved to a file. There is also a Python API for running evaluations programmatically inside scripts or notebooks. Custom tasks can be defined using YAML configuration files, which lets you specify prompts, answer extraction logic, and scoring methods without writing Python code.
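As a rough illustration of the command-line workflow described above, the sketch below assembles an `lm_eval` invocation from a model, a task list, and a batch size. The helper `build_eval_command` is hypothetical (it is not part of the harness), and the model and task names are example values. The flag names are the ones commonly shown in the project's README; they are assumed here and may differ between versions, especially with the newer subcommand-based interface.

```python
import shlex


def build_eval_command(model, model_args, tasks, batch_size, output_path=None):
    """Assemble an lm_eval command line as a list of arguments.

    Hypothetical helper for illustration; flag names are assumed from the
    project's README and should be checked against the installed version.
    """
    cmd = [
        "lm_eval",
        "--model", model,               # backend name, e.g. "hf" for Hugging Face
        "--model_args", model_args,     # backend-specific key=value pairs
        "--tasks", ",".join(tasks),     # comma-separated task names
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        # Optional: also save results to disk instead of only printing them.
        cmd += ["--output_path", output_path]
    return cmd


# Example values only; substitute your own model and tasks.
command = build_eval_command(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
    output_path="results",
)

# Print a shell-ready string; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting problems.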

The project's changelog shows regular additions, including multimodal (text plus image) evaluation support, support for chain-of-thought reasoning traces, and a refactored command-line interface with subcommands.

Where it fits

Run standardized benchmarks on any Hugging Face model to measure its performance on reading comprehension, reasoning, and math tasks.
Compare two language models head-to-head on the same tasks using reproducible settings so results can be cited in research.
Define a custom evaluation task with a YAML file to test a model on your own domain-specific prompts and scoring logic.
Evaluate commercial models via API (OpenAI or Anthropic) against open academic benchmarks and compare them with open-source alternatives.
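The head-to-head comparison use case above could look roughly like the following sketch using the Python API. `run_benchmarks` and `compare_scores` are hypothetical helpers written for this example. The call to `lm_eval.simple_evaluate` and the `"acc,none"` metric key reflect how the library is commonly used, but the exact result layout is an assumption and may vary by version and task. The numbers in the demo are hand-written placeholders, not real benchmark results.

```python
def run_benchmarks(pretrained, tasks, batch_size=8):
    """Evaluate one Hugging Face model on the given tasks.

    Requires the harness and a model backend to be installed.
    """
    import lm_eval  # imported lazily so compare_scores works without it

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={pretrained}",
        tasks=tasks,
        batch_size=batch_size,
    )


def compare_scores(results_a, results_b, metric="acc,none"):
    """Return {task: (score_a, score_b)} for tasks present in both runs.

    Assumes each results dict has a "results" mapping of task -> metrics.
    """
    scores_a = results_a["results"]
    scores_b = results_b["results"]
    shared = sorted(set(scores_a) & set(scores_b))
    return {
        task: (scores_a[task][metric], scores_b[task][metric])
        for task in shared
        if metric in scores_a[task] and metric in scores_b[task]
    }


# Placeholder dictionaries shaped like evaluation output; values are made up.
fake_a = {"results": {"hellaswag": {"acc,none": 0.50}, "arc_easy": {"acc,none": 0.60}}}
fake_b = {"results": {"hellaswag": {"acc,none": 0.55}}}

comparison = compare_scores(fake_a, fake_b)
print(comparison)
```

In real use, the two inputs would come from two `run_benchmarks` calls with identical task lists and settings, which is what makes the scores comparable.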
