deepeval

Python · ★ 16k · updated 2d ago

The LLM Evaluation Framework

An open-source testing framework for AI apps. It works like Pytest but adds built-in metrics for scoring chatbot, agent, and RAG output quality, so you can catch regressions before they ship.

Python · LangChain · OpenAI SDK · setup: moderate · complexity 3/5

DeepEval is an open-source framework for testing large language model (LLM) applications — chatbots, AI agents, retrieval pipelines, and the like. The pitch in the README is that it works "similar to Pytest but specialized for unit testing LLM apps": you write small test cases that check whether your AI is doing what you expect, and the framework runs them and reports the results.

The hard part of testing an LLM is that there is rarely a single correct answer to compare against, so DeepEval ships a catalogue of ready-made metrics that score outputs in different ways. Some are general-purpose, like G-Eval (a research-backed approach that uses another LLM as a judge against custom criteria) and DAG (a graph-based deterministic judge builder). Others are grouped by use case: agentic metrics such as task completion, tool correctness, and plan adherence; RAG metrics such as answer relevancy, faithfulness, and contextual recall; multi-turn metrics for chatbots covering knowledge retention and role adherence; and MCP-specific metrics. The metrics can be powered by any LLM you choose, by statistical methods, or by smaller NLP models that run locally on your machine.
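To make the idea of a scored metric concrete, here is a minimal sketch in plain Python of a statistical metric with a pass threshold. It deliberately does not use DeepEval's API: `EvalCase`, `token_recall`, `passes`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the token-overlap score is only a crude stand-in for the kind of recall-style metric described above.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical container for one LLM interaction (not DeepEval's class)."""
    input: str
    actual_output: str
    expected_output: str


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(case: EvalCase) -> float:
    """Fraction of expected-output tokens that also appear in the actual output."""
    expected = tokenize(case.expected_output)
    if not expected:
        return 1.0  # nothing was required, so nothing is missing
    actual = tokenize(case.actual_output)
    return len(expected & actual) / len(expected)


def passes(case: EvalCase, threshold: float = 0.5) -> bool:
    """A case passes when its score reaches the (assumed) threshold."""
    return token_recall(case) >= threshold


good = EvalCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris is the capital of France",
)
bad = EvalCase(
    input="What is the capital of France?",
    actual_output="I think it is Lyon.",
    expected_output="Paris is the capital of France",
)

print(passes(good), passes(bad))
```

Because the score is a number rather than an exact-match check, differently worded but equivalent answers can still pass. An LLM-as-judge metric follows the same shape, with the scoring function replaced by a model call.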

You would reach for DeepEval when you are building an AI app and want a repeatable way to know whether a change to the prompt, the model, or the retrieval setup actually made the system better. That includes swapping providers, for example checking that a move from OpenAI to Claude does not degrade quality. It is a Python package designed to plug into stacks like LangChain or the OpenAI SDK, and it pairs with the paid Confident AI platform for storing and sharing test runs.

Where it fits

Write repeatable test cases to check whether your chatbot or AI agent gives correct, relevant answers.
Compare prompt versions or model swaps (e.g. OpenAI to Claude) to measure which actually performs better.
Test a RAG pipeline for faithfulness and contextual recall after changing the retrieval or chunking setup.
Add LLM quality gates to CI so model regressions are caught before a prompt or model change ships to users.
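The comparison and CI-gate use cases above boil down to checking a candidate run's metric scores against a baseline. The following is a rough sketch of that logic in plain Python, not DeepEval's own mechanism: `gate_failures`, `min_score`, and `max_drop` are hypothetical names, and the default limits are arbitrary illustrative values.

```python
from statistics import mean


def gate_failures(baseline, candidate, min_score=0.7, max_drop=0.05):
    """Return a list of human-readable reasons the candidate fails the gate.

    baseline / candidate: per-test-case metric scores in [0, 1] for the
    current and the proposed prompt or model. An empty list means "pass".
    """
    reasons = []
    base_avg = mean(baseline)
    cand_avg = mean(candidate)

    # Absolute floor: the candidate must be good enough on its own.
    if cand_avg < min_score:
        reasons.append(
            f"mean score {cand_avg:.2f} is below the floor {min_score:.2f}"
        )

    # Relative check: the candidate must not regress much from the baseline.
    if base_avg - cand_avg > max_drop:
        reasons.append(
            f"mean score dropped {base_avg - cand_avg:.2f} "
            f"from baseline {base_avg:.2f}"
        )
    return reasons


# Hypothetical scores for the same three test cases under two prompt versions.
baseline_scores = [0.90, 0.80, 0.85]
candidate_scores = [0.60, 0.50, 0.70]

for reason in gate_failures(baseline_scores, candidate_scores):
    print("FAIL:", reason)
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so the pipeline blocks the change.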
