MemoryArena

Python ★ 23 updated 18d ago

Academic benchmark that tests how well AI agents remember information across separate task sessions. Compares memory approaches like keyword search, embeddings, and graph retrieval across shopping, travel, search, and reasoning tasks.

Python · OpenAI API · Anthropic API · Google AI API · OpenRouter · BM25 · GraphRAG · Mem0 · setup: hard · complexity 4/5

MemoryArena is the code release for an academic research paper that benchmarks how well AI agents remember information across separate task sessions. The core research question: if an agent completes a task in one session, and a later task depends on what it learned or did there, how reliably does it carry that memory forward? The paper introduces a suite of tasks whose sessions are interdependent by design, so memory becomes a critical factor in performance.

The codebase is a Python framework with three main parts: agents (which take in a task and produce actions), environments (which execute those actions and return observations), and memory systems (which store and retrieve information between steps or sessions). The flow for each task step is: the memory system wraps the incoming task prompt with relevant stored context, the agent generates an action, the environment executes it, and the result is stored back into memory for future steps.
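The per-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the class names (ListMemory, EchoAgent, EchoEnvironment), the method names (wrap, store, act, step), and the run_step helper are all hypothetical, and the agent is a stand-in that makes no model call.

```python
from dataclasses import dataclass, field


@dataclass
class ListMemory:
    """Hypothetical memory system: stores every record and replays all of them."""
    records: list[str] = field(default_factory=list)

    def wrap(self, prompt: str) -> str:
        # Prepend stored context to the incoming task prompt.
        if not self.records:
            return prompt
        context = "\n".join(self.records)
        return f"Context from earlier steps:\n{context}\n\nTask: {prompt}"

    def store(self, record: str) -> None:
        self.records.append(record)


class EchoAgent:
    """Hypothetical agent: a real one would send the prompt to an LLM."""

    def act(self, prompt: str) -> str:
        return "do:" + prompt.splitlines()[-1]


class EchoEnvironment:
    """Hypothetical environment: executes an action and returns an observation."""

    def step(self, action: str) -> str:
        return f"observed({action})"


def run_step(task: str, memory: ListMemory, agent: EchoAgent, env: EchoEnvironment) -> str:
    prompt = memory.wrap(task)            # 1. memory wraps the task prompt
    action = agent.act(prompt)            # 2. agent generates an action
    observation = env.step(action)        # 3. environment executes it
    memory.store(f"{task} -> {observation}")  # 4. result is stored for later steps
    return observation


memory, agent, env = ListMemory(), EchoAgent(), EchoEnvironment()
run_step("add milk to cart", memory, agent, env)
run_step("check out", memory, agent, env)
print(memory.wrap("what is in the cart?"))
```

Swapping ListMemory for a retrieval-based class with the same two methods is the kind of substitution a benchmark like this compares.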

Several memory approaches are included so they can be compared against each other. These range from simply giving the agent a long context window, to retrieval systems based on keyword search (BM25) or text embeddings, to graph-based retrieval (GraphRAG), to third-party memory services (Letta, Mirix, Mem0). The benchmark environments cover web shopping, travel planning, web search, and formal reasoning tasks.
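To make the keyword-search option concrete, here is a small, self-contained sketch of BM25 ranking over stored memory snippets. It is an illustration of the general Okapi BM25 formula (with the non-negative IDF variant), not the benchmark's implementation; the function names, the whitespace tokenizer, the sample snippets, and the default k1 and b values are assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices ordered from most to least relevant."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


snippets = [
    "user prefers window seats on flights",
    "cart contains two red mugs",
    "hotel booked in Lisbon for May",
]
best = bm25_rank("which seats does the user prefer on flights", snippets)[0]
print(snippets[best])
```

Note that "prefer" in the query does not match "prefers" in the snippet: exact-token matching is the main limitation that embedding-based retrieval is meant to address.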

Running the code requires API keys for multiple AI providers (OpenAI, Anthropic, Google, OpenRouter) as well as separate keys for any third-party memory services used. Setup instructions for each environment are in separate markdown files in the repository.
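Because several keys are needed, a quick pre-flight check can save a failed run. The sketch below is an assumption-laden example: the environment variable names follow common provider conventions and may differ from what the repository expects, so check its setup files; the missing_keys helper is hypothetical.

```python
import os

# Assumed variable names; the repository's setup docs are the authority here.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def missing_keys(env=None) -> list[str]:
    """Return the providers whose key is unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [provider for provider, var in PROVIDER_KEYS.items() if not env.get(var)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys for: " + ", ".join(missing))
    else:
        print("All provider keys found.")
```

Keys for third-party memory services would be added to the same mapping once their expected variable names are known.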

The README describes this as a preview version that is still being actively maintained. No license is stated in the README.

Where it fits

Evaluate how different memory systems help AI agents carry knowledge from one session to the next
Compare retrieval approaches like keyword search, embeddings, and graph-based memory side by side
Test AI agent performance on realistic tasks like shopping and travel planning that require memory
Use as a starting point for building or improving memory systems for your own AI agent
