Life-Harness
Official implementation of "Life-Harness"
Life-Harness improves AI agent performance on tasks by adapting the code layer between the model and the task environment. No retraining is needed, and the paper reports an average 88.5% relative gain across 126 model-benchmark combinations.
Life-Harness is a research system that improves how AI agents perform on tasks without changing the AI model itself. The core idea is that when a frozen AI model repeatedly fails at a task, you do not need to retrain it. Instead, you can adapt the layer of code that sits between the model and the task environment, which the researchers call the runtime interface.
The system observes where an agent fails, then adds lightweight runtime adjustments in four areas: how model decisions are translated into actions the environment can execute, how the task's rules and constraints are made explicit to the model, how multi-step interaction sequences are regulated to prevent the model from repeating the same failure, and how successful recovery patterns from past runs are stored and reused. None of these changes touch the model's internal weights, and the benchmark environments used for testing remain unmodified.
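The four adjustment areas can be pictured as hooks on a thin wrapper around a frozen model. The sketch below is illustrative only and is not taken from the Life-Harness code: the class `RuntimeHarness`, its methods, the substring-based action matching, and the repeat limit are all hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RuntimeHarness:
    """Hypothetical wrapper between a frozen model and an unmodified environment."""

    model: Callable[[str], str]      # frozen model: prompt text -> raw decision text
    valid_actions: List[str]         # actions the environment can execute
    rules: str = ""                  # task rules stated explicitly in every prompt
    max_repeats: int = 2             # failures allowed before an action is blocked
    recovery_memory: Dict[str, str] = field(default_factory=dict)
    failure_counts: Dict[str, int] = field(default_factory=dict)

    def build_prompt(self, observation: str) -> str:
        # Area 2: make the task's rules and constraints explicit to the model.
        parts = [f"Rules: {self.rules}", f"Observation: {observation}"]
        # Area 4: reuse a recovery that worked for this observation in a past run.
        hint = self.recovery_memory.get(observation)
        if hint is not None:
            parts.append(f"Previously successful action: {hint}")
        return "\n".join(parts)

    def translate(self, raw_decision: str) -> Optional[str]:
        # Area 1: map a free-form model decision onto an executable action.
        text = raw_decision.lower()
        for action in self.valid_actions:
            if action.lower() in text:
                return action
        return None

    def step(self, observation: str) -> Optional[str]:
        action = self.translate(self.model(self.build_prompt(observation)))
        # Area 3: regulate the trajectory by refusing an action that keeps failing.
        if action is not None and self.failure_counts.get(action, 0) >= self.max_repeats:
            return None
        return action

    def record(self, observation: str, action: str, success: bool) -> None:
        # Outcomes feed both the repeat guard and the recovery memory.
        if success:
            self.recovery_memory[observation] = action
        else:
            self.failure_counts[action] = self.failure_counts.get(action, 0) + 1


def fake_model(prompt: str) -> str:
    """Stand-in for a frozen model; a real run would call a model API here."""
    return "I think I should Open Drawer next."


harness = RuntimeHarness(
    model=fake_model,
    valid_actions=["open drawer", "go to desk"],
    rules="Only one object may be held at a time.",
)
observation = "You are in a room with a drawer."
first_action = harness.step(observation)
```

In this sketch the model and the environment are never modified; only the wrapper's state (`failure_counts`, `recovery_memory`) changes between steps, which mirrors the claim that the adaptation lives entirely in the runtime interface.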
The results reported in the paper cover seven task benchmarks, ranging from household navigation and web shopping to database querying and operating system interaction. Across eighteen AI model backbones, which gives 126 model-environment combinations, Life-Harness improved performance in 116 of the 126, with an average relative gain of 88.5%. The method requires no training.
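To make the headline numbers concrete, the sketch below works through the arithmetic. It assumes the usual definition of relative gain, (harnessed - baseline) / baseline; the paper's exact averaging procedure is not described here, and the two scores in the example are made-up values chosen only to show what an 88.5% relative gain looks like.

```python
def relative_gain(baseline: float, harnessed: float) -> float:
    """Relative improvement over the baseline score (assumed definition)."""
    if baseline <= 0:
        raise ValueError("relative gain is undefined for a non-positive baseline")
    return (harnessed - baseline) / baseline


# Counts stated in the summary above.
num_benchmarks = 7
num_backbones = 18
num_combinations = num_benchmarks * num_backbones   # 126
num_improved = 116
improved_share = num_improved / num_combinations    # roughly 0.92

# Hypothetical scores, not results from the paper.
example_gain = relative_gain(baseline=20.0, harnessed=37.7)

print(f"combinations: {num_combinations}")
print(f"share improved: {improved_share:.1%}")
print(f"example relative gain: {example_gain:.1%}")
```

Because relative gain divides by the baseline, a low starting score can yield a large percentage from a modest absolute change, so the 88.5% average is best read alongside the 116-of-126 count.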
The repository is structured in two parts matching two families of benchmark tasks: AgentBench-style tasks (which use Docker containers) and tau-bench-style tasks (which use a Python environment manager called uv). Each subfolder contains its own setup instructions. Users need to supply their own API keys for whatever AI model they want to test. The code accompanies a paper published on arXiv in 2026.
Where it fits
- Improve a frozen AI model's performance on tasks like web shopping or database querying without retraining it, just by adapting the runtime interface layer.
- Reproduce the Life-Harness arXiv 2026 paper results by running AgentBench or tau-bench tasks with your own AI model API keys.
- Study how storing and reusing successful agent recovery patterns from past runs prevents models from repeating the same failures.
- Test how different AI model backbones respond to runtime adaptations across seven benchmark task families.