clawbench

Python ★ 116 updated 10d ago

The agent benchmark that scores the full stack — harness, config, and model — not just the LLM. Trace-based scoring, reliability metrics, configuration diagnostics.

No plain-English explanation yet — one is being written right now. Check back in a minute.

Open on GitHub → Full breakdown on explaingit →