clawbench
Python
★ 116
updated 10d ago
The agent benchmark that scores the full stack — harness, config, and model — not just the LLM. Trace-based scoring, reliability metrics, configuration diagnostics.