hst-bench

Python ★ 0 updated 2d ago

The HST-Bench evaluation dataset contains 753 agentic tasks, each paired with the time human annotators took to solve it. The dataset was collected as part of our ICML 2026 paper on Scaling Small Agents Through Strategy Auctions (https://arxiv.org/pdf/2602.02751).

