hst-bench
Python
★ 0
updated 2d ago
The HST-Bench evaluation dataset contains 753 agentic tasks, each paired with the time human annotators took to solve it. The dataset was collected as part of our ICML 2026 paper, "Scaling Small Agents Through Strategy Auctions": https://arxiv.org/pdf/2602.02751