DeskCraft

Python ★ 94 updated 16d ago

DeskCraft is a research benchmark of 538 tasks for testing AI agents that control a real Ubuntu desktop across 11 applications, including workflows where a simulated user interrupts and changes the request mid-session.

Python · Ubuntu VM · Docker · Hugging Face · setup: hard · complexity 4/5

DeskCraft is a research benchmark for testing AI agents that control a computer desktop by clicking, typing, and navigating software the way a human would. It accompanies an academic paper and is designed to measure how well these agents perform on real professional tasks rather than toy examples.

The benchmark contains 538 tasks that run inside a live Ubuntu desktop virtual machine. Tasks span 11 applications, including LibreOffice Writer, Calc, and Impress; Chrome; VS Code; GIMP; Inkscape; Kdenlive; Audacity; and Blender. They also include multi-app workflows that require switching between programs. The tasks involve editing documents, manipulating images, writing code, processing audio or video, and working at the operating-system level.

What sets DeskCraft apart from similar benchmarks is its focus on two kinds of difficulty. The first is professional depth: tasks are not one-step actions but longer workflows that produce files, exports, or other concrete deliverables. The second is interactive collaboration: 152 of the 538 tasks (about 28%) evolve during the session, with a simulated user interrupting, clarifying, or revising the request partway through. The benchmark uses scripted triggers to inject these follow-up messages at predictable moments, such as when the agent declares a phase complete or asks a question. This tests whether the agent can adapt instead of blindly executing its original plan.
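To make the scripted-trigger idea concrete, here is a minimal sketch of how a simulated user could release a follow-up message when the agent declares a phase complete or asks a question. It is not DeskCraft's actual implementation: the names ScriptedTrigger, SimulatedUser, phase_complete, and asks_question, and the example messages, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ScriptedTrigger:
    """A follow-up message released once a condition on the agent's output holds."""
    name: str
    condition: Callable[[str], bool]  # hypothetical: inspects the agent's latest message
    follow_up: str
    fired: bool = False


@dataclass
class SimulatedUser:
    """Holds the scripted triggers for one task (illustrative only)."""
    triggers: List[ScriptedTrigger] = field(default_factory=list)

    def react(self, agent_message: str) -> Optional[str]:
        """Return the first unfired follow-up whose condition matches, else None."""
        for trigger in self.triggers:
            if not trigger.fired and trigger.condition(agent_message):
                trigger.fired = True  # each scripted message is injected at most once
                return trigger.follow_up
        return None


def phase_complete(message: str) -> bool:
    # Assumed convention: the agent announces a finished phase in plain text.
    return "phase complete" in message.lower()


def asks_question(message: str) -> bool:
    # Assumed convention: a message ending in "?" counts as a question.
    return message.rstrip().endswith("?")


user = SimulatedUser(triggers=[
    ScriptedTrigger("revise", phase_complete, "Actually, export as PDF instead of DOCX."),
    ScriptedTrigger("clarify", asks_question, "Use the A4 page size."),
])
```

Because each trigger fires once and depends only on the agent's own output, the follow-ups arrive at the same point in every run, which is what keeps an interactive task reproducible.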

Task success is verified automatically by checking the final state of the desktop, project files, exported artifacts, browser state, media metadata, or structured documents, depending on what the task requires.
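As a rough illustration of final-state verification, the sketch below checks that an exported file exists and that a structured document holds an expected value. The file names, the check functions, and the all-checks-must-pass rule are assumptions made for this example; DeskCraft's own verifiers are not described here and may work differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def file_exists_nonempty(path: Path) -> bool:
    """True if the exported artifact is present and has content."""
    return path.is_file() and path.stat().st_size > 0


def json_field_equals(path: Path, key: str, expected: Any) -> bool:
    """True if a JSON document exists and its top-level key has the expected value."""
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return isinstance(data, dict) and data.get(key) == expected


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check against the final state and record the outcome."""
    return {name: check() for name, check in checks.items()}


def demo() -> Dict[str, bool]:
    """Build a throwaway 'final state' and verify it (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "report.pdf").write_bytes(b"%PDF-1.7 placeholder")
        (workdir / "settings.json").write_text(
            json.dumps({"page_size": "A4"}), encoding="utf-8"
        )
        return evaluate({
            "export_exists": lambda: file_exists_nonempty(workdir / "report.pdf"),
            "page_size_is_a4": lambda: json_field_equals(
                workdir / "settings.json", "page_size", "A4"
            ),
            "video_exported": lambda: file_exists_nonempty(workdir / "clip.mp4"),
        })


results = demo()
task_passed = all(results.values())  # assumed rule: every check must succeed
```

Checking artifacts instead of the agent's action trace means two agents that reach the same final state by different routes are scored the same way.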

Setting up the benchmark requires downloading a roughly 25 GB Ubuntu VM image from Hugging Face and configuring a desktop environment provider such as Docker. Researchers run agents through provided Python scripts and collect results in a local directory.
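Given the size of the image, a pre-flight check of free disk space can save a failed download. The sketch below uses only the Python standard library; the 25 GB figure comes from the description above, while the headroom value and the function names are assumptions and not part of the DeskCraft scripts.

```python
import shutil
from pathlib import Path

IMAGE_SIZE_GB = 25  # approximate VM image size quoted above
HEADROOM_GB = 10    # assumption: spare room for unpacking and result files
BYTES_PER_GB = 1024 ** 3


def has_room_for_image(
    free_bytes: int,
    image_gb: float = IMAGE_SIZE_GB,
    headroom_gb: float = HEADROOM_GB,
) -> bool:
    """True if the free space covers the image plus the chosen headroom."""
    return free_bytes >= (image_gb + headroom_gb) * BYTES_PER_GB


def check_download_dir(path: str = ".") -> bool:
    """Check the filesystem that holds `path` before starting the download."""
    free_bytes = shutil.disk_usage(Path(path)).free
    return has_room_for_image(free_bytes)


if __name__ == "__main__":
    if check_download_dir("."):
        print("Enough free space for the VM image.")
    else:
        print("Not enough free space; free some disk or pick another directory.")
```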

Where it fits

Measure how well your AI agent handles real professional desktop tasks like editing documents, writing code, and processing video across 11 applications.
Test whether your agent can adapt when a simulated user interrupts mid-task to clarify or change the original request.
Compare your agent's performance against the DeskCraft paper's baseline scores across the full benchmark suite.
Run automated desktop evaluations that check final file state, exports, and browser output rather than relying on screenshots.
