gitmyhub

UI-TARS

Python ★ 11k updated 4mo ago

Pioneering Automated GUI Interaction with Native Agents

An AI agent from ByteDance that looks at a computer or phone screen and performs tasks by clicking, typing, and navigating, you give it a goal in plain English and it works out the steps itself, on desktop, mobile, and web browsers.

PythonPyTorchHugging Facesetup: hardcomplexity 4/5

UI-TARS is an AI agent from ByteDance that can look at a computer screen or phone screen and perform actions on it, just as a human would by clicking, typing, scrolling, and navigating. The model is trained to understand what it sees visually and then figure out what actions to take to complete a given task. It can operate on desktop operating systems like Windows, macOS, and Linux, on mobile devices and Android emulators, and inside web browsers.

The core idea is that instead of writing code to automate a specific task, you give the agent a goal in plain language and it works out the steps itself. The model can reason through a problem before taking action, which makes it more capable on tasks that require multiple steps or where the right path is not obvious from the start. Version 1.5 is built on a vision-language model combined with reinforcement learning training, which is how it develops that reasoning ability. Version 2, also called UI-TARS-2, extends the same approach to cover games, code tasks, and tool use on top of the original GUI capabilities.

To use it, you deploy the model (the repository links to hosting options via Hugging Face) and then call it with a screenshot of the current screen along with a goal. The model returns a description of what action to take, such as clicking at a specific coordinate or typing a word. A post-processing library called ui-tars converts that output into executable code for controlling the mouse and keyboard. There is also a desktop application version in a separate repository for people who want to run the agent on their own machine without setting up the full deployment stack.

Benchmark results show it performing competitively against other AI computer-use systems on standardized tests for browser automation and desktop task completion.

Where it fits