MobileAgent

Python ★ 8.9k updated 1mo ago

Mobile-Agent: The Powerful GUI Agent Family

An Alibaba research project that builds AI agents capable of controlling Android phones, Windows, macOS, and browsers by visually reading the screen. Give it a plain-English task and it taps, types, and clicks to complete it without any app integrations.

Python · setup: hard · complexity 4/5

MobileAgent is a research project from Alibaba's Tongyi Lab that builds AI agents capable of operating mobile phones and computers by looking at the screen and taking actions, just as a person would. Instead of using APIs or special integrations with apps, these agents read the graphical interface from what is shown on screen and decide what to tap, type, or click in order to complete a task described in plain language.
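To make the "decide what to tap, type, or click" step concrete, here is a minimal Python sketch of turning a model's text reply into a structured action. The `Action` fields and the reply format (`tap 120 340`, `type hello world`) are assumptions made up for illustration; they are not the project's actual action schema or output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """One GUI action. Field names are illustrative, not the project's schema."""
    kind: str                   # "tap", "click", or "type"
    x: Optional[int] = None     # screen coordinates for tap/click
    y: Optional[int] = None
    text: Optional[str] = None  # text to enter for "type"


def parse_action(reply: str) -> Action:
    """Parse a hypothetical model reply such as 'tap 120 340' or 'type hello world'."""
    parts = reply.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty reply")
    kind = parts[0].lower()
    rest = parts[1] if len(parts) > 1 else ""
    if kind in ("tap", "click"):
        # Expect exactly two integers; anything else raises ValueError.
        x_str, y_str = rest.split()
        return Action(kind=kind, x=int(x_str), y=int(y_str))
    if kind == "type":
        return Action(kind="type", text=rest)
    raise ValueError(f"unknown action: {kind!r}")
```

In a real agent the parsed action would then be sent to the device (for example through an Android debugging bridge or an OS input library); that part is omitted here because it depends on the target platform.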

The project has gone through multiple versions. The current line includes Mobile-Agent-v3.5, which works across Android phones, desktop operating systems (Windows and macOS), and web browsers. It is built on top of GUI-Owl-1.5, a family of AI models the team also released publicly, available in sizes ranging from 2 billion to 235 billion parameters. These models understand screenshots, can locate specific interface elements on screen, and can carry out multi-step tasks from a single instruction.

For longer or more complex tasks, the framework uses separate components for planning what to do next, tracking progress through a task, checking whether previous steps succeeded, and keeping relevant information in memory across steps. On standard benchmarks used to measure how well AI agents operate computers and phones, the project reports top results across more than 20 evaluation sets.
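The division of labor described above (planning, progress tracking, checking previous steps, and memory) can be sketched as a small control loop. This is a simplified illustration under stated assumptions, not the framework's implementation: `TaskState`, `run_task`, and the `execute` and `succeeded` callbacks are hypothetical names, and the plan is supplied up front rather than produced by a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskState:
    """Hypothetical container for what the agent tracks across steps."""
    plan: List[str]                                         # planned sub-steps
    done: List[str] = field(default_factory=list)           # progress so far
    memory: Dict[str, str] = field(default_factory=dict)    # notes kept across steps


def run_task(
    plan: List[str],
    execute: Callable[[str, Dict[str, str]], str],
    succeeded: Callable[[str, str], bool],
    max_retries: int = 2,
) -> TaskState:
    """Run each planned step, check it, retry on failure, and remember results."""
    state = TaskState(plan=list(plan))
    for step in state.plan:
        for _attempt in range(max_retries + 1):
            result = execute(step, state.memory)   # act on the screen
            if succeeded(step, result):            # check whether the step worked
                state.done.append(step)            # track progress
                state.memory[step] = result        # keep information for later steps
                break
        else:
            # Every attempt failed: stop instead of acting on a bad state.
            break
    return state
```

Keeping the success check separate from execution lets a failed step be retried or the task be stopped before later steps build on a wrong screen state.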

For people who want to try it without local setup, Alibaba provides online demos through ModelScope and its Bailian cloud platform, including a cloud-hosted Android phone you can control remotely. For researchers and developers who want to run it locally, code and model weights are available on HuggingFace and ModelScope. The project received best demo awards at Chinese computational linguistics conferences in both 2024 and 2025, and earlier versions appeared at NeurIPS 2024 and ICLR workshops.

Where it fits

Automate multi-step tasks on Android or desktop by describing what you want in plain English instead of writing scripts.
Run GUI agent benchmarks on Android, Windows, or macOS using state-of-the-art vision-language models from the GUI-Owl-1.5 family.
Try cloud-hosted Android phone control through Alibaba ModelScope without any local setup or GPU.
