midscene

TypeScript ★ 14k updated 3d ago

AI-powered, vision-driven UI automation for every platform.

A TypeScript library from ByteDance that automates web browsers and mobile apps using plain-language instructions. You describe what to do in natural language, and it finds the right element by looking at a screenshot.

TypeScript · JavaScript · Puppeteer · Playwright · Python · Java
setup: moderate · complexity: 3/5

Midscene is a TypeScript library that lets you automate web browsers, Android devices, and iOS devices using plain-language instructions instead of code that points to specific HTML elements. You describe what you want to do in natural language ("click the login button" or "fill in the username field"), and Midscene figures out what to interact with by looking at a screenshot of the screen.

The core idea is that Midscene locates elements with visual AI models working from screenshots, rather than by reading the page's HTML structure. This approach works on anything visible on screen, including web pages, mobile apps, desktop applications, and HTML canvas surfaces. Supported AI models include Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS, an open-source model from ByteDance that can be self-hosted.

For developers, the library offers three types of API calls: interaction methods for clicking, typing, and navigating; data extraction methods for pulling structured information out of a page; and utility functions such as assertions and element locators. It integrates with the existing browser automation tools Puppeteer and Playwright, and it also has a Bridge Mode for controlling a desktop browser session without writing a full automation script. Android support uses ADB, and iOS support uses WebDriverAgent.
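To make the three categories of calls concrete, here is a minimal sketch of how a script might combine them. The interface and method names (VisionAgent, act, query, check) are hypothetical and chosen for illustration; they are not Midscene's actual API. A recording stub stands in for a real agent so the example runs without a browser or a model.

```typescript
// Hypothetical interface: one method per API category described above.
// These names are illustrative assumptions, not Midscene's real method names.
interface VisionAgent {
  // Interaction: perform a step described in plain language.
  act(instruction: string): Promise<void>;
  // Data extraction: return structured data described in plain language.
  query<T>(description: string): Promise<T>;
  // Utility: fail if the described condition does not hold on screen.
  check(assertion: string): Promise<void>;
}

// Stand-in agent that records each call and returns canned data.
// A real agent would take a screenshot and consult a vision model instead.
class RecordingAgent implements VisionAgent {
  readonly log: string[] = [];
  private readonly cannedData: unknown;

  constructor(cannedData: unknown) {
    this.cannedData = cannedData;
  }

  async act(instruction: string): Promise<void> {
    this.log.push(`act: ${instruction}`);
  }

  async query<T>(description: string): Promise<T> {
    this.log.push(`query: ${description}`);
    return this.cannedData as T;
  }

  async check(assertion: string): Promise<void> {
    this.log.push(`check: ${assertion}`);
  }
}

interface Product {
  name: string;
  price: number;
}

// A script that uses all three categories in sequence.
async function searchAndExtract(agent: VisionAgent): Promise<Product[]> {
  await agent.act('type "headphones" into the search box and press Enter');
  const products = await agent.query<Product[]>(
    "name and price of every product in the results list",
  );
  await agent.check("the results list shows at least one product");
  return products;
}
```

Because the script depends only on the interface, the same flow could in principle be driven by a Puppeteer-backed, Playwright-backed, or mobile-backed agent; that separation is a design choice of this sketch, not a statement about how the library is structured.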

A Chrome extension is available for trying out automation without writing any code. YAML is supported as an alternative to JavaScript for writing automation scripts, which may be more accessible for non-developers. A caching system replays scripts faster on subsequent runs by skipping the AI reasoning step when the page has not changed.
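The caching idea can be sketched as a lookup keyed by the instruction plus a fingerprint of the screen: if both match a previous run, the stored result is reused and the model is not called. Everything below is an assumption made for illustration (the Plan shape, the PlanCache class, and fingerprinting raw screenshot bytes with a toy hash); the source does not describe how Midscene actually detects that a page has not changed.

```typescript
// Hypothetical result of the AI reasoning step: where to click.
type Plan = { x: number; y: number };

// Hypothetical planner: in a real system this would call a vision model.
type Planner = (instruction: string, screenshot: Uint8Array) => Promise<Plan>;

// Toy 32-bit FNV-1a hash over the screenshot bytes. Used here only as a
// cheap fingerprint; collisions are possible, so this is not production-grade.
function fingerprint(bytes: Uint8Array): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i] as number;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class PlanCache {
  hits = 0;
  private readonly entries = new Map<string, Plan>();
  private readonly planner: Planner;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Return a cached plan when the instruction and screen are unchanged;
  // otherwise run the planner once and remember its answer.
  async resolve(instruction: string, screenshot: Uint8Array): Promise<Plan> {
    const key = `${instruction}\u0000${fingerprint(screenshot)}`;
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      this.hits += 1;
      return cached;
    }
    const plan = await this.planner(instruction, screenshot);
    this.entries.set(key, plan);
    return plan;
  }
}
```

In this sketch a repeated run with the same instruction and identical screenshot bytes skips the planner, while any change to the screen produces a different key and forces a fresh call.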

The project is licensed under MIT and maintained by the web infrastructure team at ByteDance. Community SDK ports exist for Python and Java.

Where it fits