self-operating-computer

Python · ★ 10k · updated 9 months ago

A framework to enable a multimodal model to operate a computer.

Self-Operating Computer is a Python framework that lets AI vision models control your computer by looking at the screen and issuing mouse and keyboard actions to complete goals you describe in plain English.

Python · GPT-4o · Claude 3 · Gemini · LLaVa · Ollama · setup: moderate · complexity: 3/5

Self-Operating Computer is a Python framework that lets AI models control a real computer the same way a human would: by looking at the screen and deciding what to click or type. You give it a goal in plain English, such as "open the browser and search for the weather in London", and the AI takes screenshots, figures out where things are on screen, and issues mouse and keyboard actions to complete the task.
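The screenshot, decide, act cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the project's actual code: the names Action, run_loop, capture, decide, and execute are all hypothetical, and the model call and the mouse and keyboard layer are passed in as plain functions so the sketch runs without any API key or GUI access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical structure, for illustration)."""
    kind: str        # "click", "type", or "done"
    x: float = 0.0   # horizontal position as a fraction of screen width
    y: float = 0.0   # vertical position as a fraction of screen height
    text: str = ""   # text to type when kind == "type"


def run_loop(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: take a screenshot, ask the model for an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                           # look at the screen
        action = decide(objective, screenshot, history)  # model picks a step
        if action.kind == "done":                        # model reports success
            break
        execute(action)                                  # move mouse / press keys
        history.append(action)
    return history
```

In a real setup, capture would grab the screen, decide would send the image and objective to a vision model, and execute would drive the mouse and keyboard. The max_steps cap is an assumption added here so a confused model cannot loop forever.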

The system relies on vision-capable AI models. By default it uses GPT-4o, but it also supports Google Gemini Pro Vision, Claude 3, Qwen-VL, and LLaVa, an open-source model run locally via Ollama. Each model looks at a screenshot of your screen and decides what action to take next. Installation is a single pip command, and you start the framework by typing operate in your terminal.

Several modes change how the AI identifies where to click. The default OCR mode uses text recognition to build a map of clickable elements and their positions, which the README describes as the most accurate approach. A Set-of-Mark mode uses a small object-detection model to label buttons and interface elements directly on the screenshot. There is also a voice input option that lets you speak your objective rather than type it.
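To make the idea behind the OCR mode concrete, here is a rough sketch of how recognized text could be turned into a click position. It assumes an OCR engine has already returned text fragments with pixel bounding boxes; OcrBox and find_click_target are hypothetical names, and the substring-matching rule is an assumption chosen for simplicity, not the framework's documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrBox:
    """A piece of recognized text and its bounding box in pixels (hypothetical)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def find_click_target(
    boxes: List[OcrBox],
    label: str,
    screen_size: Tuple[int, int],
) -> Optional[Tuple[float, float]]:
    """Return the center of the first box containing `label`, as screen fractions."""
    wanted = label.strip().lower()
    if not wanted:
        return None  # an empty label would match every box
    screen_w, screen_h = screen_size
    for box in boxes:
        if wanted in box.text.lower():
            center_x = box.left + box.width / 2
            center_y = box.top + box.height / 2
            return center_x / screen_w, center_y / screen_h
    return None  # the label is not visible on screen
```

With this approach the model only has to name the text it wants to click, such as "Search", and the position comes from the OCR map rather than from the model guessing coordinates, which is one plausible reason a text-based map can be more accurate.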

The framework was released in November 2023, and the README describes it as one of the first public examples of an AI system performing full computer control. It works on Mac, Windows, and Linux. On Mac, you need to grant the Terminal app screen recording and accessibility permissions in System Preferences before it can see your screen or move the mouse.

The project requires an API key for whichever AI model you choose to use. It is open source and accepts contributions through the GitHub repository.

Where it fits