claude-code-vision-skill

Python ★ 44 updated 3d ago

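Gives Claude Code multimodal vision capability, supporting models such as Doubao, Tongyi Qianwen (Qwen), and GPT-4o for screenshot / UI / chart analysis; works with base models that lack vision, such as DeepSeek, and can be paired with browser-harness for automated front-end layout checks.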

Adds image-analysis capability to Claude Code setups that lack vision support, by routing image questions to an external vision-capable AI model such as GPT-4o, Qwen, or Doubao.

Python · setup: moderate · complexity 2/5

Claude Code is a coding assistant that can be configured to run on different underlying AI models. Some setups are backed by models, such as DeepSeek, that cannot process images and can only work with text. This project adds image analysis to those setups by routing image questions to a separate AI model that does support visual input.

The idea is straightforward: when you have an image you want Claude Code to analyze, such as a screenshot, a user interface layout, or a chart, this skill passes that image to a vision-capable model and returns the model's answer. Three providers are supported: Doubao (ByteDance's AI service), Qwen (Alibaba's AI service), and OpenAI's GPT-4o. You choose which one to use by setting that provider's API key in an environment variable.
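The provider selection described above could look roughly like the following sketch. The environment variable names, the provider identifiers, and the pick_provider function are all hypothetical, chosen for illustration; the project's actual names may differ.

```python
from typing import Mapping, Optional

# Hypothetical variable names; the project's real ones may differ.
PROVIDER_ENV_VARS = {
    "doubao": "DOUBAO_API_KEY",
    "qwen": "QWEN_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def pick_provider(env: Mapping[str, str], override: Optional[str] = None) -> str:
    """Return the override if given and usable, else the first provider with a key set."""
    if override is not None:
        if override not in PROVIDER_ENV_VARS:
            raise ValueError(f"unknown provider: {override}")
        if not env.get(PROVIDER_ENV_VARS[override]):
            raise RuntimeError(f"{PROVIDER_ENV_VARS[override]} is not set")
        return override
    # No override: fall back to whichever provider has an API key configured.
    for name, var in PROVIDER_ENV_VARS.items():
        if env.get(var):
            return name
    raise RuntimeError("no vision provider API key found in the environment")
```

In real use the mapping would be os.environ; taking it as a parameter keeps the function easy to test without touching the real environment.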

Installation is handled by a Python script that walks through the necessary steps: asking which provider you want to use, setting up the API key, and updating a configuration file that Claude Code reads when starting up. The configuration update inserts skill instructions into a global Claude Code settings file, with markers so the content can be replaced cleanly on future updates.
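The marker-based update can be sketched as below. This is a minimal illustration of the general technique, not the project's installer: the marker strings and the upsert_block function are assumptions, and the code works on the file's text rather than on any particular settings path.

```python
# Hypothetical marker strings; the installer's real markers may differ.
BEGIN = "<!-- vision-skill:begin -->"
END = "<!-- vision-skill:end -->"


def upsert_block(text: str, body: str) -> str:
    """Replace the marked block in text with body, or append a new block if none exists."""
    block = f"{BEGIN}\n{body.strip()}\n{END}"
    start = text.find(BEGIN)
    end = text.find(END, start + len(BEGIN)) if start != -1 else -1
    if start != -1 and end != -1:
        # Existing block found: swap it out and leave everything else untouched.
        return text[:start] + block + text[end + len(END):]
    # No block yet: append one, separated from prior content by a blank line.
    if not text or text.endswith("\n\n"):
        sep = ""
    elif text.endswith("\n"):
        sep = "\n"
    else:
        sep = "\n\n"
    return text + sep + block + "\n"
```

Because the block is located by its markers, running the update twice with the same body leaves the file unchanged, and running it with a new body replaces only the marked region.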

Once installed, the skill can be called from the command line with an image file and a question in plain text. You can also specify a particular provider at call time if you want to override the default. The README mentions that this skill is designed to pair with a separate browser-harness skill for checking the visual layout of web pages automatically.
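A command-line interface of the shape described above might be parsed as in this sketch. The argument names, the --provider flag, and the provider identifiers are hypothetical; the vision script's real interface may be different.

```python
import argparse
from typing import Optional, Sequence

# Hypothetical provider identifiers, used only for this illustration.
PROVIDERS = ("doubao", "qwen", "openai")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser taking an image path, a question, and an optional provider."""
    parser = argparse.ArgumentParser(description="Ask a vision model about an image.")
    parser.add_argument("image", help="path to the image file")
    parser.add_argument("question", help="plain-text question about the image")
    parser.add_argument(
        "--provider",
        choices=PROVIDERS,
        default=None,  # None means: fall back to the configured default
        help="override the default provider for this call",
    )
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

With this shape, a call passes the image path and the question as positional arguments, and adds --provider only when the default should be overridden for that one call.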

The README is written primarily in Chinese, reflecting its intended audience. The project is small: one installation script, one vision script, and a skill definition file.

Where it fits

Analyze UI screenshots or charts inside a text-only Claude Code session by routing them to GPT-4o, Qwen, or Doubao.
Automatically check the visual layout of web pages by pairing this skill with a browser-harness skill.
Override the default vision provider at call time to switch between Doubao, Qwen, and GPT-4o for different tasks.
