crawlab
Distributed web crawler admin platform for managing spiders regardless of language or framework.
A web-based platform for running, scheduling, and monitoring web scrapers written in any language across multiple machines. Upload your crawler code and manage everything from one dashboard.
Web crawlers are programs that automatically browse websites and collect data from them. Managing many crawlers at once across multiple computers is complex, and that is what Crawlab addresses. It is a platform for running, scheduling, and monitoring web crawlers built in any programming language, all from a central web dashboard.
You install Crawlab using Docker, a tool that packages software for easy setup. One computer acts as the main control point (called the master node), and any number of additional computers can serve as worker nodes that run the actual crawling jobs. This setup lets you spread heavy crawling workloads across many machines and scale up by simply adding more workers.
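To make the master/worker idea concrete, here is a minimal sketch of spreading a list of jobs over a pool of workers in rotating order. It is an illustration only, not Crawlab's actual scheduling logic; the function name `assign_round_robin` and the node names `worker01`, `worker02`, and `worker03` are assumptions made up for this example.

```python
from itertools import cycle


def assign_round_robin(tasks, workers):
    """Assign each task to a worker in rotating order.

    Both arguments are plain lists of names; returns {worker: [tasks]}.
    Hypothetical helper for illustration, not part of Crawlab.
    """
    if not workers:
        raise ValueError("at least one worker node is required")
    assignment = {worker: [] for worker in workers}
    # cycle() repeats the worker list endlessly; zip() stops when tasks run out.
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment


if __name__ == "__main__":
    jobs = [f"crawl-page-{i}" for i in range(1, 7)]
    print(assign_round_robin(jobs, ["worker01", "worker02"]))
    # Adding a third worker shrinks each node's share of the same job list.
    print(assign_round_robin(jobs, ["worker01", "worker02", "worker03"]))
```

With six jobs, two workers take three jobs each, while three workers take two each, which is the sense in which adding workers scales the system.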
Through the web interface, you can upload crawler code, assign tasks to specific nodes, schedule jobs to run on a timed basis, and view results and logs for each run. The platform works with crawlers written in Python, Node.js, Go, Java, and PHP, as well as popular crawling tools such as Scrapy, Puppeteer, and Selenium. It does not care what technology your crawler uses internally, as long as it can run on the worker nodes.
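Because the platform only needs a program that the worker nodes can execute, a crawler can be as small as a single script. The following is a minimal sketch in Python using only the standard library. The sample HTML is hard-coded so the script runs offline, and the names `LinkExtractor` and `extract_links` are invented for this example. How a script reports its results back to Crawlab is not covered here, so the sketch simply prints one JSON object per link.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string, in order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/a">A</a> <a href="https://example.com/b">B</a></p>'
    for link in extract_links(sample):
        print(json.dumps({"url": link}))
```

A real crawler would fetch pages over the network instead of parsing a fixed string, but the shape is the same: a standalone program that starts, collects data, and exits.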
Internally, the master and worker nodes talk to each other using gRPC, a framework for sending structured messages between programs across a network. Crawler files are synchronized across nodes using SeaweedFS, a distributed file system. Task data, scheduling information, and logs are stored in MongoDB, a document database suited to loosely structured data of this kind.
The quick start requires only Docker and a short configuration file to get a working local setup with a master node and two workers running together. Documentation is available in both English and Chinese.
Where it fits
- Run dozens of web scrapers in parallel across multiple servers and monitor logs and results from a single dashboard.
- Schedule a nightly data-collection job that distributes work across a pool of worker machines.
- Manage Scrapy, Puppeteer, or Selenium crawlers without writing your own job-queue or task-scheduling infrastructure.
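
The nightly job in the second scenario comes down to working out when the next run is due. Below is a minimal sketch of that calculation, assuming a job that fires once a day at a fixed local time. It is not Crawlab's scheduler; the function name `next_nightly_run` and the 02:00 default are assumptions chosen for the example.

```python
from datetime import datetime, time, timedelta


def next_nightly_run(now, run_at=time(2, 0)):
    """Return the next datetime at which a once-a-day job should fire.

    `now` is a naive datetime in local time; `run_at` is the daily slot.
    Hypothetical helper for illustration, not part of Crawlab.
    """
    candidate = datetime.combine(now.date(), run_at)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_nightly_run(datetime.now()))
```

In practice the platform's own scheduling feature takes care of this, which is the point of the third scenario: you do not have to write or maintain such code yourself.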