
weibospider

Python ★ 4.8k updated 6y ago

⚡ A distributed crawler for Weibo, built with Celery and Requests.

Weibospider is a distributed Python scraper for Weibo that collects user profiles, posts, comments, reposts, and keyword search results, storing data in MySQL with Celery workers coordinated via Redis.

Python · Celery · Requests · MySQL · Redis · Django · YAML · setup: moderate · complexity: 3/5

Weibospider is a distributed data-collection tool for Weibo, the large Chinese social media platform. It gathers public information including user profiles, original posts from a specific account's homepage, comments on posts, repost relationships, and posts matching a given keyword search. The README is written in Chinese, and the project targets researchers and developers working with Weibo data for analysis or natural language processing.

The system is built on top of two popular Python libraries: Celery, which handles task scheduling and distribution across multiple machines, and Requests, which handles the underlying HTTP communication. Data is stored in a MySQL database, and Redis is used to coordinate the Celery workers. The project explicitly avoids browser automation for login, relying instead on manually analyzed network requests, which the authors say makes the scraper more stable over long runs.
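The division of labour described above (a broker holding queued tasks, workers pulling them off and making HTTP calls) can be sketched with the standard library alone. This is not the project's code. `queue.Queue` stands in for the Redis-backed Celery broker, threads stand in for Celery workers, and `fake_fetch` is a hypothetical stub that replaces a real Requests call so the example runs offline.

```python
import queue
import threading


def fake_fetch(uid):
    """Hypothetical stand-in for an HTTP call made with Requests."""
    return {"uid": uid, "profile": f"profile-of-{uid}"}


def worker(tasks, results):
    """Pull user IDs off the shared queue until a None sentinel arrives."""
    while True:
        uid = tasks.get()
        if uid is None:
            break
        results.put(fake_fetch(uid))


def crawl(uids, n_workers=3):
    """Distribute uids across n_workers threads and collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for uid in uids:
        tasks.put(uid)
    for _ in threads:
        tasks.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()

    # All workers have exited, so the results queue is no longer being written.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected, key=lambda r: r["uid"])


if __name__ == "__main__":
    for row in crawl(["1001", "1002", "1003"]):
        print(row)
```

In the real system the queue lives in Redis rather than in process memory, which is what lets workers on different machines share it.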

Setting the system up requires configuring a YAML file with your MySQL and Redis connection details, Weibo account credentials, and notification email settings. You then create the database tables, optionally start a small Django-based web interface for managing crawl targets, and launch one or more Celery workers. A separate Celery beat process handles periodic tasks such as refreshing login cookies, which Weibo invalidates every 24 hours.
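As a rough picture of what that configuration has to cover, here is the same information as a Python dict with a small completeness check. Every key name and value is hypothetical, and the project's actual YAML layout may differ. The last part shows a Celery-style beat schedule entry. Celery accepts a `timedelta` as a `schedule` value, but the task name `tasks.login.refresh_cookies` and the 20-hour interval are assumptions, chosen only so the refresh runs before the 24-hour cookie expiry.

```python
from datetime import timedelta

# Hypothetical layout; the real YAML file's keys may be named differently.
CONFIG = {
    "mysql": {"host": "127.0.0.1", "port": 3306, "user": "crawler",
              "password": "change-me", "db_name": "weibo"},
    "redis": {"host": "127.0.0.1", "port": 6379, "password": ""},
    "weibo_accounts": [{"name": "account@example.com", "password": "change-me"}],
    "email": {"server": "smtp.example.com", "port": 587,
              "from": "crawler@example.com", "to": "ops@example.com"},
}

REQUIRED_SECTIONS = ("mysql", "redis", "weibo_accounts", "email")


def missing_sections(config):
    """Return the required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not config.get(name)]


# Cookie lifetime as stated in the text above.
COOKIE_LIFETIME = timedelta(hours=24)

# Shape of a Celery beat schedule entry; task name and interval are assumed.
BEAT_SCHEDULE = {
    "refresh-login-cookies": {
        "task": "tasks.login.refresh_cookies",  # hypothetical task name
        "schedule": timedelta(hours=20),        # must stay below COOKIE_LIFETIME
    },
}

if __name__ == "__main__":
    print("missing sections:", missing_sections(CONFIG))
```

Checking for missing sections before launching any worker surfaces configuration mistakes early, rather than as a failed database or broker connection in the middle of a crawl.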

Because it runs as separate workers, you can spread the load across multiple machines simply by installing the dependencies on each machine and pointing them at the same Redis and MySQL instances. The project includes rate-limiting controls in its configuration file, and the authors ask users to keep crawl frequency reasonable to avoid disrupting the Weibo platform.
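The project's own rate-limiting options are not shown here, so the following is only a generic sketch of the idea: enforce a minimum gap between consecutive requests within one worker process. The class name `MinIntervalLimiter` and its parameters are made up for illustration. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive requests in one process."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Sleep until min_interval has elapsed since the last release.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval=0.5)
    for page in range(3):
        slept = limiter.wait()
        print(f"page {page}: slept {slept:.2f}s before requesting")
```

A limiter like this is per process, so with several workers the aggregate request rate grows with the number of workers. Celery's built-in per-task `rate_limit` option is likewise applied per worker instance rather than globally, so the per-worker budget needs to shrink as machines are added.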

Where it fits