MediaCrawler

Python · ★ 52k · updated 3d ago

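Xiaohongshu note and comment crawler, Douyin video and comment crawler, Kuaishou video and comment crawler, Bilibili video and comment crawler, Weibo post and comment crawler, Baidu Tieba post and comment-reply crawler, Zhihu Q&A article and comment crawler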

Python tool for scraping public posts, videos, and comments from major Chinese social media platforms such as Douyin, Xiaohongshu, Bilibili, and Weibo using browser automation.

Python · Playwright · FastAPI · Node.js · Chrome DevTools Protocol
Setup: moderate · Complexity: 3/5

MediaCrawler is a Python tool for scraping publicly available content from major Chinese social media platforms. It supports collecting posts, videos, and comments from Xiaohongshu (RedNote), Douyin (Chinese TikTok), Kuaishou (a short-video app), Bilibili (a video platform), Weibo, Tieba (a Chinese forum site), and Zhihu (a Q&A platform similar to Quora). The tool can search by keyword, crawl specific post IDs, fetch comments and replies, and pull content from specific creator pages.

The core technical approach relies on browser automation using Playwright, a library that controls a real web browser programmatically. Manually reverse-engineering each platform's API encryption is complex and fragile, so the tool instead logs into the platform through the browser, maintains the authenticated session, and runs JavaScript within that browser context to extract the signed request parameters. Because no encrypted API signature has to be cracked, the tool is easier to maintain. By default it connects to an already-open Chrome browser using the Chrome DevTools Protocol (CDP), which lets it reuse your existing login state and cookies and reduces the chance of the platform detecting automated activity.
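The CDP approach described above can be sketched roughly as follows. This is a minimal illustration using Playwright's public sync API, not MediaCrawler's actual code: the helper names cdp_endpoint and evaluate_in_open_chrome, the port 9222, and the window.hypotheticalSign function mentioned in the comments are all assumptions made for this example. It assumes Chrome was started with the --remote-debugging-port flag and that you are already logged in to the target platform in that browser.

```python
from typing import Any, Optional


def cdp_endpoint(host: str = "localhost", port: int = 9222) -> str:
    """Build the HTTP endpoint of a Chrome started with --remote-debugging-port.

    The default port 9222 is an assumption for this sketch.
    """
    return f"http://{host}:{port}"


def evaluate_in_open_chrome(
    js_function: str, arg: Any = None, endpoint: Optional[str] = None
) -> Any:
    """Attach to an already-running Chrome and run JavaScript in one of its pages."""
    # Imported lazily so cdp_endpoint() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Attach to the existing browser instead of launching a fresh one,
        # so its cookies and login state are reused.
        browser = p.chromium.connect_over_cdp(endpoint or cdp_endpoint())
        context = browser.contexts[0]  # the browser's default context
        page = context.pages[0] if context.pages else context.new_page()
        # The JavaScript runs inside the page, with access to whatever
        # the site itself has loaded there.
        return page.evaluate(js_function, arg)


# Hypothetical usage (window.hypotheticalSign is a placeholder, not a real
# platform function; the real signing entry point differs per site):
#   title = evaluate_in_open_chrome("() => document.title")
#   signed = evaluate_in_open_chrome(
#       "([url, body]) => window.hypotheticalSign(url, body)",
#       ["/api/search", {"keyword": "example"}],
#   )
```

The point of the design is that the signing logic stays in the page's own JavaScript, so the Python side only passes arguments in and reads the result back.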

The README carries a clear disclaimer stating the tool is intended for learning and research only, not commercial use or large-scale scraping, and links to documented cases of illegal scraping activity in China.

You would use this repository if you are a researcher studying social media trends, a data analyst gathering public sentiment data from Chinese platforms, or a developer learning how browser-based scraping works. Data can be exported to CSV, JSON, Excel, SQLite, or MySQL.
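To make the export step concrete, here is a small sketch that writes the same records to CSV, JSON, and SQLite using only the Python standard library. It is not the tool's own storage code: the field names, the table name records, and the function names are hypothetical, chosen only to show the shape of the output.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Hypothetical record schema; the real tool's fields are not described here.
FIELDS = ["platform", "post_id", "author", "text"]


def export_csv(records, path):
    """Write records as CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def export_json(records, path):
    """Write records as a JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def export_sqlite(records, path):
    """Append records to a 'records' table, creating it if needed."""
    conn = sqlite3.connect(str(path))
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(platform TEXT, post_id TEXT, author TEXT, text TEXT)"
        )
        conn.executemany(
            "INSERT INTO records VALUES (:platform, :post_id, :author, :text)",
            records,
        )
        conn.commit()
    finally:
        conn.close()
```

Each function takes a list of dictionaries with the keys in FIELDS, so one crawl result can be written to several formats without conversion.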

The tech stack is Python (3.11 recommended) with Playwright for browser automation and Node.js as an optional dependency for JavaScript execution. A simple web UI is also included, built with a FastAPI backend.

Where it fits

Collect public posts and videos from Chinese social media platforms for research or trend analysis.
Gather user comments and sentiment data from multiple platforms to understand audience reactions.
Learn how browser-based web scraping works by studying the Playwright automation approach.
Export social media data to CSV, JSON, or database formats for further analysis.
