spider-flow

Java · ★ 11k · updated 3y ago

A new-generation web scraping platform: define scraping flows graphically and build a scraper without writing any code.

A no-code web scraping platform where you build scrapers by drawing a flowchart. Handles dynamic pages, proxy rotation, database storage, scheduling, and OCR without writing code.

Java · Spring Boot · Selenium · Redis · MongoDB · MySQL · setup: moderate · complexity 3/5

spider-flow is a visual web scraping platform that lets you build scrapers by drawing a flowchart rather than writing code. You connect blocks in a diagram to define what the scraper should fetch, what data to extract, and where to store the results. The README is written in Chinese, but the features are documented in a structured list.

The platform can extract data from web pages using several methods: XPath (path expressions that select elements by their location and attributes in the HTML tree), CSS selectors (targeting elements by tag, class, ID, or attribute), JsonPath (path expressions for JSON data), and regular expressions. It handles pages that load their content dynamically through JavaScript or AJAX requests, not just static HTML. Proxy support is included, and cookies are managed automatically.
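To make the extraction methods concrete, here is a minimal Python sketch of three of them applied by hand. It is not spider-flow code: in spider-flow these expressions are configured in the flowchart rather than written as a program. The sample page, the sample JSON response, and all function names are hypothetical. The standard library's ElementTree supports only a limited XPath subset, and plain dictionary access stands in for a JsonPath expression such as $.data.items[*].id.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical sample inputs for illustration; not taken from spider-flow.
PAGE = """<html><body>
<div class="item"><a href="/p/1">First post</a><span>12 views</span></div>
<div class="item"><a href="/p/2">Second post</a><span>7 views</span></div>
</body></html>"""

API_RESPONSE = (
    '{"data": {"items": ['
    '{"id": 1, "title": "First post"}, '
    '{"id": 2, "title": "Second post"}]}}'
)


def extract_titles_xpath(page):
    """Select link text with an XPath expression.

    ElementTree only parses well-formed XML, so this works for the
    sample above but not for arbitrary real-world HTML.
    """
    root = ET.fromstring(page)
    return [a.text for a in root.findall(".//div[@class='item']/a")]


def extract_views_regex(page):
    """Pull the view counts out of the raw text with a regular expression."""
    return [int(n) for n in re.findall(r"(\d+) views", page)]


def extract_ids_json(body):
    """Walk a JSON document; a stand-in for JsonPath $.data.items[*].id."""
    return [item["id"] for item in json.loads(body)["data"]["items"]]


if __name__ == "__main__":
    print(extract_titles_xpath(PAGE))
    print(extract_views_regex(PAGE))
    print(extract_ids_json(API_RESPONSE))
```

XPath and CSS selectors both address elements in the parsed document tree, while a regular expression works on the raw text, which makes it useful when the value is embedded in a sentence or a script block.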

Scraped data can be saved directly to a database using standard SQL operations (select, insert, update, delete), or written to files. Multiple database connections can be configured. A task monitoring panel and log viewer let you track what scrapers are running and what happened during each run. The platform also exposes an HTTP API so other systems can trigger scraper jobs programmatically.
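The four SQL operations mentioned above can be sketched as follows. This is an illustration under stated assumptions, not spider-flow's own storage code: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a configured connection such as MySQL, and the article table, its columns, and the store_and_query function are hypothetical.

```python
import sqlite3


def store_and_query(rows):
    """Insert scraped rows, update one, delete one, and select what remains.

    `rows` is a list of (url, title) tuples. The table layout is a
    hypothetical example, not a schema used by spider-flow.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
    try:
        conn.execute("CREATE TABLE article (url TEXT PRIMARY KEY, title TEXT)")
        # INSERT: parameterized statements keep scraped text from being read as SQL.
        conn.executemany("INSERT INTO article (url, title) VALUES (?, ?)", rows)
        # UPDATE: overwrite the title of the first row.
        conn.execute(
            "UPDATE article SET title = ? WHERE url = ?",
            ("Updated title", rows[0][0]),
        )
        # DELETE: drop the last row.
        conn.execute("DELETE FROM article WHERE url = ?", (rows[-1][0],))
        conn.commit()
        # SELECT: read back what is left, in a stable order.
        return conn.execute(
            "SELECT url, title FROM article ORDER BY url"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("/p/1", "First"), ("/p/2", "Second"), ("/p/3", "Third")]
    print(store_and_query(sample))
```

Scraped text is untrusted input, which is why the sketch passes values as statement parameters instead of building SQL strings by concatenation.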

A plugin system extends the core platform. Available plugins include Selenium (for browser automation), Redis (for caching or queuing), MongoDB, cloud object storage, an IP proxy pool, an OCR plugin for reading text from images, and an email plugin. Custom functions and custom executor plugins can also be written.

The project includes a disclaimer stating it should not be used for illegal purposes or in ways that violate websites' terms of service. It requires Java 1.8 or higher and is licensed under MIT.

Where it fits