gitmyhub

awesome-crawler

★ 7.2k updated 2y ago

A collection of awesome web crawlers and spiders in different languages

A curated reference list of web crawling and scraping tools organized by programming language. Browse it to find the right library before starting a data collection project.

setup: easy, complexity 1/5

awesome-crawler is a curated list of web crawling and web scraping tools organized by programming language. "Web crawling" means automatically visiting web pages to collect information from them. A "spider" or "crawler" is a program that follows links from one page to the next, gathering data along the way. "Web scraping" is the narrower task of extracting specific content from pages you have already retrieved.
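To make the crawling and scraping distinction concrete, here is a minimal sketch using only the Python standard library. It is not taken from any tool on the list. The names `PageParser`, `crawl`, and `FAKE_SITE` are hypothetical, and the "site" is an in-memory dictionary standing in for real HTTP responses so the example runs without network access. The crawl step follows links breadth-first; the scrape step pulls one specific field (the page title) out of each retrieved page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Scraping step: collect link targets and the <title> text of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Crawling step: follow links breadth-first, return {url: title}."""
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)       # fetch is injected; returns None if missing
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        titles[url] = parser.title.strip()
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return titles


# Hypothetical in-memory "site" used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.test/b": '<title>Page B</title><a href="/a">A</a><a href="/missing">x</a>',
}

result = crawl("http://example.test/", FAKE_SITE.get)
```

In a real project the `fetch` argument would wrap an HTTP client, and a production crawler would also need to respect robots.txt, rate limits, and error handling, which is the kind of work the frameworks on the list take care of.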

The list covers tools for Python, Java, C#, JavaScript, PHP, C++, C, Ruby, Rust, R, Erlang, Perl, Go, and Scala. Each entry is a link to the project on GitHub or its official site, paired with a one-line description. For frameworks that have notable plugins or extensions, the list nests those under the parent tool.

Python has the most entries, which reflects how common that language is for data collection work. Scrapy is listed first in that section: it is a full framework for building crawlers, and several of its listed extensions handle distributing crawl jobs across multiple machines or managing crawl state with a database. Other Python entries cover lighter-weight options: simple HTTP clients, async crawlers, newspaper and article extraction libraries, and visual tools that let non-programmers define scraping rules through a browser interface.

The JavaScript section features Node.js crawlers as well as tools that use a headless Chrome browser to handle pages where content loads via JavaScript rather than arriving in the initial HTML response.

This repository is a reference guide, not software you install or run directly. It is most useful at the start of a data collection project, when you want to survey what exists in a specific language before committing to a library. The list does not rank the entries or express a preference among them.

Where it fits