webmagic

Java ★ 12k updated 6mo ago

A scalable web crawler framework for Java.

WebMagic is a Java library for building web crawlers that handles page fetching, link following, multi-threading, and data extraction so you only write the rules for what to collect.

Java · Maven · XPath · CSS selectors · setup: easy · complexity 2/5

A web crawler is a program that automatically visits web pages and collects information from them. WebMagic aims to make writing one in Java straightforward: it handles the repetitive parts of the job so developers can focus on what data they want to extract.

The framework covers the full cycle of crawling: it fetches web pages, tracks which URLs to visit next, extracts specific pieces of content from each page, and saves the results to a destination you choose. It processes many pages in parallel on multiple threads, so the developer does not have to manage that concurrency by hand.
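The fetch, schedule, extract, and store cycle described above can be sketched with the JDK alone. This is a simplified illustration under stated assumptions, not WebMagic's implementation: `MiniCrawler`, its `crawl` method, and the `fetcher` function are hypothetical names, links and titles are found with naive regular expressions, and pages come from whatever function the caller supplies rather than from real HTTP requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a crawl loop; not part of WebMagic.
final class MiniCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Crawls from startUrl and returns url -> page title for every page the fetcher can supply. */
    static Map<String, String> crawl(String startUrl, Function<String, String> fetcher, int threads) {
        Map<String, String> results = new ConcurrentHashMap<>(); // "store" step
        Set<String> seen = ConcurrentHashMap.newKeySet();         // avoids visiting a URL twice
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> frontier = new ArrayList<>(List.of(startUrl));
            seen.add(startUrl);
            while (!frontier.isEmpty()) {
                Set<String> next = ConcurrentHashMap.newKeySet();
                List<Callable<Void>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> {
                        String html = fetcher.apply(url);          // fetch
                        if (html == null) {
                            return null;                           // page unavailable, skip it
                        }
                        Matcher title = TITLE.matcher(html);
                        if (title.find()) {
                            results.put(url, title.group(1));      // extract and store
                        }
                        Matcher link = LINK.matcher(html);
                        while (link.find()) {
                            if (seen.add(link.group(1))) {
                                next.add(link.group(1));           // schedule unseen URLs
                            }
                        }
                        return null;
                    });
                }
                pool.invokeAll(tasks);                             // pages in one frontier run in parallel
                frontier = new ArrayList<>(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                    // preserve interrupt status and stop
        } finally {
            pool.shutdown();
        }
        return new TreeMap<>(results);
    }
}
```

Calling `MiniCrawler.crawl("/a", pages::get, 4)` with an in-memory `Map<String, String>` of URL to HTML is enough to try it out. A framework such as WebMagic takes over exactly these moving parts (downloading, URL de-duplication, thread management), leaving only the extraction rules to the developer.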

Developers interact with WebMagic in two main ways. The first is to write a class that implements a provided interface (`PageProcessor`), in which you specify which links to follow and what data to pull from each page. The second is an annotation-based approach: you define a plain Java object and mark its fields with annotations that describe how to extract each value from the page's HTML. The README shows both styles with example code that crawls GitHub repository pages.
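To make the annotation-based idea concrete, here is a minimal JDK-only sketch of how field annotations can drive extraction through reflection. It deliberately does not use WebMagic's own annotations: `RegexField`, `RepoInfo`, and `AnnotationExtractor` are hypothetical names invented for this illustration, and the extraction rule is a plain regular expression whose first capture group becomes the field value.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical annotation (not WebMagic's): the regex's first group becomes the field value. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RegexField {
    String value();
}

/** A plain object whose fields declare their own extraction rules. */
final class RepoInfo {
    @RegexField("<h1>(.*?)</h1>")
    String name;

    @RegexField("<span class=\"stars\">(\\d+)</span>")
    String stars;
}

/** Populates any annotated object from a page's HTML using reflection. */
final class AnnotationExtractor {
    static <T> T extract(String html, Class<T> type) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T target = constructor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                RegexField rule = field.getAnnotation(RegexField.class);
                if (rule == null) {
                    continue;                      // field has no extraction rule
                }
                Matcher matcher = Pattern.compile(rule.value()).matcher(html);
                if (matcher.find()) {
                    field.setAccessible(true);
                    field.set(target, matcher.group(1));
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }
}
```

With this in place, `AnnotationExtractor.extract(html, RepoInfo.class)` returns a populated `RepoInfo`. The appeal of the style is that the data class doubles as the extraction specification, so adding a field means adding one annotated line rather than editing crawl logic.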

The extraction tools in WebMagic support XPath selectors (a standard way to pick specific elements from HTML), regular expressions, and CSS selectors. Its overall architecture was inspired by Scrapy, a Python crawling framework.
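For readers unfamiliar with these selector types, the following sketch shows XPath and regular-expression extraction using only classes that ship with the JDK. It is not WebMagic's selector API: `SelectorDemo`, `byXPath`, and `byRegex` are hypothetical helpers, and the JDK's XML parser only accepts well-formed markup, which real-world HTML often is not. CSS selectors are left out because the JDK has no built-in support for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helpers for illustration; not part of WebMagic.
final class SelectorDemo {
    /** Returns the text content of every node matched by an XPath expression. */
    static List<String> byXPath(String xhtml, String expression) {
        try {
            // Assumption: the input is well-formed XML/XHTML.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    /** Returns the first capture group of every regex match in the text. */
    static List<String> byRegex(String text, String regex) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            out.add(matcher.group(1));
        }
        return out;
    }
}
```

As a rule of thumb, XPath suits values tied to the page's structure (for example `//div[@class='repo']/a/@href`), while a regular expression is often simpler for patterns inside a URL or a block of text.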

WebMagic is intended to be easy to integrate into existing Java projects. It is added as a dependency through Maven, a widely used Java build tool. The project is licensed under the Apache 2.0 license, which allows free use in both personal and commercial projects. Documentation and additional examples are available on the project's website.

Where it fits

Scrape product listings or article content from a website into a structured Java object using annotation-based field extraction.
Build a multi-threaded crawler that follows pagination links across many pages and stores results automatically.
Extract specific HTML elements from pages using XPath or CSS selectors without writing custom HTML-parsing code.
