gitmyhub

requests-html

Python · ★ 14k · updated 2y ago

Pythonic HTML Parsing for Humans™

Requests-HTML is a Python library for fetching web pages and extracting data from them using CSS selectors or XPath, with support for JavaScript rendering via headless Chromium and async requests for scraping multiple pages at once.

Python · Chromium · pip · setup: moderate · complexity 2/5

Requests-HTML is a Python library for fetching web pages and pulling specific data out of them. It extends the popular requests HTTP library with the ability to parse the HTML that comes back from a web request, which is useful for scraping information from websites.

The library handles several things that make web scraping tricky. It automatically follows redirects, maintains cookies between requests, pools connections for efficiency, and sends a browser-like user-agent header so servers treat the requests as though they came from a real web browser. You get these behaviors without any extra configuration.

For extracting data from a page, the library supports two query styles. The first is CSS selectors, which work much like jQuery and let you find elements by tag name, class, ID, or combinations of these. The second is XPath, an older path-based query language that is more verbose but can express queries CSS selectors cannot, such as matching on an element's text or selecting attribute values directly. Once you find an element, you can read its text, access its attributes, or search within it for sub-elements.

One notable feature is JavaScript rendering. Many modern websites load their content dynamically via JavaScript after the initial HTML arrives. Requests-HTML can run that JavaScript by launching a headless Chromium browser in the background, waiting for the scripts to execute, and then parsing the resulting page. Rendering is opt-in: no browser is launched unless you call it explicitly, and the first call downloads Chromium.

The library also supports async requests, meaning you can fetch several pages at the same time rather than waiting for each one to finish before starting the next. Because most scraping time is spent waiting on the network, overlapping requests can cut the total time substantially when you need to scrape many URLs.

Requests-HTML is part of the Python Software Foundation's GitHub organization and was created by the author of the requests library. It is available via pip and targets Python 3.

Where it fits