autoscraper

Python ★ 7.2k updated 1y ago

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

A Python library that learns to scrape websites from just a few examples of the data you want. Show it one sample value and it finds all matching items on the page, with no HTML knowledge or CSS selectors needed.

Python · setup: easy · complexity 2/5

AutoScraper is a Python library that makes collecting data from websites much simpler than traditional scraping tools. You give it a web page address and one or a few examples of the data you want to pull, and it learns the underlying structure of the page to find similar items. Once trained, you can point it at other pages of the same type and it will return matching content without additional setup.

The core idea is that you do not need to inspect HTML source code or write custom rules for each website. You pick a sample: a post title, a stock price, a link. The library figures out where that type of content lives on the page and finds all items that match the same pattern. There are two modes: one that returns all similar items on a page, and one that returns the exact same fields in the exact same order each time, which is useful when you want consistent structured output from multiple pages.
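The two modes can be sketched as follows. This is a minimal illustration rather than the README's own example: it trains on an inline HTML string instead of a live URL so it runs offline, and `SAMPLE_HTML` and `train_and_query` are hypothetical names introduced here. It assumes the `autoscraper` package is installed and that `build`, `get_result_similar`, and `get_result_exact` accept an `html` argument in place of a URL.

```python
# Hypothetical sample page; a real run would normally pass a URL instead.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="post">First post</li>
    <li class="post">Second post</li>
    <li class="post">Third post</li>
  </ul>
</body></html>
"""


def train_and_query(html, sample_value):
    """Train on one sample value, then query the page in both modes."""
    # Imported here so this module still loads where autoscraper is missing.
    from autoscraper import AutoScraper  # third-party: pip install autoscraper

    scraper = AutoScraper()
    # Learn where the sample value lives in the page structure.
    scraper.build(html=html, wanted_list=[sample_value])

    # Mode 1: every item that matches the learned pattern.
    similar = scraper.get_result_similar(html=html)
    # Mode 2: only the trained fields, in the trained order.
    exact = scraper.get_result_exact(html=html)
    return similar, exact
```

Calling `train_and_query(SAMPLE_HTML, "First post")` returns both result lists, which makes the difference between the modes easy to inspect on a page you control.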

Once a scraper is trained on a site, you can save it to a file and load it later, so you do not have to repeat the learning step. Custom request settings like proxy servers or HTTP headers can be passed in, which helps when a site requires specific configurations.
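A rough sketch of that workflow follows, assuming the library's `save` and `load` methods and the `request_args` parameter of `build`, whose contents are handed to the underlying HTTP request. The helper names (`request_settings`, `train_and_save`, `load_trained`) are hypothetical and not part of AutoScraper.

```python
def request_settings(proxy_url=None, user_agent=None):
    """Build a request_args dict (hypothetical helper, not an AutoScraper API)."""
    args = {}
    if proxy_url:
        # Route both plain and TLS traffic through the same proxy.
        args["proxies"] = {"http": proxy_url, "https": proxy_url}
    if user_agent:
        args["headers"] = {"User-Agent": user_agent}
    return args


def train_and_save(url, wanted_list, path, request_args=None):
    """Train a scraper on one page and write the learned rules to `path`."""
    # Imported here so this module still loads where autoscraper is missing.
    from autoscraper import AutoScraper  # third-party: pip install autoscraper

    scraper = AutoScraper()
    # Pass a copy so the caller's dict is left untouched.
    scraper.build(url, wanted_list, request_args=dict(request_args or {}))
    scraper.save(path)
    return scraper


def load_trained(path):
    """Restore a previously saved scraper without repeating the learning step."""
    from autoscraper import AutoScraper

    scraper = AutoScraper()
    scraper.load(path)
    return scraper
```

A later session can then call `load_trained(path).get_result_similar(other_url)` on another page of the same type.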

The tool is installable with pip and requires Python 3. The README provides short, working code examples covering common use cases: pulling related article titles from a forum, retrieving a live financial figure, and extracting metadata from a repository page. There is also a link to a tutorial showing how to combine AutoScraper with a web server to turn any website into a simple data API.
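The data-API idea can be illustrated with a small sketch. It is not the linked tutorial's code: it swaps in Python's standard library `http.server` to keep dependencies minimal, and `SCRAPER_FILE`, `target_url_from`, `ScrapeHandler`, and `serve` are hypothetical names. It assumes a scraper was already trained and saved to `SCRAPER_FILE`, and it is meant for local experimentation only, since it fetches whatever URL the caller supplies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

SCRAPER_FILE = "trained-scraper.json"  # hypothetical path to a saved scraper


def target_url_from(request_path):
    """Return the ?url=... query parameter of a request path, or None."""
    values = parse_qs(urlparse(request_path).query).get("url")
    return values[0] if values else None


class ScrapeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = target_url_from(self.path)
        if target is None:
            self.send_error(400, "missing 'url' query parameter")
            return

        # Imported here so this module still loads where autoscraper is missing.
        from autoscraper import AutoScraper  # third-party: pip install autoscraper

        scraper = AutoScraper()
        scraper.load(SCRAPER_FILE)
        payload = {"result": scraper.get_result_similar(target)}
        body = json.dumps(payload).encode("utf-8")

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve scrape results as JSON on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ScrapeHandler).serve_forever()
```

After calling `serve()`, a request such as `GET /?url=<percent-encoded page address>` would return the scraped items as a JSON list under the `result` key.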

The project is open source, with its code on GitHub and its package published on PyPI. The README is concise and the examples cover the main functionality clearly.

Where it fits

Scrape all article titles from a forum by giving the library one example title, no CSS selectors or XPath needed.
Pull live financial data from a webpage and turn it into a simple API by combining AutoScraper with a web server.
Save a trained scraper to disk and reuse it on similar pages later without retraining.
