gitmyhub

crawlee-python

Python · ★ 9.2k · updated 1d ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Parsel, BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

A Python library for building web scrapers that visit websites and collect structured data, with built-in support for JavaScript-heavy pages via a real browser, proxy rotation, and bot-detection evasion.

Python · BeautifulSoup · Playwright · Parsel · setup: easy · complexity 2/5

Crawlee for Python is a library for building programs that automatically visit websites, collect information from them, and save it in a structured format. If you have ever wanted to pull data from a website without copying it by hand, this is the kind of tool that does that work for you.

The library gives you two main ways to crawl. The first pairs plain HTTP requests with an HTML parser called BeautifulSoup; it is fast and works well for pages whose content is already present in the HTML source. The second drives a real browser through a tool called Playwright, either headless (no visible window) or headful, and suits pages that build their content with JavaScript after the page loads. You can also use Parsel as the parser, or work with raw HTTP responses, if your project has different needs.
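To make the difference between the two approaches concrete, here is a minimal sketch of what a parser-only crawl can and cannot see. It deliberately uses Python's standard-library `html.parser` instead of BeautifulSoup or Crawlee, so it is not Crawlee's API; the `TitleAndLinkParser` class and the `STATIC_HTML` page are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href> found in static HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; everything else is ignored.
        if self._in_title:
            self.title += data


# Hypothetical page source; a real crawler would download this over HTTP.
STATIC_HTML = """
<html>
  <head><title>Example shop</title></head>
  <body>
    <a href="/products/1">First product</a>
    <a href="/products/2">Second product</a>
    <div id="reviews"></div>
    <script>/* reviews are injected here by JavaScript at runtime */</script>
  </body>
</html>
"""

parser = TitleAndLinkParser()
parser.feed(STATIC_HTML)
print(parser.title)
print(parser.links)
```

The title and product links are in the source, so the parser finds them. The `reviews` container is empty until a script runs, which is the kind of content that needs the browser-based approach.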

A key feature is that Crawlee tries to make your crawlers look like regular human visitors rather than automated bots, which helps them work reliably against sites that normally block automated requests. It also handles proxy rotation, meaning it can send requests through different network addresses to further reduce the chance of being blocked.
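The idea behind proxy rotation can be sketched in a few lines. This is an illustration of the concept only, assuming a simple round-robin strategy; the source does not say which strategy Crawlee uses, and the `RoundRobinProxies` class and the proxy URLs below are hypothetical.

```python
from itertools import cycle

# Hypothetical proxy endpoints, used only for illustration.
PROXY_URLS = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
    "http://proxy-3.example.com:8000",
]


class RoundRobinProxies:
    """Hands out proxies in a fixed repeating order (an assumed strategy)."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(proxy_urls)

    def next_proxy(self):
        return next(self._pool)


rotation = RoundRobinProxies(PROXY_URLS)
targets = [f"https://example.com/page/{n}" for n in range(1, 5)]

# Pair each target URL with the next proxy in the rotation.
assignments = [(url, rotation.next_proxy()) for url in targets]
for url, proxy in assignments:
    print(f"{url} via {proxy}")
```

With three proxies and four requests, the fourth request wraps around to the first proxy, so no single address carries all the traffic.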

Setting up is straightforward. You install the package from PyPI, choose which extras you need (for example, adding Playwright support), and write a short script that tells Crawlee which URLs to start from and what data to collect. There is also a command-line tool that generates a starter project for you from a template, which can speed things up if you are new to the library.

The data you collect gets saved automatically to a local storage folder in a format you can open and process further. Common uses include gathering training data for AI models, building datasets for language model applications, and pulling product or research information from the web at scale. There is also a TypeScript version of the same library available separately for projects not using Python.
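As a rough picture of what saving scraped records to a local folder can look like, the sketch below writes one file per record and reads them back. The JSON format, the `datasets/default` folder layout, and the `save_records` and `load_records` helpers are all assumptions made for illustration; the description above does not specify the format, and this is not Crawlee's storage code.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, folder):
    """Write each record to its own numbered JSON file (assumed layout)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, record in enumerate(records, start=1):
        path = folder / f"{index:09d}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


def load_records(folder):
    """Read the records back in file-name order for further processing."""
    files = sorted(Path(folder).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in files]


# Hypothetical scraped items.
records = [
    {"url": "https://example.com/products/1", "title": "First product"},
    {"url": "https://example.com/products/2", "title": "Second product"},
]

# A temporary directory stands in for the crawler's storage folder.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "datasets" / "default"
    written = save_records(records, dataset_dir)
    loaded = load_records(dataset_dir)

print([p.name for p in written])
print(loaded == records)
```

One file per record keeps each item easy to inspect by hand and lets later steps load only what they need.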

Where it fits