crawlee

TypeScript ★ 24k updated 1d ago

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

A Node.js library that automates web scraping: it visits websites, extracts data, manages large URL queues, and rotates proxies to reduce the chance of being blocked.

TypeScript · Node.js · Playwright · Puppeteer · setup: moderate · complexity 3/5

Crawlee is a web scraping and browser automation library for Node.js. Web scraping means automatically visiting websites and extracting information from them — like prices, product listings, article text, or any other data you can see in a browser. Crawlee makes this easier by handling the repetitive, technical work for you.

The problem it solves is that scraping modern websites is hard: pages load content using JavaScript, websites detect and block automated requests, and managing a queue of thousands of URLs while handling errors and retries gets complex fast. Crawlee handles each of these. It can control real browsers (via Playwright or Puppeteer) to scrape JavaScript-heavy sites, or use fast HTTP requests for simpler pages. It rotates the proxies you configure to reduce the chance of being blocked, generates realistic browser fingerprints to appear human-like, manages a queue of URLs to visit, retries failed requests, and saves collected data to disk or cloud storage.
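To make the queue, retry, and proxy-rotation ideas above concrete, here is a minimal dependency-free sketch of such a crawl loop. It is a toy model for illustration only, not Crawlee's API or implementation: the names `crawl`, `Fetcher`, and `CrawlResult`, the round-robin rotation, and the `maxRetries` default are all assumptions made for this example.

```typescript
// Toy crawl loop: a de-duplicated URL queue, per-URL retries, and
// round-robin proxy rotation. All names here are hypothetical and are
// NOT part of Crawlee; a real crawler adds concurrency, backoff, and storage.

// A function that downloads one page through one proxy (injected by the caller).
type Fetcher = (url: string, proxy: string) => Promise<string>;

interface CrawlResult {
  pages: Map<string, string>; // url -> page body for successful fetches
  failed: string[];           // urls that exhausted their retries
}

async function crawl(
  startUrls: string[],
  proxies: string[],
  fetchPage: Fetcher,
  maxRetries = 2,
): Promise<CrawlResult> {
  if (proxies.length === 0) {
    throw new Error("at least one proxy is required");
  }

  // Build the queue, skipping URLs that were already enqueued.
  const queue: { url: string; attempts: number }[] = [];
  const seen = new Set<string>();
  for (const url of startUrls) {
    if (!seen.has(url)) {
      seen.add(url);
      queue.push({ url, attempts: 0 });
    }
  }

  const pages = new Map<string, string>();
  const failed: string[] = [];
  let proxyIndex = 0;

  while (queue.length > 0) {
    const item = queue.shift()!;
    // Round-robin: every request uses the next proxy in the list.
    const proxy = proxies[proxyIndex % proxies.length]!;
    proxyIndex += 1;

    try {
      pages.set(item.url, await fetchPage(item.url, proxy));
    } catch {
      item.attempts += 1;
      if (item.attempts <= maxRetries) {
        queue.push(item); // re-enqueue at the back for another try
      } else {
        failed.push(item.url); // give up after maxRetries retries
      }
    }
  }

  return { pages, failed };
}
```

Passing the fetch function in as a parameter keeps the loop independent of how pages are downloaded, so the same queue logic could sit in front of a plain HTTP client or a browser-driven fetch.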

You would use this if you need to extract data from websites at scale — for example, to build a price comparison tool, aggregate news articles, collect training data for AI, or monitor competitor websites. It works in JavaScript and TypeScript and runs on Node.js. It is developed by Apify, a company that provides cloud infrastructure for running scrapers, though Crawlee itself runs anywhere.

Where it fits

Build a price comparison tool that scrapes product listings from multiple retail websites automatically
Collect news articles from dozens of sites automatically for a content aggregation feed
Gather training data for AI models from public web pages at scale
Monitor competitor websites and get alerted when prices or content change
