gitmyhub

portia

Python ★ 9.5k updated 2y ago

Visual scraping for Scrapy

Portia is a visual, no-code web scraping tool: you click on page elements to teach it what data to extract, and it then crawls similar pages automatically. It runs locally via Docker and is built on Scrapy.

Python · Scrapy · Docker · JavaScript · setup: moderate · complexity 3/5

Portia is a visual web scraping tool that lets you pull data from websites without writing any code. You point it at a web page, click on the pieces of information you want to collect, and Portia figures out from those annotations how to extract the same kind of data from other pages that follow a similar structure. It is built on top of Scrapy, a Python-based web crawling library, but Portia is intended for people who do not want to write Python or deal with the technical details of a crawler.
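The idea of annotating one page and reusing those annotations on similar pages can be illustrated with a small sketch. This is not Portia's actual extraction algorithm, which the README does not describe. It assumes that an annotation boils down to a field name paired with a path expression, and the names `ANNOTATIONS`, `extract`, `PAGE_A`, and `PAGE_B` are hypothetical. It uses only the Python standard library and well-formed markup to stay self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotations: field name -> ElementTree path expression.
# In Portia these would come from clicking elements in the browser UI;
# here they are written by hand purely for illustration.
ANNOTATIONS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def extract(page_markup, annotations):
    """Apply one set of annotations to any page with a similar structure."""
    root = ET.fromstring(page_markup)
    item = {}
    for field, path in annotations.items():
        node = root.find(path)
        # Missing elements yield None instead of raising an error.
        item[field] = node.text.strip() if node is not None and node.text else None
    return item


# Two made-up pages that share a layout but carry different data.
PAGE_A = "<html><body><h1 class='title'>Blue Mug</h1><span class='price'>$8</span></body></html>"
PAGE_B = "<html><body><h1 class='title'>Red Bowl</h1><span class='price'>$12</span></body></html>"

if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(extract(page, ANNOTATIONS))
```

The same `ANNOTATIONS` mapping works on both pages because they share a structure, which is the property a visual scraper relies on. Real pages are rarely well-formed XML, so a practical tool would need a tolerant HTML parser and more robust matching than fixed paths.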

The tool runs as a local web application that you access through a browser. The quickest way to get it running is via Docker: one command pulls the official image and starts the server on port 9001. You can also use Docker Compose by cloning the repository and running a single command from the project root. The documentation, hosted on Read the Docs, covers those steps in detail and describes alternatives for setups without Docker.
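The Docker route described above could be scripted roughly as follows. This is a sketch under stated assumptions: the image name `scrapinghub/portia` and the container path `/app/data/projects` are not given in this summary and should be checked against the project's documentation. The helper `build_portia_command` and the folder `~/portia_projects` are hypothetical. Only port 9001 comes from the text.

```python
import subprocess
from pathlib import Path


def build_portia_command(projects_dir, host_port=9001, image="scrapinghub/portia"):
    """Build a `docker run` argument list for a local Portia server.

    Assumptions (verify against the official docs): the image name and
    the in-container projects path /app/data/projects.
    """
    projects = Path(projects_dir).expanduser()
    return [
        "docker", "run", "--rm",
        # Mount a host folder so projects persist after the container exits.
        "-v", f"{projects}:/app/data/projects:rw",
        # Publish the server's port 9001 on the chosen host port.
        "-p", f"{host_port}:9001",
        image,
    ]


if __name__ == "__main__":
    command = build_portia_command("~/portia_projects")
    print("Running:", " ".join(command))
    # Requires Docker to be installed and running; blocks until the
    # container stops. The UI should then be reachable on localhost:9001.
    subprocess.run(command, check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.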

The README is brief and focuses almost entirely on getting the server started. It does not describe the full set of features, explain how annotation works in depth, or mention pricing. The project was created by Scrapinghub, a company that builds web scraping products and infrastructure, and Portia appears to be the open-source self-hosted version of a visual scraping product they also offered as a hosted cloud service. The README does not indicate whether the open-source version is still actively maintained or when it last received updates.

Where it fits