pup

HTML ★ 8.4k updated 2y ago

Parsing HTML at the command line

pup is a command-line tool for pulling data out of HTML pages using CSS selectors. You pipe HTML into it, write a selector, and get matching elements back as text, attributes, or JSON.

Go · setup: easy · complexity 1/5

pup is a command-line tool for extracting information from HTML pages. It reads HTML from standard input, applies the filters you specify, and prints the results to standard output. It was inspired by jq, a popular utility for working with JSON data in the terminal, and follows the same pattern of piping data through filters.

The filters are CSS selectors, the same patterns web developers write in stylesheets to target elements. If you can express something like "all links inside a table" as a selector in a stylesheet, you can pass the same selector to pup. You can select elements by tag name, by CSS class, by HTML ID, by attribute value, or by their position among siblings. pup supports a broad set of these selectors, including more advanced ones like ":contains" for finding elements by text content.
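The selector forms described above can be sketched against a small inline page. The sample HTML and the run_pup helper are made up for illustration and are not part of pup; the selectors themselves follow the CSS forms pup documents. The helper skips quietly when pup is not installed.

```shell
# Hypothetical sample page, used only for illustration.
sample_html='<html><body>
<table id="nav">
  <tr><td><a class="ext" href="https://example.com/docs">Docs</a></td></tr>
  <tr><td><a href="/blog">Blog</a></td></tr>
</table>
<p>Contact us</p>
</body></html>'

# Helper (not part of pup): pipe the sample page into pup with one selector.
# If pup is not on PATH, print a note to stderr and skip.
run_pup() {
  if ! command -v pup >/dev/null 2>&1; then
    echo "pup not found, skipping: $1" >&2
    return 0
  fi
  printf '%s\n' "$sample_html" | pup "$1"
}

run_pup 'table a'                # by tag name: links inside a table
run_pup 'a.ext'                  # by CSS class
run_pup '#nav'                   # by HTML ID
run_pup 'a[href="/blog"]'        # by attribute value
run_pup 'tr:first-child'         # by position among siblings
run_pup 'p:contains("Contact")'  # by text content
```

Each call prints the matching elements as HTML, since no output format is requested.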

Beyond returning matching HTML, pup offers a few output formats. You can extract just the plain text from matched elements, print the value of a specific attribute such as a URL or an ID, or convert the matched HTML into JSON. The JSON output includes the tag name, text content, and all attributes of each matched element, which makes it easy to pass the result into other tools like jq for further processing.
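A minimal sketch of the three output modes, using pup's text{}, attr{} and json{} display functions appended to a selector. The sample markup is invented for illustration, and the jq filter assumes that json{} emits an array of objects with each attribute as a top-level key, as in the examples in pup's README.

```shell
# Hypothetical sample markup, used only for illustration.
page='<ul><li><a id="home" href="/">Home</a></li><li><a id="docs" href="/docs">Docs</a></li></ul>'

# Run the examples only where pup is available.
if command -v pup >/dev/null 2>&1; then
  # Plain text of each matched element.
  printf '%s\n' "$page" | pup 'a text{}'

  # Value of one attribute, one per line.
  hrefs=$(printf '%s\n' "$page" | pup 'a attr{href}')
  printf '%s\n' "$hrefs"

  # Matched elements converted to JSON.
  printf '%s\n' "$page" | pup 'a json{}'

  # Hand the JSON to jq (assumption: attributes are top-level keys).
  if command -v jq >/dev/null 2>&1; then
    printf '%s\n' "$page" | pup 'a json{}' | jq -r '.[].href'
  fi
else
  echo "pup not found, skipping examples" >&2
fi
```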

Installation is straightforward. You can download a prebuilt binary from the releases page, install it with Homebrew on a Mac, or build it from source if you have Go installed. Once installed, the typical workflow is to pipe the output of a tool like curl into pup, followed by a selector to pull out what you need.

This is a focused utility with no server component, no configuration files, and no ongoing setup. You run it, pass it HTML, and get structured output back.

Where it fits

Scrape specific data from a webpage by piping curl output through pup with a CSS selector.
Extract all links from an HTML page and pass them into another shell tool for further processing.
Convert HTML table content to JSON using pup's JSON output mode and then filter it with jq.
Pull attribute values or text content from HTML inside a shell script without a full scraping framework.
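The first two use cases above might look like the following inside a shell script. This is only a sketch: the function name page_links is hypothetical, the URL is supplied by the caller, and both curl and pup are assumed to be installed.

```shell
# Hypothetical helper: print every link target found on the page at URL $1.
# curl flags: -s silent, -S still show errors, -L follow redirects.
page_links() {
  curl -sSL "$1" | pup 'a attr{href}'
}

# Usage (left as a comment because it needs network access and pup):
#   page_links https://example.com | sort -u
```

Because the output is one value per line, it can be passed directly to tools such as sort, grep, or xargs.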
