parser

JavaScript ★ 5.8k updated 1y ago

📜 Extract meaningful content from the chaos of a web page

A JavaScript library that takes a URL and returns just the article title, author, date, and clean body text, stripping ads, nav menus, and everything else a reader does not need.

JavaScript · Node.js · setup: easy · complexity 2/5

Postlight Parser is a JavaScript library that takes a URL and returns the meaningful content from that page as clean, structured data. Rather than getting back an entire web page filled with navigation menus, ads, and unrelated links, you receive only the parts a reader cares about: the article text, title, author name, publication date, a short excerpt, and the lead image URL. Fields the parser cannot find are returned as null.
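
As a rough illustration of coping with those null fields, the sketch below turns a result object into display-ready strings. The key names (`title`, `author`, `date_published`, `excerpt`, `lead_image_url`) are assumed to match the parser's output, the sample values are made up, and `summarizeArticle` is a hypothetical helper rather than part of the library.

```javascript
// Hypothetical helper: build display strings from a parser result.
// Assumes result keys named title, author, date_published, excerpt,
// and lead_image_url, any of which may be null.
function summarizeArticle(result) {
  // Keep only the YYYY-MM-DD part of an ISO timestamp, if one is present.
  const published = result.date_published
    ? new Date(result.date_published).toISOString().slice(0, 10)
    : 'unknown date';

  return {
    title: result.title ?? 'Untitled',
    byline: `${result.author ?? 'Unknown author'}, ${published}`,
    excerpt: result.excerpt ?? '',
    // Loose inequality treats both null and undefined as "missing".
    hasLeadImage: result.lead_image_url != null,
  };
}

// Sample object shaped like a parser result (values are invented).
const sample = {
  title: 'An Example Article',
  author: null,
  date_published: '2024-03-05T12:00:00.000Z',
  excerpt: null,
  lead_image_url: null,
};

console.log(summarizeArticle(sample));
```

Supplying fallbacks in one place keeps null checks out of the rendering or storage code that consumes the result.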

The main use is stripping noise from articles so the content can be displayed in a cleaner reading view, stored in a database, or processed further by another tool. The library powers a browser extension called Postlight Reader, which applies this extraction in real time to give a distraction-free reading mode on any site.

You can request the extracted content in three formats: HTML (the default), Markdown, or plain text. Custom request headers can be passed along for pages that require cookies or a specific User-Agent string. The parser can also work on HTML you have already fetched yourself, rather than fetching the URL on its own.
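
A minimal sketch of how these options might be assembled is shown below. It assumes the option names `contentType`, `headers`, and `html` described in the project's README, and that the package is published as `@postlight/parser`; `buildParseOptions` and `extract` are hypothetical helpers, not part of the library.

```javascript
// Hypothetical helper: build the options object for a parse call.
// Assumes the library accepts contentType ('html' | 'markdown' | 'text'),
// headers (an object of request headers), and html (pre-fetched markup).
function buildParseOptions({ format = 'html', headers, html } = {}) {
  const allowed = ['html', 'markdown', 'text'];
  if (!allowed.includes(format)) {
    throw new Error(`Unsupported format: ${format}`);
  }

  const options = { contentType: format };
  if (headers) options.headers = headers; // e.g. cookies or a User-Agent
  if (html) options.html = html;          // skip the fetch, parse this instead
  return options;
}

// Hypothetical wrapper: the parser is injected so the helper can be
// exercised without network access; by default it loads the real package.
async function extract(url, settings, parser) {
  const Parser = parser || require('@postlight/parser');
  return Parser.parse(url, buildParseOptions(settings));
}

console.log(buildParseOptions({
  format: 'markdown',
  headers: { 'User-Agent': 'my-reader/1.0' },
}));
```

Validating the format before calling the parser surfaces a typo such as `'md'` as an immediate error instead of an unexpected output format.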

Sites often have unusual markup that causes generic parsing to fail. Postlight Parser addresses this by allowing custom extractors written with JavaScript and CSS selectors for specific domains. Many pre-built extractors for popular sites are included in the project, and contributors can add new ones by following a documented process.
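
To give a feel for what such an extractor could look like, here is a sketch for a made-up site. The object shape (a `domain` plus per-field `selectors` lists and an optional `clean` list) is assumed from the project's extractor documentation; the domain, the selectors, and the `missingFields` check are all invented for illustration.

```javascript
// Hypothetical custom extractor for a made-up site, example.com.
// Each field lists CSS selectors to try in order; all selectors are invented.
const exampleExtractor = {
  domain: 'example.com',
  title: { selectors: ['h1.article-title', 'h1'] },
  author: { selectors: ['.byline a', '.author-name'] },
  // Assumed form: [selector, attribute] reads an attribute instead of text.
  date_published: { selectors: [['time.published', 'datetime']] },
  content: {
    selectors: ['div.article-body', 'article'],
    // Elements to strip from the matched content.
    clean: ['.ad-slot', '.related-links', 'aside'],
  },
};

// Hypothetical sanity check: list required fields that lack selectors.
function missingFields(extractor, required = ['title', 'content']) {
  return required.filter((field) => {
    const spec = extractor[field];
    return !spec || !Array.isArray(spec.selectors) || spec.selectors.length === 0;
  });
}

console.log(missingFields(exampleExtractor));
```

Depending on the library version, an object like this may be registrable at runtime with a call such as `Parser.addExtractor(exampleExtractor)`; treat that call as an assumption to confirm against the project's README.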

A command-line tool is included alongside the library, so you can parse a URL from a terminal without writing any code. The library is dual-licensed under Apache 2.0 and MIT.

Where it fits

Build a read-later app that stores clean article text instead of full web pages.
Feed extracted article content into an AI summarizer or topic classifier without noise from ads and navigation.
Create a distraction-free reading view in a browser extension by stripping page clutter on the fly.
Scrape article metadata like author and publish date from a list of URLs for a content aggregation pipeline.
