gitmyhub

readability

JavaScript ★ 11k updated 5mo ago

A standalone version of the readability lib

Mozilla's JavaScript library that strips a webpage down to just its article text and images. It is the same code that powers Firefox's Reader View, available as a standalone package for your own projects.

JavaScript · Node.js · jsdom · setup: easy · complexity 2/5

Readability.js is the JavaScript library that powers Firefox's Reader View, the feature that strips a cluttered webpage down to just its article text and images. Mozilla has published it as a standalone package so developers can use it in their own projects without relying on Firefox itself.

The core idea is simple: you give it a web page's document, and it returns a clean article object. That object contains the article title, the cleaned-up HTML content, the plain text version (with all HTML tags removed), the author, the publication date, the language, and a short excerpt. Constructing a Readability object around the document and calling its parse method does most of the work.

The library runs in web browsers and also in server-side JavaScript environments like Node.js. In a browser you typically already have a document object to pass in. In Node.js you need a helper library to create one from raw HTML, and the README shows how to do that with a commonly used tool called jsdom.
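The Node.js path described above can be sketched roughly as follows, assuming the `@mozilla/readability` and `jsdom` packages are installed. The helper name `extractArticle` is hypothetical and not part of either library, and the two constructors are passed in as arguments so the sketch does not hard-code how your project loads them.

```javascript
// Hypothetical helper: build a document from raw HTML with jsdom,
// then hand that document to Readability.
// Assumed caller setup (not executed in this sketch):
//   const { Readability } = require("@mozilla/readability");
//   const { JSDOM } = require("jsdom");
function extractArticle(html, url, { JSDOM, Readability }) {
  // Passing the page URL lets relative links and image paths be resolved.
  const dom = new JSDOM(html, { url });

  // parse() returns an article object, or null when no article is found.
  const article = new Readability(dom.window.document).parse();
  if (article === null) {
    return null;
  }

  // Pick out the fields mentioned in the description above.
  const { title, byline, excerpt, lang, publishedTime, content, textContent } = article;
  return { title, byline, excerpt, lang, publishedTime, content, textContent };
}
```

A call would look like `extractArticle(html, "https://example.com/post", { JSDOM, Readability })`, where the URL is a placeholder for wherever the HTML was fetched from. In a browser the jsdom step is unnecessary, since the page's own `document` can be passed to `Readability` directly.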

There are a handful of optional settings you can adjust: how long an article must be before Readability bothers returning a result, whether to keep or strip CSS class names from the output, which video URLs to allow, and how to convert the final content to a string. A companion function called isProbablyReaderable gives a fast yes-or-no check on whether a page looks like an article at all, which is useful if you want to avoid running the full parsing logic on pages that are clearly not articles.
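As a rough illustration of the settings and the quick check described above, the sketch below runs `isProbablyReaderable` first and only then performs the full parse. The helper name `parseIfArticle` is hypothetical, the option values are illustrative rather than recommended, and the option names are given as best recalled from the README, so verify them against the version you install.

```javascript
// Hypothetical helper: cheap yes-or-no check first, full parse second.
// Assumed caller setup (not executed in this sketch):
//   const { Readability, isProbablyReaderable } = require("@mozilla/readability");
function parseIfArticle(document, { Readability, isProbablyReaderable }) {
  // Skip the expensive parse on pages that do not look like articles.
  if (!isProbablyReaderable(document)) {
    return null;
  }

  // Illustrative values only; tune them for your own content.
  const options = {
    charThreshold: 500, // minimum article length, in characters
    keepClasses: false, // strip CSS class names from the output
    allowedVideoRegex: /youtube\.com|vimeo\.com/i, // video URLs to keep (example pattern)
    serializer: (el) => el.innerHTML, // how the content element becomes a string
  };

  return new Readability(document, options).parse();
}
```

Returning `null` for both "not an article" and "parse found nothing" keeps the caller's handling simple. If you need to tell the two cases apart, return a distinct marker for the early exit instead.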

One important note from the README: the parsing step modifies the original document by removing elements. If you need the original page intact after parsing, clone the document first. The README also strongly recommends running the output through a sanitizer library before displaying it to users, since the library itself does not attempt to block malicious HTML.
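Both precautions can be combined in one small wrapper. This is a sketch under assumptions: `parseSafely` is a hypothetical name, and `sanitize` stands for whatever HTML sanitizer function you choose to supply (for example, the `sanitize` function of a library such as DOMPurify). Readability itself does not provide one.

```javascript
// Hypothetical helper: parse a copy of the document so the original stays
// intact, then sanitize the resulting HTML before it is shown to users.
// Assumed caller setup (not executed in this sketch):
//   const { Readability } = require("@mozilla/readability");
//   const sanitize = (html) => DOMPurify.sanitize(html); // any sanitizer works
function parseSafely(document, { Readability, sanitize }) {
  // Deep-clone first, because parsing removes elements from its input.
  const copy = document.cloneNode(true);

  const article = new Readability(copy).parse();
  if (article === null) {
    return null;
  }

  // Replace the raw HTML with the sanitized version; other fields are kept.
  return { ...article, content: sanitize(article.content) };
}
```

Cloning costs extra memory on large pages, so it is only worth doing when the original document is still needed after parsing.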

Where it fits