web100k


A compact, no-frills dataset of 100,000 real homepage HTML documents from popular domains. It is meant for benchmarking, fuzzing, and robustness testing of HTML parsers, link extractors, readability algorithms, ML preprocessing pipelines, and similar tools.
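
As a rough illustration of the robustness-testing use case, the sketch below runs a small link extractor over a directory of HTML files and records any document that makes it raise. The on-disk layout (a directory tree of `*.html` files) and the names `LinkExtractor`, `extract_links`, and `run_harness` are assumptions made for this example, not something the dataset defines; adapt the file discovery to however the documents are actually stored.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkExtractor(HTMLParser):
    """Collects non-empty href values from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in one HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def run_harness(root):
    """Run extract_links over every *.html file under root.

    Assumes a hypothetical layout of one HTML file per homepage.
    Returns (number of documents processed without error, list of failures).
    """
    ok, failed = 0, []
    for path in sorted(Path(root).rglob("*.html")):
        try:
            # Real-world pages use mixed encodings; decode leniently.
            text = path.read_bytes().decode("utf-8", errors="replace")
            extract_links(text)
            ok += 1
        except Exception as exc:
            failed.append((str(path), repr(exc)))
    return ok, failed
```

Calling `run_harness("path/to/html")` on a local copy of the documents would return the success count and a list of `(path, error)` pairs, which is usually enough to triage parser crashes. The same loop can wrap any other parser or extractor under test.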

