web100k
updated 3mo ago
A compact, no-frills dataset of 100,000 real homepage HTML documents from popular domains. It is meant for benchmarking, fuzzing, and robustness testing of HTML parsers, link extractors, readability algorithms, and ML preprocessing pipelines.
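As a rough illustration of the robustness-testing use case, here is a minimal sketch of a harness that feeds HTML documents to a parser and tallies successes, failures, and extracted links. The dataset's on-disk layout is not described here, so the sketch takes any iterable of HTML strings; `LinkCounter`, `run_harness`, and the inline `samples` are hypothetical names and data used only for illustration, with Python's standard `html.parser` standing in for the parser under test.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href; stands in for the parser under test."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1


def run_harness(documents):
    """Parse each document, recording totals and any parser exceptions."""
    results = {"ok": 0, "failed": 0, "links": 0}
    for html in documents:
        parser = LinkCounter()
        try:
            parser.feed(html)
            parser.close()
        except Exception:
            # A crash on real-world markup is exactly what the harness looks for.
            results["failed"] += 1
            continue
        results["ok"] += 1
        results["links"] += parser.links
    return results


# Hypothetical stand-ins for dataset documents (assumption: one HTML string each).
samples = [
    '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>',
    '<p>unclosed <a href="x">link',
    '<a name="anchor">no href</a>',
]
summary = run_harness(samples)
print(summary)
```

To run this against the actual dataset, replace `samples` with a generator that yields each document's contents, however the files happen to be stored.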