openvenues
==========
Open information extraction project for indexing and normalizing real-world venue/POI information from across the Web. Can be used standalone to extract venues from individual websites, or on a full-fledged copy of the entire Internet using the Common Crawl.
Project layout
--------------
- extract: the "easy way": extracts structured (or at least semi-structured) address and geo data from HTML markup. Supports schema.org microdata, RDFa Lite, hCard, geotags, HTML5 `<address>` elements, OpenGraph, and URL parameters extracted from Google Maps embeds.
- jobs: Amazon Elastic MapReduce jobs for extracting places from the Common Crawl (224TB, or 3.6+ billion URLs, available on S3 as of August 2014; new crawls are published periodically).
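
To illustrate the kind of markup the extract module targets, here is a minimal sketch that uses only the Python standard library. It is not the project's actual code: the class `GeoHintParser`, the function `extract_geo_hints`, and the list of meta keys are hypothetical names chosen for this example. It covers only geotag/OpenGraph `<meta>` values and the query parameters of Google Maps iframe embeds.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class GeoHintParser(HTMLParser):
    """Collect geo-related <meta> values and Google Maps embed query params."""

    # Illustrative subset of meta keys that commonly carry coordinates
    # (an assumption for this sketch, not the project's actual list).
    META_KEYS = {
        "geo.position", "icbm",
        "place:location:latitude", "place:location:longitude",
        "og:latitude", "og:longitude",
    }

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.map_params = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Geotags typically use name=, OpenGraph uses property=.
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key in self.META_KEYS and attrs.get("content"):
                self.meta[key] = attrs["content"]
        elif tag == "iframe":
            url = urlparse(attrs.get("src") or "")
            if "google." in url.netloc and url.path.startswith("/maps"):
                # Keep only the first value of each query parameter.
                self.map_params.append(
                    {k: v[0] for k, v in parse_qs(url.query).items()}
                )


def extract_geo_hints(html):
    """Return geo meta tags and map-embed params found in an HTML string."""
    parser = GeoHintParser()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "map_params": parser.map_params}
```

Calling `extract_geo_hints(page_source)` returns a dict with a `meta` mapping and a `map_params` list. Microdata, RDFa Lite, and hCard need real tree traversal and are left out of this sketch.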
Notes
-----
BeautifulSoup vs. lxml
The first version of the Common Crawl extraction job used lxml, a fast Python library built on the C library libxml2, for parsing. However, running that parser over billions of badly-encoded webpages revealed bugs in lxml/libxml2 related to reading from uninitialized memory at the C level (see https://bugs.launchpad.net/lxml/+bug/1240696). When triggered, the bug eats up all the system's memory and crashes the box. It occurs non-deterministically, so it is hard to track down, but it will occur, on different documents, if the job runs for long enough. Until there's a fix, lxml won't be usable for this project.

BeautifulSoup is a forgiving pure-Python regex-based "parser" designed for working with "tag soup". It's up to 100x slower than lxml, so we currently use a high-recall (not necessarily high-precision) regex to filter out documents that definitely don't contain the keywords we're looking for before committing to a full parse. With this filter, the job still completes in a reasonable amount of time using 100 8-core machines.

Coming up next:
- Address extraction (find postal addresses in text)
- Deduping and normalization of venue names, addresses and locations
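
The regex pre-filter described under "BeautifulSoup vs. lxml" can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual filter: the keyword pattern `CANDIDATE_RE` and the helpers `might_contain_venue` and `parse_candidates` are hypothetical, and the expensive parser is passed in as a plain callable. Matching on raw bytes avoids having to decode badly-encoded pages just to reject them.

```python
import re

# Hypothetical high-recall keyword pattern; the project's real pattern is
# not shown in this README. False positives are acceptable here because
# the full parse will sort them out, false negatives are not.
CANDIDATE_RE = re.compile(
    rb"itemscope|itemtype|typeof=|vcard|geo\.position|icbm|<address"
    rb"|og:latitude|place:location|maps\.google\.|google\.[a-z.]+/maps",
    re.IGNORECASE,
)


def might_contain_venue(raw_html: bytes) -> bool:
    """Cheap check: True if any keyword appears anywhere in the raw bytes."""
    return CANDIDATE_RE.search(raw_html) is not None


def parse_candidates(documents, parse):
    """Run the expensive `parse` callable only on documents that pass the filter."""
    for raw in documents:
        if might_contain_venue(raw):
            yield parse(raw)
```

In a job, `parse` would be whatever slow, forgiving parser is in use, so its cost is paid only for the fraction of pages that survive the filter.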