gitmyhub

pattern

Python ★ 8.9k updated 2y ago

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

A Python library that combines web scraping, natural language processing, and machine learning in one package. Collect data from websites and social APIs, analyze text sentiment and grammar, and classify documents without installing many separate tools.

Python · WordNet · setup: easy · complexity 2/5

Pattern is a Python library for pulling information from the web and making sense of it. It combines several different capabilities in one package: scraping content from websites and web services, analyzing the natural language in that content, running basic machine learning on the results, and visualizing how things connect to each other in a network.

The web mining part can talk to services like Google, Twitter, and Wikipedia through their APIs, and includes a general-purpose web crawler and an HTML parser for extracting structured data from pages. The language processing part can tag parts of speech (for example, whether a word is a noun or an adjective), run sentiment analysis to estimate whether a piece of text sounds positive or negative, and look up word relationships through WordNet, a lexical database of how English words relate to each other.
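To make the sentiment idea concrete, here is a minimal sketch of one common approach: averaging per-word polarity scores from a lexicon. It uses only the standard library and is not Pattern's implementation. The `POLARITY` table, its scores, and the `polarity` and `label` helpers are all hypothetical, made up for illustration.

```python
# Hypothetical polarity lexicon: word -> score in [-1.0, 1.0].
# The words and values are invented for this sketch.
POLARITY = {
    "good": 0.7, "great": 0.8, "wonderful": 1.0,
    "bad": -0.7, "awful": -1.0, "boring": -0.6,
}


def polarity(text):
    """Average the scores of known words; 0.0 means neutral or unknown."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [POLARITY[w] for w in words if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def label(text):
    """Map the averaged score to a coarse sentiment label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("A wonderful, great little film."))
print(label("The plot was boring and the ending awful."))
```

A real lexicon would be far larger and would usually handle negation and intensifiers ("not good", "very bad"), which this sketch ignores.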

The machine learning tools cover common techniques: vector space models for representing documents as numbers, clustering for grouping similar items together, and classification algorithms including K-Nearest Neighbors, Support Vector Machines, and a Perceptron. The README includes a worked example that collects tweets tagged with #win or #fail, pulls out the adjectives using the part-of-speech tagger, and trains a classifier to predict which category a new tweet belongs to.
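The pipeline in that example (adjectives, then word counts, then a nearest-neighbor vote) can be sketched in plain Python. This is a stand-in under stated assumptions, not Pattern's API: the `ADJECTIVES` set replaces a real part-of-speech tagger, the `TWEETS` list is made-up data instead of a live Twitter search, and `TinyKNN`, `adjective_counts`, and `cosine` are hypothetical names.

```python
from collections import Counter
from math import sqrt

# Hypothetical stand-in for a part-of-speech tagger: a tiny hand-picked adjective set.
ADJECTIVES = {"sweet", "awesome", "great", "stupid", "broken", "awful"}


def adjective_counts(text):
    """Return a bag-of-words vector (word -> count) of the known adjectives in text."""
    words = [w.strip(".,!?#").lower() for w in text.split()]
    return Counter(w for w in words if w in ADJECTIVES)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyKNN:
    """Minimal k-nearest-neighbors classifier over count vectors."""

    def __init__(self, k=1):
        self.k = k
        self.examples = []  # list of (vector, label) pairs

    def train(self, vector, label):
        if vector:  # skip examples with no usable features
            self.examples.append((vector, label))

    def classify(self, text):
        query = adjective_counts(text)
        if not query or not self.examples:
            return None  # nothing to compare against
        ranked = sorted(self.examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Made-up training data standing in for tweets collected by hashtag.
TWEETS = [
    ("Sweet new bike, awesome ride #win", "WIN"),
    ("Great coffee this morning #win", "WIN"),
    ("Stupid autocorrect strikes again #fail", "FAIL"),
    ("Broken elevator and awful weather #fail", "FAIL"),
]

knn = TinyKNN(k=1)
for text, label in TWEETS:
    knn.train(adjective_counts(text), label)

print(knn.classify("sweet potato burger"))  # WIN
print(knn.classify("stupid autocorrect"))   # FAIL
```

Restricting the features to adjectives keeps the vectors small and focused on opinion-bearing words, which is the point of running the tagger before training.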

Pattern supports both Python 2.7 and Python 3.6 and can be installed with pip. It bundles its own copies of several algorithms and data sets, so it does not have many external dependencies.

The project comes from academic research and has an associated paper in the Journal of Machine Learning Research. It is BSD-licensed and was developed at a university research group, with contributions from many people over the years.

Where it fits