gensim
Topic Modelling for Humans
Gensim is a Python library for topic modelling and document similarity over large text collections. It streams corpora instead of loading them into memory and includes word2vec, LDA, and LSA implementations.
Gensim is a Python library for the kind of natural-language-processing work that involves digging through enormous piles of text to find structure: discovering the hidden topics a collection of documents is about, indexing the documents, and looking up which ones are similar to a given query. The maintainers describe its audience as the natural language processing (NLP) and information retrieval (IR) communities.
The library is built around the idea that you should never have to load your whole corpus into memory at once. You hand Gensim a stream of documents, and its algorithms — including Latent Semantic Analysis, Latent Dirichlet Allocation, Random Projections, Hierarchical Dirichlet Process and the word2vec family of word-embedding methods — process them in chunks. There are efficient multi-core implementations of these algorithms, and Latent Semantic Analysis and Latent Dirichlet Allocation can also be run across a cluster of computers for very large jobs. Although Gensim itself is written in Python, the heavy lifting is delegated through NumPy down to optimised Fortran and C numerical libraries (BLAS), which is what lets it stay fast despite the high-level wrapper.
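The streaming idea can be sketched without Gensim at all. A corpus, in Gensim's sense, is any iterable that yields one sparse bag-of-words vector (a list of `(token_id, count)` pairs) per document. The class below is a minimal, hypothetical illustration that reads one document per line from a file. `StreamedCorpus`, `build_vocab`, and the whitespace tokenizer are names invented for this example; in practice `gensim.corpora.Dictionary` and its `doc2bow` method would normally take the place of the hand-rolled vocabulary.

```python
import os
import tempfile
from collections import Counter


def tokenize(line):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return line.lower().split()


def build_vocab(path):
    """First pass: give every distinct token an integer id, one line at a time."""
    vocab = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for token in tokenize(line):
                vocab.setdefault(token, len(vocab))
    return vocab


class StreamedCorpus:
    """Iterable that yields one bag-of-words vector per line of a text file.

    Only the current line is held in memory, and the object can be iterated
    repeatedly because each call to __iter__ reopens the file.
    """

    def __init__(self, path, vocab):
        self.path = path
        self.vocab = vocab

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.vocab[token] for token in tokenize(line) if token in self.vocab
                )
                yield sorted(counts.items())


# Tiny demonstration file standing in for a corpus too large for RAM.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer interaction\n")
    tmp.write("graph of trees and graph minors\n")
    corpus_path = tmp.name

vocab = build_vocab(corpus_path)
corpus = StreamedCorpus(corpus_path, vocab)
vectors = list(corpus)  # materialised here only to inspect the tiny example
os.remove(corpus_path)
```

An object shaped like this could be handed to a Gensim model wherever a corpus is expected, for example as the `corpus` argument of `gensim.models.LdaModel`, since the models consume the corpus by iterating over it.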
You would reach for Gensim if you have a large body of text (articles, support tickets, research papers, product descriptions) and want to figure out what themes run through it, build a "find me similar documents" feature, or train word vectors for a downstream model. It is installed with pip, depends on NumPy, and is currently in stable maintenance mode: bug fixes and documentation updates are still accepted but new features are not.
Where it fits
- Run LDA topic modelling on a corpus of support tickets or articles to automatically surface recurring themes.
- Train word2vec embeddings on a domain-specific text collection for use in a downstream NLP model.
- Build a find-similar-documents search feature over a large archive without loading the whole corpus into memory.
- Process a multi-gigabyte text dataset by streaming it through Gensim to avoid running out of RAM.