GloVe

C ★ 7.2k updated 10mo ago

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

A Stanford research tool that converts words into numerical vectors so computers can work with language mathematically. Pre-trained vectors for Wikipedia, Common Crawl, and Twitter are ready to download and use without any training step.

C · Python · setup: moderate · complexity 3/5

GloVe, which stands for Global Vectors for Word Representation, is a research project from Stanford University that turns words into lists of numbers so that computers can work with language mathematically. Each word in a vocabulary gets assigned a vector, which is just a fixed-length sequence of decimal numbers. The key property of these vectors is that words with similar meanings end up close to each other in the mathematical space. The classic example is that the vector for "king" minus the vector for "man" plus the vector for "woman" comes out close to the vector for "queen."
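The "king" minus "man" plus "woman" arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not code from the repository: the three-dimensional vectors below are invented so the arithmetic is easy to follow, and the `analogy` and `cosine` helpers are hypothetical names. Real GloVe vectors have tens to hundreds of dimensions and are learned from text.

```python
import math

# Toy 3-dimensional vectors invented for illustration only;
# they are NOT real GloVe values.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.1, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, table):
    """Return the word whose vector is closest to table[a] - table[b] + table[c]."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    # The three input words are excluded, as is conventional for analogy tasks.
    candidates = (w for w in table if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(table[w], target))

print(analogy("king", "man", "woman", vectors))
```

Closeness here is measured with cosine similarity, which compares the direction of two vectors and ignores their length. That is a common choice for word vectors, though other distance measures are possible.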

This kind of word representation is called a word embedding, and it was one of the foundational techniques in natural language processing before the era of large language models. Many machine learning systems that work with text have used these vectors as a starting point, and some still do.

The repository offers two ways to use GloVe. The first is to download pre-trained vectors that Stanford has already computed from large text collections. Options include vectors trained on Wikipedia, a large web crawl called Common Crawl (which covers billions of web pages), and Twitter. A 2024 update added vectors trained on the Dolma dataset, which is a 220-billion-word open-source text collection. These pre-trained files can be downloaded and used directly in other projects without any training step. The second option is to train your own vectors on a custom body of text, which is useful when the domain-specific language in a field differs significantly from general web text. The training code is written in C and runs from the command line.
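To show what "use directly" can look like, here is a rough sketch of reading the plain-text vector format, where each line holds a word followed by its numbers, separated by single spaces. The `load_glove` function is a hypothetical helper, not part of the repository, and the two sample lines are made up so the block runs without downloading anything.

```python
import io

def load_glove(lines):
    """Parse text-format vector lines into a dict of word -> list of floats."""
    table = {}
    for line in lines:
        # Split on single spaces: the first field is the word,
        # the remaining fields are the vector components.
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table

# Made-up sample standing in for a downloaded vector file.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
embeddings = load_glove(sample)
print(embeddings["cat"])
```

With a real download, the same function should work on an open file handle, for example `with open(path, encoding="utf-8") as f: embeddings = load_glove(f)`, where `path` points to whichever unzipped vector file you chose. Large files can take several gigabytes of memory when loaded this way, so a production setup would likely use arrays rather than Python lists.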

Stanford published a report analyzing the quality of the 2024 vectors. The project is licensed under the Apache 2.0 license, which allows use in commercial and open-source applications.

Where it fits

Download pre-trained GloVe vectors and use them as input features for a text classification or sentiment analysis model.
Train custom word vectors on a domain-specific corpus where specialized vocabulary differs from general web language.
Measure semantic similarity between words or perform word analogy tasks using the vector arithmetic properties.
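For the first use case above, one simple way to turn word vectors into input features is to average the vectors of the words in a text. This is a common baseline rather than anything the repository prescribes. The `sentence_vector` helper and the two-dimensional toy table are assumptions made for illustration.

```python
def sentence_vector(tokens, table, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension.
    return [sum(col) / len(known) for col in zip(*known)]

# Toy table invented for illustration; real vectors would come from a
# pre-trained file or from your own training run.
toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
features = sentence_vector("a good movie".split(), toy, dim=2)
print(features)  # [0.5, 0.5]
```

The resulting fixed-length list can be passed to any classifier that accepts numeric features. Words missing from the table, such as "a" here, are simply skipped.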
