tokenizers

Rust · ★ 11k · updated 1d ago

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

A fast Rust-based library from Hugging Face that converts text into tokens for AI models, with easy-to-use Python, Node.js, and Ruby wrappers installable in one command.

Rust · Python · Node.js · Ruby · setup: easy · complexity 2/5

Before a language model can read text, it needs to break that text into smaller pieces called tokens. Words get split into fragments, punctuation gets separated out, and special markers get inserted. This library, from Hugging Face, is the software that does that job. It supports the most widely used tokenization methods in modern AI, including Byte-Pair Encoding (BPE), WordPiece, and Unigram.
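To make the idea of splitting words into fragments concrete, here is a minimal pure-Python sketch of a single Byte-Pair Encoding training step: count adjacent symbol pairs, then merge the most frequent one. It is a toy illustration, not the library's implementation; the corpus and the function names most_frequent_pair and merge_pair are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: each word is a tuple of characters with a count.
words = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pair = most_frequent_pair(words)       # ("u", "g"), seen 20 times in total
words_after = merge_pair(words, pair)  # "hug" is now ("h", "ug")
```

A real BPE trainer repeats this step until the vocabulary reaches a target size, recording each merge so the same rules can be replayed on new text.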

The library is written in Rust, a programming language known for speed. The README notes it can process a gigabyte of text in under 20 seconds on a standard server CPU. Although the library is written in Rust, you do not need to know Rust to use it. Hugging Face provides ready-made wrappers for Python, Node.js, and Ruby, and the Python package is installable with a single pip command (pip install tokenizers).

Aside from splitting text into tokens, the library handles the surrounding preparation steps that AI models require: padding sequences to a fixed length, truncating sequences that are too long, and inserting any special tokens a particular model expects. It also tracks alignment, meaning you can trace any token back to exactly where it appeared in the original input text, which is useful when you need to highlight specific spans in the original sentence.
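The preparation steps described above can be sketched in plain Python. This is a simplified illustration of truncation, special-token insertion, padding, and offset tracking, not the library's code; the function names, the regex-based splitting, and the [CLS]/[SEP]/[PAD] markers are assumptions chosen for the example.

```python
import re


def tokenize_with_offsets(text):
    """Split into words and punctuation, keeping (start, end) character spans."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def prepare(text, max_length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Truncate, add special tokens, and pad to exactly max_length (>= 2)."""
    pieces = tokenize_with_offsets(text)
    # Truncate so the two special tokens still fit.
    pieces = pieces[: max_length - 2]
    tokens = [cls] + [tok for tok, _ in pieces] + [sep]
    # Special tokens have no source span; (0, 0) is used as a placeholder here.
    offsets = [(0, 0)] + [span for _, span in pieces] + [(0, 0)]
    attention_mask = [1] * len(tokens)
    # Pad up to the fixed length; padded positions get mask 0.
    missing = max_length - len(tokens)
    tokens += [pad] * missing
    offsets += [(0, 0)] * missing
    attention_mask += [0] * missing
    return tokens, offsets, attention_mask


text = "Hello, world!"
tokens, offsets, mask = prepare(text, max_length=8)

# Alignment: the offsets of token 3 point back into the original string.
start, end = offsets[3]
highlighted = text[start:end]  # "world"
```

Keeping the offsets alongside the tokens is what makes it possible to highlight a span in the original sentence after the model has made a prediction about a token.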

You can either train a new tokenizer vocabulary from scratch on your own text files, or load a pre-built vocabulary. The Python API keeps both options to just a few lines of code.
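As a rough sketch of what those few lines can look like, the example below trains a small BPE tokenizer on an in-memory corpus, encodes a sentence, then saves the tokenizer and loads it back. It assumes the Python package has been installed with pip install tokenizers; the corpus, vocabulary size, and file name are hypothetical values picked for illustration.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical training data; in practice this would come from your own files.
corpus = [
    "tokenizers split text into tokens",
    "text is split into smaller pieces",
    "special tokens mark the start and end of text",
]

# Option 1: train a new vocabulary from scratch.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("tokenizers split text")
# encoding.tokens, encoding.ids, and encoding.offsets are parallel lists.

# Option 2: load a previously built tokenizer from a JSON file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer.json")
    tokenizer.save(path)
    reloaded = Tokenizer.from_file(path)

reloaded_ids = reloaded.encode("tokenizers split text").ids
```

Training from files instead of an in-memory list works the same way, using tokenizer.train with a list of file paths in place of train_from_iterator.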

Hugging Face created and maintains the library. It is used across their broader ecosystem of AI tools and is intended for both research and production deployment.

Where it fits

Prepare text data for a language model by tokenizing it with BPE, WordPiece, or Unigram methods.
Train a custom tokenizer vocabulary on your own text dataset in just a few lines of Python.
Process a gigabyte of text in under 20 seconds for large-scale dataset preparation.
Trace any token back to its exact position in the original text when building NLP pipelines.
