gitmyhub

linguist

Ruby ★ 14k updated 2d ago

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!

Linguist is the Ruby library GitHub uses internally to detect programming languages in a repo, generate the colored language bar, and control syntax highlighting.

Ruby · C · CMake · ICU · setup: moderate · complexity: 3/5

Linguist is the Ruby library that GitHub itself uses to figure out which programming languages a repository contains. When you visit a project page on GitHub and see that colored breakdown bar showing something like 70% Ruby and 25% C, that data comes from Linguist. Beyond language detection, the library also helps GitHub ignore binary or vendored files, hide auto-generated content from diffs, and apply the right syntax highlighting.

You can install and run Linguist yourself as a Ruby gem. Because it relies on two compiled dependencies, charlock_holmes for character-encoding detection and rugged for reading git history through libgit2, you need some system packages installed first. The README lists the exact commands for macOS (via Homebrew) and Ubuntu, covering things like cmake, ICU, and OpenSSL. It also warns that the version of Ruby bundled with macOS often causes problems, and recommends using a separate Ruby install via Homebrew, rbenv, or a similar tool.

Once installed, a command-line tool called github-linguist works in two modes. Point it at a folder or git repository and it prints each detected language with its percentage and total byte size. Point it at a single file and it reports that file's type, MIME type, and detected language. You can run it against a specific git revision, like a tag or branch, so you can see how the language mix looked at any point in history.

Several flags adjust the output. --breakdown shows a per-file breakdown instead of just totals. --strategies reveals which detection strategy was used for each file, such as file extension, filename pattern, or heuristic analysis. --json emits everything as JSON for use in scripts or other tools.
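To make the scripting use case concrete, here is a minimal Ruby sketch that builds a github-linguist command line and summarizes its JSON output. The helper names linguist_argv and summarize are hypothetical, the flag spellings (--rev, --breakdown, --strategies, --json) are as recalled from the Linguist README and should be checked against your installed version, and the JSON shape (a language name mapped to an object with a "size" field) is an assumption rather than a documented contract.

```ruby
require "json"

# Build the argument list for the github-linguist CLI.
# Flag names are assumed from the Linguist README; verify with your version.
def linguist_argv(path = ".", rev: nil, breakdown: false, strategies: false, json: false)
  argv = ["github-linguist", path]
  argv += ["--rev", rev] if rev           # analyze a tag, branch, or commit
  argv << "--breakdown" if breakdown      # per-file listing instead of totals
  argv << "--strategies" if strategies    # show which detection strategy matched
  argv << "--json" if json                # machine-readable output
  argv
end

# Turn JSON output into [language, bytes, percent] rows, largest first.
# Assumed input shape: {"Ruby": {"size": 7000}, "C": {"size": 2500}}
def summarize(json_text)
  sizes = JSON.parse(json_text).transform_values { |info| info.fetch("size") }
  total = sizes.values.sum
  sizes.sort_by { |_language, bytes| -bytes }.map do |language, bytes|
    percent = total.zero? ? 0.0 : (100.0 * bytes / total).round(2)
    [language, bytes, percent]
  end
end

# Hard-coded sample standing in for real CLI output, so the sketch runs
# without Linguist installed.
sample = '{"Ruby":{"size":7000},"C":{"size":2500},"Shell":{"size":500}}'
summarize(sample).each do |language, bytes, percent|
  puts format("%-6s %8d bytes %6.2f%%", language, bytes, percent)
end
```

In a real script, the sample string would be replaced by the captured output of the command, for example via Open3.capture2(*linguist_argv(".", json: true)). Percentages are recomputed from byte sizes here so the sketch depends on as little of the output format as possible.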

Projects can override Linguist's guesses through a .gitattributes file, forcing specific files or extensions to be counted as a different language. The README includes documentation links for deeper topics like how the detection works internally, how to configure overrides, and how to contribute fixes when a repo's language is being reported incorrectly.
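As an illustration of such overrides, a .gitattributes file at the repository root might look like the sketch below. The linguist-language, linguist-vendored, and linguist-generated attributes come from Linguist's override documentation; the file patterns are hypothetical and chosen only for the example.

```gitattributes
# Count every .rb file as Java instead of Ruby (illustrative pattern)
*.rb linguist-language=Java

# Treat a hypothetical third-party directory as vendored, so it is excluded from the stats
third_party/** linguist-vendored

# Mark a hypothetical build artifact as generated, so it is hidden in diffs
dist/bundle.js linguist-generated
```

Overrides take effect once the .gitattributes file is committed, since the statistics are computed from the repository's git contents.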

Where it fits