gitmyhub

magika

Python · ★ 17k · updated 9d ago

Fast and accurate AI powered file content types detection

Magika is a Google tool that identifies what type a file really is by inspecting its contents with a small AI model. It recognizes more than 200 content types and takes about 5 milliseconds per file on a standard CPU.

Python · Rust · JavaScript · TypeScript · Go · deep learning · setup: easy · complexity 2/5

Magika is a tool from Google that figures out what kind of file something actually is (Python source, a Word document, a PNG image, a Dockerfile, and so on) by looking at the contents rather than trusting the extension. Knowing the real type matters for security, because a wrong assumption about a file's type is how malicious files slip past scanners.

What makes Magika unusual is that it uses a small deep learning model for the job instead of hand-written rules. The model is only a few megabytes, runs on a single CPU, and can identify a file in about five milliseconds. It was trained on roughly 100 million samples across more than 200 content types, both binary and textual, and reaches around 99 percent average precision and recall on the test set. Inference time stays nearly constant regardless of file size because Magika inspects only a limited portion of each file's content. It offers three prediction modes (high-confidence, medium-confidence, and best-guess) and, when the model is unsure, falls back to generic labels such as "Generic text document" or "Unknown binary data".
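The two ideas in the paragraph above, bounded input and confidence-gated fallback, can be sketched in a few lines of Python. This is an illustration only and not Magika's implementation: the names `sample_window`, `choose_label`, and `Mode`, the window size, the head-plus-tail sampling, and the threshold values are all assumptions made for the example.

```python
from enum import Enum


class Mode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical thresholds, chosen only to make the example concrete.
THRESHOLDS = {
    Mode.HIGH_CONFIDENCE: 0.95,
    Mode.MEDIUM_CONFIDENCE: 0.50,
    Mode.BEST_GUESS: 0.0,
}

GENERIC_TEXT = "Generic text document"
GENERIC_BINARY = "Unknown binary data"


def sample_window(data: bytes, window: int = 1024) -> bytes:
    """Return a bounded sample: the first and last `window` bytes.

    The result never exceeds 2 * window bytes, so the cost of any
    later step does not grow with the size of the file.
    """
    if len(data) <= 2 * window:
        return data
    return data[:window] + data[-window:]


def choose_label(label: str, score: float, is_text: bool, mode: Mode) -> str:
    """Keep the model's label only if its score clears the mode's threshold.

    Otherwise fall back to a generic label that still records whether
    the content looked textual or binary.
    """
    if score >= THRESHOLDS[mode]:
        return label
    return GENERIC_TEXT if is_text else GENERIC_BINARY
```

Under these assumed thresholds, a prediction of `"python"` with score 0.60 is kept in medium-confidence mode but becomes "Generic text document" in high-confidence mode, while best-guess mode always returns the model's top label.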

You can run Magika against one file, many files, or a directory recursively, and it prints the detected type for each, optionally as MIME types, plain labels, JSON, or JSONL. The command-line tool is written in Rust; there is also a Python package, a JavaScript/TypeScript package that powers an in-browser demo, and Go bindings that are still in progress. Google itself uses Magika at scale to route files in Gmail, Drive, and Safe Browsing to the right scanners, and it is integrated with VirusTotal and abuse.ch. Reach for it when you need fast, accurate file-type identification in security pipelines, malware analysis, or any code that has to behave differently per file type.
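A minimal sketch of the "scan a directory, emit JSONL, route by type" workflow described above, written so that the detector is pluggable. Everything named here is hypothetical: `toy_detector` is a stand-in that checks two magic numbers (exactly the kind of hand-written rule Magika replaces), and the `SCANNERS` table and scanner names are invented for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator

Detector = Callable[[bytes], str]


def toy_detector(data: bytes) -> str:
    """Stand-in detector (not Magika): recognizes two magic numbers."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"


def scan(root: Path, detect: Detector) -> Iterator[Dict[str, str]]:
    """Walk `root` recursively and yield one record per regular file."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield {"path": str(path), "label": detect(path.read_bytes())}


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical routing table: detected label -> downstream scanner name.
SCANNERS = {"png": "image-scanner", "pdf": "document-scanner"}


def route(label: str) -> str:
    """Pick a scanner for a label, defaulting to a generic one."""
    return SCANNERS.get(label, "generic-scanner")
```

Calling `print(to_jsonl(scan(Path("."), toy_detector)))` prints one JSON object per file under the current directory. In a real pipeline the `detect` argument would wrap a content-based detector; the Magika Python package exposes a `Magika` class with an `identify_bytes` method that could fill that role, though the result fields have differed between releases, so check the documentation for the installed version.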

Where it fits