magika
Fast and accurate AI-powered file content type detection
Magika is a Google tool that identifies what type a file really is by inspecting its contents with a small AI model, detecting over 200 file types in about 5 milliseconds on a standard CPU.
Magika is a tool from Google that figures out what kind of file something actually is (Python source, a Word document, a PNG image, a Dockerfile, and so on) by looking at the contents rather than trusting the extension. Knowing the real type matters for security: a wrong assumption about a file's type is how malicious files sneak past scanners.
What makes Magika unusual is that it uses a small deep learning model for the job instead of hand-written rules. The model is only a few megabytes, runs on a single CPU, and can identify a file in about five milliseconds. It was trained on roughly 100 million samples across more than 200 content types, both binary and textual, and reaches around 99 percent average precision and recall on the test set. Inference time stays nearly constant regardless of file size because Magika inspects only a limited portion of each file's content rather than reading the whole file. It offers three prediction modes (high-confidence, medium-confidence, and best-guess) and falls back to generic labels like Generic text document or Unknown binary data when it is unsure.
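The two mechanisms described above, fixed-size sampling and confidence-gated labels, can be sketched in a few lines of Python. This is only an illustration, not Magika's implementation: the slice size, the threshold values, and the names `sample_slices` and `choose_label` are all assumptions made up for this example.

```python
from typing import Dict

# Hypothetical values for illustration only; the text above does not state
# Magika's real slice sizes or confidence thresholds.
SLICE_BYTES = 1024
THRESHOLDS: Dict[str, float] = {
    "high-confidence": 0.95,
    "medium-confidence": 0.50,
    "best-guess": 0.0,
}


def sample_slices(data: bytes, n: int = SLICE_BYTES) -> bytes:
    """Return at most 2*n bytes: the first n and the last n bytes of data.

    Because the result is bounded, the work done by a model that consumes it
    does not grow with the size of the file.
    """
    if len(data) <= 2 * n:
        return data
    return data[:n] + data[-n:]


def choose_label(scores: Dict[str, float], mode: str, looks_textual: bool) -> str:
    """Pick the top-scoring label, or a generic fallback if it is not confident enough."""
    best_label = max(scores, key=lambda label: scores[label])
    if scores[best_label] >= THRESHOLDS[mode]:
        return best_label
    # Below the threshold for this mode: fall back to a generic label.
    return "Generic text document" if looks_textual else "Unknown binary data"
```

With these made-up thresholds, a top score of 0.70 is accepted in medium-confidence and best-guess modes but is replaced by a generic label in high-confidence mode.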
You run Magika against one file, many files, or a directory recursively, and it prints the detected type for each, optionally as MIME types, plain labels, JSON, or JSONL. The command-line tool is written in Rust; there is also a Python package, a JavaScript/TypeScript package that powers an in-browser demo, and Go bindings in progress. Google itself uses Magika at scale to route files in Gmail, Drive, and Safe Browsing to the right scanners, and it is integrated with VirusTotal and abuse.ch. Reach for it for fast, accurate file-type identification in security pipelines, malware analysis, or any code that has to behave differently per file type.
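To make the recursive, one-record-per-file workflow concrete, here is a minimal sketch that walks a directory and emits one JSON line per file. It does not call Magika: the `detect` callable is a stand-in for whatever detector is plugged in, and the `path` and `label` field names are assumptions for this example, not the tool's actual output schema.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def scan_tree(root: Path, detect: Callable[[bytes], str]) -> Iterator[str]:
    """Yield one JSON line per regular file under root, in sorted path order."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            record = {
                # Hypothetical field names chosen for this sketch.
                "path": path.relative_to(root).as_posix(),
                "label": detect(path.read_bytes()),
            }
            yield json.dumps(record)
```

JSONL suits this kind of batch job because each line is a complete record, so a downstream consumer can start processing before the scan finishes.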
Where it fits
- Identify the true type of user-uploaded files in a security pipeline to route them to the correct scanner.
- Detect disguised or misnamed malicious files during malware analysis workflows.
- Build a file processing app that behaves differently depending on whether an upload is a PDF, image, script, or archive.
- Replace extension-based file type guessing in a batch processing pipeline with accurate content-based detection.
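The routing use cases above boil down to a lookup from detected content type to a handler. The sketch below shows that pattern with a pluggable detector. The scanner names, the `ROUTES` table, and `stub_detect` (a tiny magic-byte check standing in for a real detector) are all hypothetical and exist only so the example runs on its own.

```python
from typing import Callable, Dict

Detector = Callable[[bytes], str]

# Hypothetical mapping from content-type label to scanner name.
ROUTES: Dict[str, str] = {
    "pdf": "document_scanner",
    "png": "image_scanner",
    "zip": "archive_scanner",
}


def route_upload(content: bytes, detect: Detector, default: str = "quarantine") -> str:
    """Choose a scanner from the detected content type, ignoring any file name."""
    label = detect(content)
    # Unrecognized types go to a conservative default rather than being skipped.
    return ROUTES.get(label, default)


def stub_detect(content: bytes) -> str:
    """Stand-in detector keyed on a few well-known magic bytes (not Magika)."""
    if content.startswith(b"%PDF-"):
        return "pdf"
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if content.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"
```

In a real pipeline, `stub_detect` would be replaced by a thin wrapper around a content-based detector such as Magika's Python package. Because the decision never looks at the extension, a PDF uploaded as `invoice.png` is still sent to the document scanner.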