neuraltalk2
Efficient Image Captioning code in Torch, runs on GPU
NeuralTalk2 Explanation
This project automatically generates written descriptions of images. You give it a photo, and it outputs a sentence describing what's in it, such as "a dog running in a park" or "two people sitting at a table." It's built to run fast on graphics processors (GPUs), which makes it practical for processing large batches of images or for near-real-time use.
The system works by combining two types of neural networks. First, a convolutional visual network (VGGNet) turns the image into a compact representation of the objects and scenes it contains. That representation is then fed into a recurrent language network that generates the description one word at a time, similar to how predictive text works on your phone. This two-step approach (see, then describe) lets the model capture both what is visually present and how to talk about it naturally.
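The word-by-word generation step can be sketched in a few lines of Python. This is only an illustration, not the project's Torch code: `greedy_caption`, `toy_scores`, and the bigram table are hypothetical stand-ins, and the sketch assumes the simplest decoding strategy (always take the highest-scoring next word), which may differ from what the actual implementation does.

```python
from typing import Callable, Dict, List

START, END = "<start>", "<end>"


def greedy_caption(
    image_features: List[float],
    next_word_scores: Callable[[List[float], List[str]], Dict[str, float]],
    max_len: int = 16,
) -> List[str]:
    """Build a caption one word at a time, always taking the top-scoring word."""
    words = [START]
    for _ in range(max_len):
        scores = next_word_scores(image_features, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return words[1:]  # drop the start token


def toy_scores(image_features: List[float], words: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for the language network.

    A fixed bigram table whose subject depends on the first image feature.
    A trained model would learn these scores from captioned images.
    """
    subject = "dog" if image_features[0] > 0.5 else "cat"
    table = {
        START: {"a": 1.0},
        "a": {subject: 1.0},
        subject: {"running": 0.9, END: 0.1},
        "running": {END: 1.0},
    }
    return table[words[-1]]


print(" ".join(greedy_caption([0.9, 0.1], toy_scores)))
```

The image features influence every step because they are passed to the scoring function along with the words generated so far, which is the essence of the "see, then describe" design.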
The main selling point compared to the original NeuralTalk is speed. Thanks to batching images together, GPU acceleration, and other engineering improvements, training a good model takes just 2-3 days instead of weeks. The included pretrained model (trained on MS COCO, a large dataset of captioned photos) performed well enough to rank around eighth on the MS COCO captioning leaderboard at the time of release.
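Batching simply means grouping several images so the GPU processes them in one pass instead of one at a time. A minimal Python sketch of the grouping step follows; the helper name `minibatches` and the file names are hypothetical, and the real code does this on Torch tensors rather than Python lists.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def minibatches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names, used only to show the grouping.
image_paths = [f"img_{i}.jpg" for i in range(5)]
for batch in minibatches(image_paths, batch_size=2):
    print(batch)
```

The last batch may be smaller than the others when the number of images is not a multiple of the batch size.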
You can use this in three ways. If you just want captions for your own photos, download the pretrained checkpoint and run the evaluation script on a folder of images. If you want to train your own model, the code walks you through preparing a dataset and running the training pipeline. The project also supports real-time video captioning if you install an additional computer vision library. One trade-off: there are many dependencies to install (Torch, CUDA, various libraries), though a Docker container is available to simplify setup.
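For the first use case, the evaluation script is run from the command line. The Python sketch below only assembles that command; the helper `build_eval_command` and the example paths are hypothetical, and the flag names (`-model`, `-image_folder`, `-num_images`) are given as recalled from the project README, so check them against your checkout before relying on them.

```python
from typing import List, Optional


def build_eval_command(
    model_path: str,
    image_folder: str,
    num_images: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for captioning a folder of images."""
    # Flag names are an assumption based on the project README.
    cmd = ["th", "eval.lua", "-model", model_path, "-image_folder", image_folder]
    if num_images is not None:
        cmd += ["-num_images", str(num_images)]
    return cmd


# Placeholder paths; substitute your own checkpoint and image directory.
cmd = build_eval_command("/path/to/model", "/path/to/image/directory", num_images=10)
print(" ".join(cmd))
```

With Torch and the other dependencies installed, the resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root.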
The README notes that Google Brain released a similar but superior model around the same time, but this codebase remains useful as an educational Torch implementation and still delivers solid results.