detr

Python ★ 15k updated 2y ago ▣ archived

End-to-End Object Detection with Transformers

DETR, short for Detection Transformer, is research code from Facebook AI for end-to-end object detection. Object detection is the computer vision task of looking at an image and not just saying what is in it, but also drawing a box around each thing and labelling it (a dog here, a car there). The README explains that DETR replaces the usual hand-crafted detection pipeline (lots of stages and special-purpose tricks) with a single Transformer model that produces the whole set of boxes in one shot.

The README describes how it works in plain terms: DETR treats detection as a direct set prediction problem. Given a small, fixed set of learned object queries, a Transformer encoder-decoder reasons about how the objects relate to each other and to the global image context, then outputs the final set of predictions in parallel. During training, a set-based global loss uses bipartite matching to pair each ground-truth object with exactly one prediction, which forces the predictions to be unique. The repo states that this approach matches Faster R-CNN with a ResNet-50 backbone on the COCO benchmark while using half the computation (FLOPs). It also notes that the inference logic can be written in about 50 lines of code.
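The one-to-one matching idea can be illustrated with a small standard-library sketch. It is not the repository's code: the cost function, the dictionary layout, and the names `match_cost` and `bipartite_match` are assumptions made for illustration, and the brute-force search stands in for the Hungarian algorithm that practical implementations normally use.

```python
from itertools import permutations


def box_l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target):
    """Hypothetical pairwise cost: low when the predicted probability of the
    target's class is high and the two boxes are close."""
    class_cost = -pred["probs"][target["label"]]
    return class_cost + box_l1(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Return the lowest-cost one-to-one assignment as (pred_idx, target_idx)
    pairs, plus its total cost. Brute force, so only for tiny inputs."""
    assert len(targets) <= len(preds), "need at least one prediction per target"
    best_cost, best_pairs = float("inf"), []
    # Try every way of giving each target its own distinct prediction.
    for chosen in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(chosen)]
    return best_pairs, best_cost


# Three predictions (one per object query) and two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.8, 0.2], "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": [0.5, 0.5], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.3, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]

pairs, cost = bipartite_match(preds, targets)
print(pairs)  # [(1, 0), (0, 1)]
```

In this toy example prediction 2 is left unmatched. In a set-prediction setup, unmatched predictions are typically trained to output a "no object" class, which is how a fixed number of queries can cover images with varying numbers of objects.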

You would use DETR if you are training or experimenting with object detection or panoptic segmentation models and want a clean starting point with minimal dependencies. The code is written in PyTorch (a Python deep learning framework), its dependencies are installed via conda, and it ships with pretrained models, Colab notebooks, and an optional Detectron2 wrapper.
