RetinaNet


Overview

RetinaNet is a one-stage object detection model built around the focal loss, a loss function designed to address the extreme foreground-background class imbalance encountered when training dense detectors. It was introduced by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár in the 2017 paper "Focal Loss for Dense Object Detection". The model is a single, unified network composed of a backbone network and two task-specific subnetworks: the backbone computes a convolutional feature map over the entire input image, while the two subnetworks classify objects and regress their bounding boxes, respectively.

A screenshot of the RetinaNet model detecting objects in an image.

Architecture

RetinaNet consists of a backbone network (a ResNet in the original paper) augmented with a Feature Pyramid Network (FPN), which turns a single-scale input image into a rich, multi-scale feature pyramid. Attached to each level of the pyramid are two small subnetworks: a convolutional network for object classification and a second convolutional network for bounding box regression.
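For readers who want to try the model, torchvision ships a reference implementation with a ResNet-50 + FPN backbone. The sketch below assumes a recent torchvision (0.13 or later, where pretrained COCO weights are selected via the weights argument) and uses a hypothetical image path:

    # Minimal inference sketch using torchvision's reference RetinaNet.
    import torch
    import torchvision
    from torchvision.io import read_image
    from torchvision.transforms.functional import convert_image_dtype

    model = torchvision.models.detection.retinanet_resnet50_fpn(weights="DEFAULT")
    model.eval()

    # "example.jpg" is a placeholder path; inputs are CHW floats in [0, 1].
    img = convert_image_dtype(read_image("example.jpg"), torch.float)
    with torch.no_grad():
        predictions = model([img])  # one dict per input image

    # Each dict holds 'boxes' (N x 4), 'scores' (N,) and 'labels' (N,).
    print(predictions[0]["boxes"].shape)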

Feature Pyramid Network

The FPN is a convolutional architecture designed to extract features from an image at multiple scales. It adds a top-down pathway with lateral connections that fuses semantically strong features from deeper, lower-resolution layers with the high-resolution features of earlier layers. This results in a feature pyramid that has rich semantic information at all scales.
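The following minimal sketch shows the top-down pathway with lateral connections in PyTorch. The stage names (C3-C5, P3-P5) and channel counts follow common FPN conventions for a ResNet backbone and are illustrative rather than the exact configuration of any particular implementation:

    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleFPN(nn.Module):
        # Illustrative FPN top-down pathway over three backbone stages.
        def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
            super().__init__()
            # 1x1 lateral convs project each backbone stage to a common width.
            self.lateral = nn.ModuleList(
                nn.Conv2d(c, out_channels, 1) for c in in_channels)
            # 3x3 output convs smooth the merged maps.
            self.output = nn.ModuleList(
                nn.Conv2d(out_channels, out_channels, 3, padding=1)
                for _ in in_channels)

        def forward(self, c3, c4, c5):
            # Start from the coarsest, most semantic level ...
            p5 = self.lateral[2](c5)
            # ... then repeatedly upsample (nearest neighbor) and fuse with the
            # lateral projection of the next, higher-resolution level.
            p4 = self.lateral[1](c4) + F.interpolate(p5, size=c4.shape[-2:])
            p3 = self.lateral[0](c3) + F.interpolate(p4, size=c3.shape[-2:])
            return [self.output[0](p3), self.output[1](p4), self.output[2](p5)]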

Classification Subnetwork

The classification subnetwork is a fully convolutional network that predicts the probability of object presence at each spatial position, for each of the A anchor boxes and K object classes. It is composed of four 3x3 convolutional layers with 256 filters each, followed by ReLU activations and a final 3x3 convolutional layer with K*A filters that outputs the classification predictions.
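As a sketch, the head described above can be written in a few lines of PyTorch. The class and anchor counts (K = 80 classes, A = 9 anchors, matching the paper's COCO setup) are illustrative defaults:

    import torch.nn as nn

    def make_cls_head(num_classes=80, num_anchors=9, channels=256):
        # Classification subnetwork: four 3x3/256 convs with ReLU, then a
        # final 3x3 conv emitting K*A logits per spatial position (the
        # sigmoid is applied later, inside the loss).
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1))
        return nn.Sequential(*layers)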

Regression Subnetwork

The regression subnetwork is also a fully convolutional network, but it predicts the offset from each anchor box to a nearby ground-truth object, if one exists. Like the classification subnetwork, it is composed of four 3x3 convolutional layers with 256 filters each, followed by a final convolutional layer with 4A filters that outputs the four box offsets per anchor. Although the two subnetworks share a common design, they use separate parameters.
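A matching sketch of the regression head differs only in its final layer, which emits four box offsets per anchor instead of K class logits:

    import torch.nn as nn

    def make_box_head(num_anchors=9, channels=256):
        # Regression subnetwork: same trunk as the classification head,
        # but the last conv predicts 4 offsets (dx, dy, dw, dh) per anchor.
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, num_anchors * 4, 3, padding=1))
        return nn.Sequential(*layers)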

Focal Loss

One of the key contributions of the RetinaNet paper is the introduction of a new loss function, called Focal Loss. It is designed to address the extreme class imbalance encountered during training, where the vast majority of anchors are easy negatives that contribute no useful learning signal. Focal Loss adds a modulating factor to the standard cross-entropy criterion, yielding FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the model's estimated probability for the ground-truth class and γ ≥ 0 is a focusing parameter. The factor down-weights the loss assigned to well-classified examples, so training concentrates on hard, misclassified ones; the paper additionally weights the loss with a class-balancing factor α and reports γ = 2, α = 0.25 as a good default.
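A minimal PyTorch sketch of the α-balanced focal loss follows, using the paper's defaults γ = 2 and α = 0.25; in practice the summed loss is normalized by the number of anchors assigned to ground-truth boxes:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the
        # predicted probability of the true class. logits and targets share
        # the same shape; targets hold 0./1. float labels.
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).sum()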

Performance

RetinaNet achieved state-of-the-art accuracy on the COCO benchmark at the time of its publication, matching the speed of earlier one-stage detectors while surpassing the accuracy of contemporary two-stage detectors, and it has since been applied to other datasets such as PASCAL VOC and Open Images. Its feature pyramid network makes it comparatively effective at detecting small objects.

Applications

RetinaNet can be used in a variety of applications that require accurate object detection, such as autonomous driving, surveillance, image recognition, and robotics. Its ability to detect objects across a wide range of scales makes it particularly useful in scenarios where the objects of interest vary in size.

See Also