YOLO (You Only Look Once)

From Canonica AI

Introduction

YOLO, an acronym for "You Only Look Once", is a real-time object detection system. It applies a single neural network to the full image; the network divides the image into regions and predicts bounding boxes and class probabilities for each region. These bounding boxes are weighted by the predicted probabilities to detect objects.

Image of a computer screen displaying an image with multiple objects, each surrounded by a bounding box with a label indicating the object's name.

Overview

YOLO was first introduced in a 2015 paper by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. The system is unique in its approach: it frames object detection as a single regression problem, mapping image pixels directly to spatially separated bounding boxes and associated class probabilities. Unlike detection systems that pursue high accuracy with little regard for processing time, YOLO aims for a good balance between speed and accuracy.

Design and Functionality

YOLO uses a single Convolutional Neural Network (CNN) to simultaneously predict multiple bounding boxes and class probabilities for those boxes. The system divides the image into an S x S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S x S x (B*5 + C) tensor.
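The output tensor shape follows directly from this encoding. A minimal sketch, using the Pascal VOC configuration reported in the original paper (S = 7, B = 2, C = 20):

```python
# Output tensor shape for the original YOLO head on Pascal VOC.
S, B, C = 7, 2, 20   # grid size, boxes per cell, number of classes

# Each box contributes 5 values: x, y, w, h, confidence.
depth = B * 5 + C
print((S, S, depth))  # (7, 7, 30)
```

With these settings the network's final layer outputs a 7 x 7 x 30 tensor, matching the S x S x (B*5 + C) formula above.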

The confidence score for a box is defined as Pr(Object) * IOU(truth, pred), where IOU(truth, pred) is the intersection over union of the predicted box and the ground truth box. If no object exists in that cell, the confidence score should be zero. Otherwise, the confidence score should equal the intersection over union (IOU) between the predicted box and the ground truth.
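The IOU term can be computed directly from box coordinates. A minimal sketch (the function name and corner-coordinate convention are ours, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) in the same units.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two unit-overlap 2x2 boxes: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ≈ 0.1428
```

A perfectly matched prediction has IOU 1.0, so in a cell that contains an object the target confidence Pr(Object) * IOU is also 1.0.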

Training

During training, YOLO uses sum-squared error between the predictions and the ground truth to calculate loss. The loss function is composed of the classification loss, the localization loss (errors between the predicted bounding box and the ground truth), and the confidence loss.
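These three terms can be sketched for a single grid cell as below. This is a simplified, single-box illustration with the paper's lambda weights (5.0 for coordinates, 0.5 for empty cells), not the full multi-box formulation; the function name and argument layout are ours. Square roots of width and height are compared so that errors in small boxes weigh more than equal errors in large ones, as in the original loss.

```python
import numpy as np

def yolo_cell_loss(pred_box, true_box, pred_conf, true_conf,
                   pred_cls, true_cls, has_obj,
                   lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified single-cell YOLO loss (illustrative sketch).

    pred_box / true_box: (x, y, w, h) with w, h >= 0.
    pred_cls / true_cls: arrays of class probabilities.
    """
    # Localization: center error plus square-root width/height error.
    xy_err = (pred_box[0] - true_box[0]) ** 2 + (pred_box[1] - true_box[1]) ** 2
    wh_err = ((np.sqrt(pred_box[2]) - np.sqrt(true_box[2])) ** 2
              + (np.sqrt(pred_box[3]) - np.sqrt(true_box[3])) ** 2)
    # Confidence and classification errors.
    conf_err = (pred_conf - true_conf) ** 2
    cls_err = float(np.sum((pred_cls - true_cls) ** 2))
    if has_obj:
        return lambda_coord * (xy_err + wh_err) + conf_err + cls_err
    # Cells without objects are penalized only on confidence, down-weighted.
    return lambda_noobj * conf_err
```

For a perfect prediction the loss is zero; for an empty cell that wrongly predicts confidence 0.4, the loss is 0.5 * 0.4² = 0.08.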

The model predicts an objectness score for each bounding box using logistic regression. This score should be 1 for the bounding box prior that overlaps a ground truth object by more than any other prior. If a prior is not the best match but still overlaps a ground truth object by more than a set threshold, its prediction is ignored; the remaining priors are treated as background.

YOLO uses features from the entire image to predict each bounding box, and it predicts boxes across all classes for every grid cell simultaneously. This global view gives the network context about the whole scene, which helps it distinguish objects from background.

Performance

YOLO is significantly faster than other object detection systems. It processes images in real time at 45 frames per second while achieving more than double the mAP (mean Average Precision) of other real-time systems. On a Titan X GPU, YOLO processes 45 frames per second, while Faster R-CNN, a popular region-proposal detector, processes at most around 18 frames per second.

However, YOLO has some limitations. It struggles with small objects that appear in groups, such as birds in the sky or people in a crowd, and it makes more localization errors than comparable methods. On the other hand, because it sees the entire image at prediction time, it produces fewer background false positives than region-proposal systems such as Fast R-CNN.

Variants and Improvements

Since the original YOLO, several improvements and variants have been introduced. These include YOLOv2 (YOLO9000), YOLOv3, and YOLOv4.

YOLOv2, also known as YOLO9000, introduced several novel concepts. It uses anchor boxes to predict bounding boxes, and a new network architecture called Darknet-19. It also introduced multi-scale training, where the network is trained to predict objects on different scales.
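With anchor boxes, the network predicts offsets relative to a prior rather than raw coordinates. A sketch of the YOLOv2-style decoding, where (cx, cy) is a grid cell's top-left corner and (pw, ph) an anchor's dimensions, all in grid-cell units (the function name is ours):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode a YOLOv2-style raw prediction relative to an anchor box."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)    # sigmoid keeps the center inside its cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)   # exponential scales the anchor's size
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero offsets: center of cell (3, 4), anchor shape unchanged.
print(decode_box(0, 0, 0, 0, cx=3, cy=4, pw=1.5, ph=2.0))
# (3.5, 4.5, 1.5, 2.0)
```

Constraining the center to its own cell makes training more stable than directly regressing coordinates, which is one of the motivations the YOLO9000 paper gives for this parameterization.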

YOLOv3 makes a few changes to improve training and increase performance. It makes predictions at three scales, with three anchor boxes per scale, and introduces a new feature extractor called Darknet-53.
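The three prediction scales correspond to feature maps at strides 32, 16, and 8. A sketch of the resulting output shapes, assuming the standard 416x416 input and the 80-class COCO setting:

```python
def head_shapes(img_size=416, num_classes=80, anchors_per_scale=3):
    """Output tensor shapes for YOLOv3's three detection scales.

    Assumes the standard strides of 32, 16, and 8; each anchor
    predicts 4 box coordinates + 1 objectness + class scores.
    """
    shapes = []
    for stride in (32, 16, 8):
        n = img_size // stride
        shapes.append((n, n, anchors_per_scale * (5 + num_classes)))
    return shapes

print(head_shapes())  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```

The coarse 13x13 map detects large objects, while the fine 52x52 map handles small ones, which is how YOLOv3 improves on earlier versions' weakness with small objects.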

YOLOv4 introduces several new techniques to improve speed and accuracy. It uses a modified version of Darknet called CSPDarknet53 as the feature extractor, and it uses PANet path aggregation and a SAM (spatial attention) block in the detector part of the network.

Applications

YOLO has been used in a variety of applications. It is particularly useful in situations where real-time detection is important, such as in self-driving cars. It can also be used in video surveillance to detect and track objects, in retail to detect and track goods, and in many other applications.

See Also