Xuehao Liu
YOLO: One stage object detection
Updated: Mar 14, 2019
There are two kinds of object detectors: one-stage and two-stage. If the network first proposes potential regions that may contain objects and then classifies them, it is two-stage. If it predicts object boxes directly from the feature maps, it is one-stage.
You Only Look Once (YOLO) is one of the most widely used one-stage object detection methods. It is straightforward and fast, faster than most other object detection networks. After the first version of YOLO, the author published YOLOv2 (also titled YOLO9000) and YOLOv3. I will start from the most basic version.

The idea of YOLO is to simplify the object detection job into recognizing grid cells on the image. Each object is assigned to the grid cell that contains the center of its box. If we can label the grid cells, we can detect the objects.
The image above shows the logic. If the center of a ground truth box falls inside a grid cell, that cell is labeled with the corresponding box. After that, the detailed information of the box is regressed with respect to the location of the cell. For each image the network predicts S × S (7 × 7) grid cells. For each cell, the network predicts 2 boxes and C class scores. For each box, the network predicts 4 coordinates and 1 confidence score. The confidence score tells whether this box contains an object. The class scores tell us the probability of each class. The 4 coordinates are (x, y, w, h): x and y are the center point relative to the bounds of the grid cell, and w and h are the width and height of the box relative to the whole image.
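To make the labeling concrete, here is a minimal sketch of how a single ground truth box could be turned into a training target for one cell. It assumes a numpy array of shape S × S × (5 + C); the function name encode_box and the channel layout are illustrative, not the paper's code:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def encode_box(box, class_id, img_w, img_h, target=None):
    """Assign a ground truth box (xmin, ymin, xmax, ymax) to the grid cell
    that contains its center, and write the YOLO-style label into target."""
    if target is None:
        target = np.zeros((S, S, 5 + C))
    xmin, ymin, xmax, ymax = box
    # Box center, width and height as fractions of the whole image.
    cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
    w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
    # Grid cell that contains the center point.
    col, row = int(cx * S), int(cy * S)
    # x and y are offsets of the center inside that cell, in [0, 1].
    x, y = cx * S - col, cy * S - row
    target[row, col, 0:4] = [x, y, w, h]
    target[row, col, 4] = 1.0              # confidence: this cell holds an object
    target[row, col, 5 + class_id] = 1.0   # one-hot class probabilities
    return target
```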
So now we have the output size: S × S × (B ∗ 5 + C). In this case, it is 7 × 7 × (2 × 5 + 20), because in the paper the author evaluates on the PASCAL VOC dataset, which has 20 classes. This means we propose 7 × 7 × 2 = 98 boxes for an image, and for each box we calculate whether it contains an object and the probability of each class. That is enough to describe the boxes on the image. The next question is how we end up with this output.
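Going the other way, a rough sketch of decoding the 7 × 7 × 30 output back into boxes could look like the following. The exact ordering of the 30 numbers per cell (boxes first, then class scores) and the threshold value are assumptions for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20

def decode(pred, conf_thresh=0.2):
    """Turn an S x S x (B*5 + C) prediction into a list of boxes.
    Assumes pred[..., :B*5] holds (x, y, w, h, conf) per box and
    pred[..., B*5:] holds the C class scores of the cell."""
    pred = pred.reshape(S, S, B * 5 + C)
    boxes = []
    for row in range(S):
        for col in range(S):
            class_probs = pred[row, col, B * 5:]
            for b in range(B):
                x, y, w, h, conf = pred[row, col, b * 5: b * 5 + 5]
                scores = conf * class_probs          # class-specific confidence
                cls = int(np.argmax(scores))
                if scores[cls] < conf_thresh:
                    continue
                # Recover the box center as a fraction of the image.
                cx, cy = (col + x) / S, (row + y) / S
                boxes.append((cx, cy, w, h, cls, float(scores[cls])))
    return boxes
```

In practice these 98 candidates are further filtered with non-maximum suppression.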
Here is the network structure of YOLO v1:

Compared to the multi-scale anchor structure of SSD and the RPN + ROI pooling structure of Faster R-CNN, the structure of YOLO v1 is the simplest. It is quite deep, but there are no concatenations, shortcuts, or other complex structures. It looks like another version of VGG-19; the only difference is that the output is bounding boxes and probabilities instead of a single classification.
The only thing I want to mention is the two fully connected layers at the end of the network. Everything else is just normal convolution and max-pooling layers. These two layers are not like VGG's, which simply flatten the feature map and connect it to another fixed-length layer; they can be viewed as fully convolutional operations. At this point we have a tensor of size (batch × 7 × 7 × 1024). We convolve it with 4096 filters of size 7 × 7, with no padding, and get a tensor of size (batch × 1 × 1 × 4096). This is equivalent to one fully connected operation. Then we apply a de-convolution (transposed convolution) with a 7 × 7 filter and 30 output channels, which brings us to the output size we want, (batch × 7 × 7 × 30). This plays the role of the other fully connected operation. The same trick of replacing fully connected layers with convolutions is used in SSD as well.
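The same mapping can also be written with plain fully connected layers, which are equivalent to the 7 × 7 convolution view above. Below is a minimal sketch, assuming PyTorch and a (batch, 1024, 7, 7) backbone output; the layer sizes follow the paper, but everything else (names, activation placement) is illustrative:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes

# Minimal sketch of the YOLO v1 head (not the paper's code).
head = nn.Sequential(
    nn.Flatten(),                          # (batch, 7*7*1024): flatten the feature map
    nn.Linear(7 * 7 * 1024, 4096),         # first "fully connected" operation
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),  # second operation, back to 7*7*30 numbers
)

features = torch.randn(1, 1024, 7, 7)      # stand-in for the backbone output
out = head(features).view(-1, S, S, B * 5 + C)
print(out.shape)                           # torch.Size([1, 7, 7, 30])
```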

This is the loss function of YOLO v1. It is quite intuitive and has three parts: the coordinate loss, the object loss, and the class loss. The coordinate loss is the sum of squared differences of x, y, the square root of w, and the square root of h. The object loss is the sum of the object and no-object confidence losses; it gives a lower punishment when there is no object in the box. The class loss is the sum of squared errors of the probability of each class.
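For reference, the loss as written in the paper (with λ_coord = 5 and λ_noobj = 0.5; the indicator 1_ij^obj means box j of cell i is responsible for an object):

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$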