Faster-rcnn: Basic structure for two-stage object detection
Updated: Mar 14, 2019
Running a convolutional network separately on thousands of candidate regions of the raw image is too slow for object detection. Instead, detectors usually compute one CNN feature map for the whole image and detect objects on that. In a two-stage detector, the network first proposes n regions on the feature map and then classifies and refines each of them. Faster-rcnn is the two-stage detector that brought this pipeline close to real time: the paper reports about 5 fps with VGG-16 on a GPU.
This is the overall structure of Faster-rcnn. It has some similarities with SSD: both take a CNN feature map as input, and both pass the predicted boxes through non-maximum suppression (NMS) to remove redundant detections. However, unlike SSD, its region proposal network (RPN) proposes regions from a single feature map rather than from multiple levels of feature maps. Faster-rcnn also has fully connected (fc) layers, which SSD does not. It consists of two parts: the RPN and the ROI features. The two parts share the same convolutional feature map but have separate heads. The RPN proposes regions that may contain objects. The ROI features part classifies the objects and gives their final locations.
Region proposal network (RPN)
Faster-rcnn is built on Fast-rcnn. The main difference is that Faster-rcnn adds the RPN before the region of interest (ROI) pooling step, so the proposals come from the network itself instead of an external method such as selective search.
This is the structure of the RPN. The candidate boxes are defined by anchors on the feature map. A set of k anchors is centered at every position of the VGG feature map (the paper uses 3 scales and 3 aspect ratios, so k = 9). Of course you can use other CNNs' feature maps. A 3*3 sliding window runs over the feature map and maps each position to a lower-dimensional vector (256-d for ZF, 512-d for VGG). This sliding window is simply another conv layer. Its output is fed into two sibling 1*1 conv layers, which behave like fully connected layers shared across all positions. One is for classification. It tells whether each anchor contains an object. For every position there are 2k scores (object and not object for each of the k anchors). The other is for the coordinates. It outputs 4k values, four per anchor: the first two encode the center of the box, and the other two encode its width and height, all expressed relative to the anchor.
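To make the anchor counts concrete, here is a minimal sketch of how the anchors could be laid out. The scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) follow the paper; the stride of 16 matches VGG-16. Placing anchors at cell centers, the 60*40 feature map size, and the function names base_anchors and all_anchors are my own assumptions for illustration, and real implementations differ in these details.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (w, h).

    Each shape has area scale**2 and aspect ratio h / w = ratio.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


def all_anchors(feat_w, feat_h, stride=16):
    """Center every anchor shape on every feature map cell.

    Returns (cx, cy, w, h) tuples in image pixels. Using the cell
    center as the anchor center is an assumption of this sketch.
    """
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            for w, h in base_anchors():
                anchors.append((cx, cy, w, h))
    return anchors


k = len(base_anchors())
anchors = all_anchors(feat_w=60, feat_h=40)
cls_outputs_per_position = 2 * k  # object / not object per anchor
reg_outputs_per_position = 4 * k  # (tx, ty, tw, th) per anchor
```

With a 60*40 feature map this gives 60 * 40 * 9 = 21600 anchors, which is in line with the "roughly 20000" quoted in the paper for a 1000*600 image.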
One major contribution of the RPN is that it is translation invariant, both in the anchors and in the functions that compute proposals relative to the anchors. It means that if an object is translated in the image, the proposal should translate with it, and the same function should be able to predict it at either location. Compare this to MultiBox, which uses k-means to generate 800 anchors that are not translation invariant. The RPN also needs only one forward pass at test time, and it shares its convolutional features with the detector. In the paper the authors say that the RPN has fewer parameters than MultiBox: the RPN output layer has about 2.8*10^4 parameters with VGG-16, while the MultiBox output layer has about 6.1*10^6 with GoogLeNet. The two numbers come from different backbones and very different anchor counts, so I think we should read this comparison with some care. However, the RPN is indeed a quick method, and it is better than the k-means anchors.
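The parameter comparison in the paper is between the output layers of the two methods, and the arithmetic is easy to check. The figures below (512-d features, 9 anchors, 4 + 2 outputs for the RPN; 1536-d features, 800 anchors, 4 + 1 outputs for MultiBox) are the ones quoted in the paper; the variable names are mine.

```python
# RPN output layer on VGG-16: 512-d feature, k = 9 anchors,
# 4 box values + 2 class scores per anchor.
rpn_params = 512 * (4 + 2) * 9

# MultiBox output layer on GoogLeNet: 1536-d feature, 800 anchors,
# 4 box values + 1 confidence score per anchor.
multibox_params = 1536 * (4 + 1) * 800

ratio = multibox_params / rpn_params
print(rpn_params, multibox_params, round(ratio))
```

Most of the gap comes from the anchor count (9 versus 800), not from a cleverer layer design.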
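Before we go further into the loss function, we need to know which anchors are used in training. An anchor gets a positive label if its Intersection-over-Union (IoU) with a ground-truth box is larger than 0.7, or if it has the highest IoU with some ground-truth box. An anchor whose IoU is less than 0.3 for all ground-truth boxes gets a negative label. The other anchors do not influence the loss function.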
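The labeling rule can be sketched in a few lines. This is a minimal version, assuming boxes are given as (x1, y1, x2, y2) corners; the function names iou and label_anchors and the label values 1 / 0 / -1 are my own convention, not something fixed by the paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # between the thresholds: not used in the loss
    # The anchor with the highest IoU for each ground-truth box is also
    # positive, even when that IoU is below pos_thr. Ties are broken by
    # taking the first such anchor (a simplification of this sketch).
    for g in range(len(gt_boxes)):
        col = [row[g] for row in ious]
        if col and max(col) > 0:
            labels[col.index(max(col))] = 1
    return labels


gt = [(0, 0, 10, 10)]
candidates = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 5), (20, 20, 30, 30)]
labels = label_anchors(candidates, gt)
```

The second rule matters when no anchor reaches 0.7 for some object: without it, that object would have no positive anchor at all.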
The loss function is quite intuitive. It is the sum of the classification loss and the box regression loss, with a weight to balance the two terms. The classification loss is the log loss over two classes (object or not object). The box regression loss is the smooth L1 loss, and it is only counted for positive anchors. It will be explained later.
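As a quick preview, the smooth L1 loss and the box parameterization it is applied to can be sketched as below. The formulas follow the Fast-rcnn and Faster-rcnn papers: boxes are (center x, center y, width, height), and the targets are offsets relative to the anchor. The function names encode, smooth_l1 and box_regression_loss are just illustrative.

```python
import math


def encode(box, anchor):
    """Regression targets (tx, ty, tw, th) of a box relative to an anchor.

    Both boxes are (cx, cy, w, h): center coordinates, width, height.
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return (
        (x - xa) / wa,      # center shift, scaled by anchor width
        (y - ya) / ha,      # center shift, scaled by anchor height
        math.log(w / wa),   # log-scale change in width
        math.log(h / ha),   # log-scale change in height
    )


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_regression_loss(pred_t, target_t):
    """Sum of smooth L1 over the four box parameters of one anchor."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_t, target_t))


anchor = (100.0, 100.0, 128.0, 128.0)
target = encode((100.0, 100.0, 128.0, 128.0), anchor)  # box equals anchor
loss = box_regression_loss((0.5, 0.0, 0.0, 2.0), target)
```

The linear part is what makes the loss less sensitive to outliers than a plain L2 loss: a prediction that is off by 2.0 costs 1.5 here instead of 2.0 with 0.5 * x^2.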
Region of interest (ROI) features
After we roughly know where the objects are, we can go one step further: give a precise location for each object and decide which class it belongs to.