Xuehao Liu

# Traditional object detection/image classification method

Updated: Feb 7, 2019

We now know that if we want to analyze images, it is better to train a CNN. It is powerful, and you do not even have to know what is going on inside to get a pretty decent number. But before the birth of CNNs, how did people analyze images?

The answer is **gradients**, and it is intuitive: we can recognize an object because the color along its contour differs from the background, which means the color gradient there is large. Many methods were proposed based on this observation. I classify them into two categories:

- Pure gradient
- Haar feature

Before we go into the details of any method, I need to point out that the traditional way of classification/detection is to **(1)** extract a feature, and **(2)** push the feature into a classifier. All of the methods follow this logic. Some may have better feature extraction; some may use a different kind of classifier, such as an SVM or AdaBoost. In the end, the choice of classifier will not make a big difference in the final score unless you improve the classifier itself. But that improvement is another story, and it is not the main focus of object detection.

**SIFT** (Scale-Invariant Feature Transform)

This must be the most well-known traditional object detection method. It is based on gradients.

The whole idea is to find the key points of two images and match them. The advantage of this method is that it is scale-invariant: you can twist the object, rotate it, or scale it, and the system will still find the key points.

I put this first because it is the earliest method. And it is not just a key point finder; it is a descriptor. You can use this idea to detect whatever object you want: people have used SIFT to classify car logos, others have applied PCA to SIFT features, and SIFT has even been extended to motion detection as 3D SIFT. But of course, nowadays people would not use it as the main method for motion detection.

So how does it work?

First we need to find the key points. We can follow these steps:

1. Blur the image to different degrees.
2. Scale the image to different degrees.
3. Subtract neighboring blur degrees within the same scale degree.
4. Find the maxima and minima in the subtraction results.

In this way, the feature is scale-invariant. It will look like this:
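The steps above can be sketched in code. This is a minimal illustration, not the full SIFT implementation: it handles only one scale level (no image pyramid), uses a hypothetical list of sigmas, and skips the sub-pixel refinement and edge-response filtering that real SIFT performs.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.03):
    """Find difference-of-Gaussians extrema: blur the image at several
    sigmas, subtract neighboring blur levels, then keep pixels that are
    the max or min of their 3x3x3 neighborhood in the DoG stack."""
    blurred = np.stack([gaussian_filter(img.astype(float), s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]          # subtraction between neighbor degrees
    keypoints = []
    for k in range(1, dog.shape[0] - 1):      # interior DoG layers only
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                patch = dog[k-1:k+2, y-1:y+2, x-1:x+2]
                v = dog[k, y, x]
                if abs(v) > thresh and (v == patch.max() or v == patch.min()):
                    keypoints.append((y, x, sigmas[k]))
    return keypoints
```

Each detected key point carries the sigma at which it was found, which is what makes the response scale-aware.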

So now we have the key points, what is the descriptor?

For every key point, we examine the gradients of the 16×16 pixels around it, split into a 4×4 grid of cells. For each cell we build an 8-bin gradient-orientation histogram, giving a 4×4×8 = 128-dimensional descriptor for the key point.
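A simplified version of that descriptor can be written as follows. Note this sketch omits parts of real SIFT (rotation normalization to the dominant orientation, Gaussian weighting, and trilinear interpolation between bins); it only shows the 16×16 patch → 4×4 cells → 8-bin histograms → 128-d vector structure.

```python
import numpy as np

def sift_descriptor(img, y, x):
    """Simplified 128-d SIFT-style descriptor: take the 16x16 patch around
    (y, x), split it into a 4x4 grid of 4x4-pixel cells, and build an 8-bin
    gradient-orientation histogram per cell."""
    patch = img[y-8:y+8, x-8:x+8].astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)           # orientation in [0, 2*pi)
    bins = (ang / (2 * np.pi) * 8).astype(int) % 8   # 8 orientation bins
    desc = np.zeros(128)
    for cy in range(4):
        for cx in range(4):
            cell = 8 * (4 * cy + cx)                 # offset of this cell's histogram
            for b, m in zip(bins[cy*4:(cy+1)*4, cx*4:(cx+1)*4].ravel(),
                            mag[cy*4:(cy+1)*4, cx*4:(cx+1)*4].ravel()):
                desc[cell + b] += m                  # magnitude-weighted vote
    return desc / (np.linalg.norm(desc) + 1e-8)      # normalize to unit length
```

The final normalization makes the descriptor robust to uniform illumination changes.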

This may be going too fast; you can go to __this__ website for more information.

**Viola-Jones**

VJ follows a different logic from SIFT. It is based on sliding windows, and it detects Haar features.

So what is Haar feature?

A Haar feature is a kind of filter. The arrangement of the feature-extracting areas is rather flexible. The calculation is just the sum of the white area minus the sum of the black area, where the black and white areas together cover the whole sliding window.
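Computed naively, a two-rectangle Haar feature is just two sums and a subtraction. This sketch shows one hypothetical "edge" feature layout (white half on the left, black half on the right) with direct pixel summation; the integral-image trick described next removes the per-window summation cost.

```python
import numpy as np

def haar_two_rect_naive(img, y, x, h, w):
    """Naive two-rectangle (edge) Haar feature at (y, x): the left h-by-w
    half is the 'white' area and the right h-by-w half is the 'black' area;
    the feature value is sum(white) - sum(black), via direct summation."""
    white = img[y:y+h, x:x+w].sum()
    black = img[y:y+h, x+w:x+2*w].sum()
    return white - black
```

A strong response means the left half is much brighter than the right half, i.e. a vertical edge.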

So what is the next step? For every sliding window we calculate the Haar features. You may object that summing the pixels for every sliding window costs too much time. However, we only need to do the summation once, using the idea of dynamic programming. Here is an overview of the method.

We do this calculation on the image: every pixel stores the sum of the pixel values in the area above and to its left. So if we want the sum of area D, we just compute 4 − 2 − 3 + 1 in the picture above, which is O(1). You may say we still need O(n) time to compute the values at 4, 2, 3, and 1; however, we do not. It is still O(1) per pixel. Here is how we do that:

The above mimics the original pixel values in storage. For every pixel, we add the current pixel, the pixel above it, and the pixel to its left, and subtract the pixel above and to the left. Doing this calculation for every pixel, we get:

This is what we actually use when calculating Haar features. For the whole image it is O(n); for each pixel it is O(1).
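Both halves of the trick, building the table and the "4 − 2 − 3 + 1" lookup, can be sketched directly from the recurrence above:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img over the rectangle
    [0..y, 0..x], built with the recurrence
    ii = current + left + up - up-left, so O(1) work per pixel."""
    ii = np.zeros(img.shape, dtype=float)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            ii[y, x] = (img[y, x]
                        + (ii[y-1, x] if y > 0 else 0)
                        + (ii[y, x-1] if x > 0 else 0)
                        - (ii[y-1, x-1] if y > 0 and x > 0 else 0))
    return ii

def area_sum(ii, y0, x0, y1, x1):
    """Sum of the inclusive rectangle (y0, x0)-(y1, x1) in O(1):
    the '4 - 2 - 3 + 1' lookup from the text."""
    s = ii[y1, x1]
    if y0 > 0: s -= ii[y0-1, x1]
    if x0 > 0: s -= ii[y1, x0-1]
    if y0 > 0 and x0 > 0: s += ii[y0-1, x0-1]
    return s
```

With the table precomputed, every rectangle sum in every Haar feature of every sliding window is a constant-time lookup.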

So now we have Haar features that represent something in the image, and we can push them into an AdaBoost classifier. The classification strategy here may not require calculating all the Haar features: a window may be evaluated on just one or two features, get a value, and either be rejected or move on to the next level of the cascade.

Finally, after training, we will get many bounding boxes that all contain the object. We can then use NMS with a threshold to get fewer boxes.
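The NMS step is the same greedy procedure still used today, and is short enough to show in full. This is a standard sketch, assuming boxes in (x1, y1, x2, y2) form and an overlap threshold on intersection-over-union:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and discard every remaining box whose IoU with it exceeds
    iou_thresh. Boxes are (x1, y1, x2, y2) arrays."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # keep only sufficiently different boxes
    return keep
```

Lowering `iou_thresh` suppresses more aggressively and returns fewer boxes.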

The whole process is quite intuitive. Although we used some strategies to reduce computation time, detection may still take a large amount of time. Even so, the V-J method is relatively fast, with a high detection rate. Moreover, it is robust on many detection tasks, such as face detection and human detection.

**HOG**(Histogram of oriented gradients)

HOG is also calculated based on gradients. The main flow is: (1) calculate the gradients in a small area; (2) concatenate the gradient features in the sliding window into a larger feature vector; (3) push this larger vector into an SVM.

The first step: (1) calculate the gradients in a small area.

You can see the gradient at every pixel. After that, we split the whole picture into small cells (the original HOG paper uses 8×8-pixel cells, grouped into 16×16 blocks for normalization). Like this:

You see the histogram is a 9-dimensional vector; each dimension stores the total gradient magnitude in that orientation within the cell. And finally we will have this:

And this:

So now we can concatenate every 9-d vector in a sliding window, push the result into an SVM, and train it.
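The per-cell histogram stage can be sketched as below. This is a simplified version: it uses unsigned orientation (0–180°) with hard bin assignment and skips the block normalization and bin interpolation of full HOG.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell HOG sketch: compute gradient magnitude and orientation at
    every pixel, then build a 9-bin histogram of unsigned orientation
    (0-180 degrees) for each cell-by-cell block of pixels."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180       # unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    b = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    for cy in range(ch):
        for cx in range(cw):
            sl = np.s_[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            for bb, m in zip(b[sl].ravel(), mag[sl].ravel()):
                hist[cy, cx, bb] += m                # magnitude-weighted vote
    return hist.reshape(ch * cw, bins)               # one 9-d vector per cell
```

Flattening the returned array row by row gives exactly the concatenated feature vector that goes into the SVM.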

**DPM**(Discriminatively Trained, Multiscale, Deformable Part Model)

To me, DPM is like a better version of HOG.

It still uses the logic of HOG: it calculates the histogram of gradients in small areas and pushes it into an SVM.

The improvement of DPM is that it separates the whole object into small parts. For example, it will first detect the two legs, two arms, one main body, and one head. Then, if these components are in the right positions, it declares that this is a human being.

> The model for an object with n parts is formally defined by a root filter F0 and a set of part models (P1, . . . , Pn) where Pi = (Fi, vi, si, ai, bi). Here Fi is a filter for the i-th part, vi is a two-dimensional vector specifying the center for a box of possible positions for part i relative to the root position, si gives the size of this box, while ai and bi are two-dimensional vectors specifying coefficients of a quadratic function measuring a score for each possible placement of the i-th part.

This is the modeling description quoted from the DPM __paper__. The detection will look like this:
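One plausible reading of that scoring model can be sketched as follows. This is an illustration, not the paper's actual matching algorithm (which maximizes over all placements efficiently with distance transforms): it just evaluates one candidate placement as the root filter response plus each part's response minus a quadratic deformation penalty built from the ai, bi coefficients. All argument names here are hypothetical.

```python
import numpy as np

def dpm_score(root_score, part_scores, anchors, placements, a, b):
    """Illustrative DPM-style score for one hypothesis: root filter
    response plus, for each part i, its filter response minus a quadratic
    deformation cost on the displacement (dx, dy) of its placement from
    its anchor. a[i] and b[i] are the 2-d quadratic/linear coefficients."""
    total = root_score
    for i, (part, anchor, place) in enumerate(zip(part_scores, anchors, placements)):
        d = np.asarray(place, dtype=float) - np.asarray(anchor, dtype=float)
        deform = a[i] @ (d * d) + b[i] @ d      # quadratic penalty on displacement
        total += part - deform
    return total
```

When every part sits exactly at its anchor, the deformation term vanishes and the score is just the sum of the filter responses; the further a part drifts, the more the quadratic term eats into the score.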

DPM was surely the best object detection method before the CNNs.