In my previous article, I built live face detection with gender recognition using two different models. We used an OpenCV-provided cascade classifier to detect faces, and our trained Neural Network to classify the detected faces. But the cascade classifier is quite weak and detects faces only in frontal view. We can combine more models for side views, but it still will not be as good as an NN-based approach. A simple fix would be to use a more powerful model to detect faces and keep the same gender classifier for the detected faces, but this type of approach doesn’t always work, especially in complex scenarios like object detection. In this article, we will look at a possible solution and how we can create a single NN structure for such scenarios.

Classifier based detection approach

We can very well solve our problem using two Neural Network models, one trained to detect faces and the other to classify them. But first, let’s think about how the first model will work. How will it detect faces?

Sliding window classifier

A simple and previously popular method is to train a model to classify an image (a small patch of the image, to be exact) as face vs non-face. Let’s assume we trained the model to classify images of size SxS. To convert this classifier into a detector, we run it on various parts of the image. We start from the top left corner and scan every SxS part of the image to check whether it contains a face. Wherever our model classifies the patch as a face, there is a face (with some probability). But what if the face in the image is bigger than SxS (our classifier size)? To detect such faces, we resize the image to a smaller scale (after applying a bit of Gaussian blur to reduce aliasing) and repeat the same process. We do this for several scales to detect faces of different sizes.
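The scan-and-rescale procedure above can be sketched in a few lines. This is only a minimal illustration, not the code from the previous article: the image is a grayscale grid stored as plain Python lists, `mostly_bright` is a hypothetical stand-in for a trained face classifier, and `halve` averages 2x2 blocks as a crude substitute for blur plus resize.

```python
def sliding_windows(image, size, stride=1):
    """Yield (x, y, patch) for every size x size patch, starting at the top left."""
    height, width = len(image), len(image[0])
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            patch = [row[x:x + size] for row in image[y:y + size]]
            yield x, y, patch


def halve(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4
            for x in range(width)
        ]
        for y in range(height)
    ]


def detect(image, classify, size, stride=1):
    """Run the patch classifier at every location and at successively halved scales.

    Returns boxes as (x, y, side) in the coordinates of the original image.
    """
    boxes = []
    scale = 1
    while len(image) >= size and len(image[0]) >= size:
        for x, y, patch in sliding_windows(image, size, stride):
            if classify(patch):
                boxes.append((x * scale, y * scale, size * scale))
        image = halve(image)
        scale *= 2
    return boxes


def mostly_bright(patch):
    """Hypothetical stand-in classifier: 'face' means the patch is almost all white."""
    values = [value for row in patch for value in row]
    return sum(values) / len(values) > 0.9


# Toy 16x16 image with an 8x8 white square in the top-left corner.
image = [[1.0 if x < 8 and y < 8 else 0.0 for x in range(16)] for y in range(16)]
boxes = detect(image, mostly_bright, size=4, stride=4)
```

On this toy image, the 4x4 classifier fires on four small patches at the original scale, and only after the image is halved does one window cover the whole 8x8 square, which is reported as the box `(0, 0, 8)`.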

Note: As the name suggests, the OpenCV cascade classifier is in fact a classifier-based approach, which we used to detect faces.

Problems with sliding window

Since a face can be at any location, and the classifier has to be on the face to detect it, we have to run the classifier at every possible location. So if we have a 320x320 image and our classifier works on 20x20 patches, sliding one pixel at a time means running the classifier (320-20+1)x(320-20+1) times, roughly 90,000 times, for each scale. If we try 3 different scales, we are running our classifier about 270,000 times! Even then we might miss some faces, as we are checking only 3 scales. You can clearly see why this doesn’t work for real-time systems. We might get it to work for our specific use case if we use very simple models as the classifier, but it clearly doesn’t work with complex models for complex tasks like object detection.
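To make the arithmetic concrete, here is a small helper that counts window positions. The function name and the `stride` parameter are my own additions for illustration, and, like the estimate in the text, it treats every scale as if it were the full 320x320 image, so the total is an upper bound.

```python
def window_count(image_side, window_side, stride=1):
    """Number of window positions in a square image for a square window."""
    per_axis = (image_side - window_side) // stride + 1
    return per_axis * per_axis


per_scale = window_count(320, 20)                 # 301 * 301 positions
total_three_scales = 3 * per_scale                # three scales, same size assumed
with_stride_4 = window_count(320, 20, stride=4)   # skipping pixels reduces the count

print(per_scale, total_three_scales, with_stride_4)
```

The exact count with a one-pixel step is 90,601 per scale and 271,803 for three scales, slightly above the rounded figure. A larger stride cuts the count sharply (5,776 positions at a stride of 4), but at the cost of possibly stepping over a face.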

Classifier with region selection

A solution to the above problem is to use another model or a smart algorithm to find the areas of interest and run our classifier only on these areas. So instead of running our classifier at every possible location, we use a model or some heuristics to find the locations which might contain an object (or whatever we are trying to detect). This significantly reduces the number of times we have to run our classifier, but it introduces another model into the pipeline. Now we need not only a good classifier but also a good, finetuned method for finding the regions to run it on.
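A rough sketch of such a two-stage pipeline is shown below. Everything here is hypothetical and only meant to show the structure: `bright_bounding_box` is a toy heuristic standing in for a real region proposal method, and `mostly_bright` stands in for a trained classifier.

```python
def detect_with_proposals(image, propose_regions, classify):
    """Run the classifier only on the regions suggested by the proposal step."""
    detections = []
    for x, y, w, h in propose_regions(image):
        patch = [row[x:x + w] for row in image[y:y + h]]
        if classify(patch):
            detections.append((x, y, w, h))
    return detections


def bright_bounding_box(image, threshold=0.5):
    """Toy proposal heuristic: one box around all pixels brighter than the threshold."""
    points = [
        (x, y)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if value > threshold
    ]
    if not points:
        return []
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [(min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)]


classifier_calls = []


def mostly_bright(patch):
    """Hypothetical stand-in classifier that also records how often it is called."""
    classifier_calls.append(1)
    values = [value for row in patch for value in row]
    return sum(values) / len(values) > 0.9


# Toy 16x16 image with an 8x8 white square in the top-left corner.
image = [[1.0 if x < 8 and y < 8 else 0.0 for x in range(16)] for y in range(16)]
detections = detect_with_proposals(image, bright_bounding_box, mostly_bright)
calls_with_proposals = len(classifier_calls)

# If the proposal step returns nothing, the classifier never gets a chance.
missed = detect_with_proposals(image, lambda img: [], mostly_bright)
```

Here the classifier runs once instead of once per window position, but the last line also shows the weak point: the detector can never find what the proposal step fails to suggest.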

One popular algorithm that uses a similar method for object detection is R-CNN. R-CNN uses selective search to get “region proposals”. It then uses a Convolutional Neural Network to classify the objects in the proposed regions. This is followed by some post-processing models to finetune the predictions. This is just a rough overview of the approach; for details, feel free to read the paper.

Problem with region selection

As mentioned above, we are introducing another model or algorithm that we need to train or finetune. No matter how good our classifier is, it won’t detect anything unless our region selection method selects the appropriate regions to check. Also, since we need to run multiple models for each image, it’s difficult to achieve real-time performance unless the models are very fast. Although this method is faster than the previous brute-force approach, it is still not fast enough for real-time applications when using complex and time-consuming models.

The Solution

The solution to the above problems seems very obvious: just have one model do everything. That way we can optimize the model directly to detect and classify objects, without optimizing and integrating each part separately. It is also faster, since we need to run only one model instead of several (unless that single model is computationally more expensive than the multiple models combined). But how exactly can we do that?


If the answer were that obvious, we wouldn’t be struggling here. This is where YOLO comes in. YOLO does exactly that: it optimizes a single NN model for the final output, that is, the location as well as the class of objects. It looks at the problem from a different perspective, and once you understand the approach it will seem obvious. The model is very simple. There are 3 main points in the model:

  • Instead of classification (like we were doing with sliding window and region selection), look at the problem as a regression problem: given an image, we want to regress the bounding box (center and size) for objects. It is that simple!
  • To simplify the problem and have a fixed output, divide the image into a grid. Each grid cell predicts B bounding boxes, each with a confidence c. Each cell is responsible only for regressing the bounding boxes of objects whose center lies in that particular grid cell. We only consider bounding boxes which have high confidence. So grid cells which don’t contain any object center are free to give any B bounding boxes, as long as the confidence for each bounding box is close to 0.
  • Along with the bounding box regression, each cell has C classification outputs which give the conditional probabilities for each class. That is, given that there is an object, what is the probability that it is of class x? You can also consider these as regression where you are trying to regress probabilities; it just depends on the way you look at it.
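The three points above translate into a fixed-size output: one vector of B x 5 + C numbers per grid cell (five numbers per box for center x, center y, width, height and confidence, plus C class probabilities). The sketch below shows one way to build such a target for training and to read detections back out. It is a simplified illustration under my own assumptions, not the exact scheme from the paper: the parameter names (`grid`, `boxes_per_cell`, `num_classes`) are mine, only the first box slot of a cell is filled, the target confidence is simply set to 1, and decoding thresholds on confidence alone.

```python
def encode_targets(objects, grid=7, boxes_per_cell=2, num_classes=2):
    """Build a grid x grid x (boxes_per_cell * 5 + num_classes) regression target.

    objects: list of (cx, cy, w, h, class_id) with coordinates normalised to [0, 1].
    """
    cell_length = boxes_per_cell * 5 + num_classes
    target = [[[0.0] * cell_length for _ in range(grid)] for _ in range(grid)]
    for cx, cy, w, h, class_id in objects:
        # The cell that contains the object's center is responsible for it.
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        cell = target[row][col]
        # Simplification: fill only the first box slot, with confidence 1.
        # The center is stored as an offset inside the cell, the size relative to the image.
        cell[0:5] = [cx * grid - col, cy * grid - row, w, h, 1.0]
        # Conditional class probabilities: one-hot for the true class.
        cell[boxes_per_cell * 5 + class_id] = 1.0
    return target


def decode(output, boxes_per_cell=2, threshold=0.5):
    """Keep only confident boxes and convert them back to image coordinates."""
    grid = len(output)
    detections = []
    for row in range(grid):
        for col in range(grid):
            cell = output[row][col]
            class_probs = cell[boxes_per_cell * 5:]
            best_class = max(range(len(class_probs)), key=class_probs.__getitem__)
            for b in range(boxes_per_cell):
                x_off, y_off, w, h, confidence = cell[b * 5:b * 5 + 5]
                if confidence >= threshold:
                    detections.append(
                        ((col + x_off) / grid, (row + y_off) / grid, w, h, best_class)
                    )
    return detections


# One object centered in the image, 20% wide and 30% tall, of class 1.
target = encode_targets([(0.5, 0.5, 0.2, 0.3, 1)])
responsible_cell = target[3][3]
detections = decode(target)
```

With a 7x7 grid, B = 2 and C = 2, the whole output is 7 x 7 x 12 = 588 numbers regardless of how many objects are in the image. All cells other than `target[3][3]` stay at zero confidence, so `decode` ignores them.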

That’s it! It is that simple. Of course, there are more details to the model, but these are the key ideas that YOLO introduces. Feel free to read the paper here. The paper is simple to understand, but if you prefer a less formal explanation, just google it. YOLO is a breakthrough and it is all over the internet. I think the article here covers most of the details. A lot of further improvements have been made to YOLO, and if you want to go into the details of each YOLO variation, have a look at the article here.

My work using this approach

I always wanted to try this, so I worked on a simple project which replaces the OpenCV model and the NN model in my live gender recognition with a single regression model, which is one of the key concepts of YOLO. You can read more about it here.