In my article about object detection approaches, I mentioned a few ways to handle object detection and introduced the YOLO model, which uses a single neural network to localize and classify objects simultaneously. It treats detection as a regression problem by dividing the image into a grid. To me, it seemed like a really simple but innovative approach, and I wanted to try it out myself. So I decided to rewrite my live gender classifier to use only one NN to detect and classify faces in a single pass. YOLO has more details than this, but for now this is the only part I decided to focus on. In this article, I will give an overview of my project and how I implemented it.

# Outline

In my previous article, I used an OpenCV cascade classifier to detect faces and a trained neural network to classify them. That model was limited to the front view of the face and did not work on other views. This project tries to remove this limitation by using a YOLO-style regression model.

The aim of the project was to gain experience with this new approach, NOT to get good detection or classification accuracy. We will see how classification by this model compares to individual models (with similar architecture) trained only to classify already-detected faces, instead of detecting and classifying faces at the same time as done in YOLO and in this project.

I will go over all the steps of the project but will not walk through the project code here. The complete code for this project can be found here.

# Dataset

I am using the IMDB-WIKI face dataset. The dataset is around 250 GB with about 520,000 images! As the name implies, it was gathered from celebrity images on the IMDB and Wikipedia websites. The images are provided as is, that is, they have not been cropped or pre-processed (although cropped face images are available for anyone who needs them). Each image has metadata associated with it, provided in a separate file. The metadata contains the date when the photograph was taken, the date of birth of the celebrity, the face location (as detected by an automated algorithm), and gender.

The dataset is quite big and the files are provided in parts as different downloads. There are 10 parts for IMDB images and 1 for wiki images. For our toy project, we don’t really need so much data but we can’t test anything with only a few images either. So I used only 1 part of IMDB images (part 0), which is around 27 GB and contains sufficient images for our use case.

## Data Pre-processing

Before we can use the dataset, we need to process it to extract the metadata, remove corrupted or unrequired images, and convert the data into required dimensions and format. I preprocessed the data and merged all the required details in a small set of files. My program then uses these files instead of raw data. Let’s go over the details.

You can find the complete script used for preprocessing here. Call the script with the -h option to see all available options. The script produces .tfrecord files, described in detail below. It also provides an option to produce cropped faces, which I used to train a NN for comparison.

The dataset is provided in tar files. We can either extract the tar or process the tar file directly using Python's tarfile module. The metadata is provided in a .mat file, which we can read with the loadmat function from the scipy.io module. It returns an object through which each column can be accessed directly as a numpy object array.

Since I am using only a part of the data, I iterate over all images in the folder or tar file instead of over the metadata file, and only process the files for which I successfully find metadata. I use the date of birth and the date when the photo was taken to compute the age. I save the age, gender and face location (top-left and bottom-right coordinates of the bounding box) along with the images. Check out my code for details.
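The age computation can be sketched as follows. This is a minimal sketch, not the code from my script: it assumes the date of birth is stored as a MATLAB serial date number and the photo date is only a year, so the photo is assumed to have been taken in the middle of that year. The function names are made up for illustration.

```python
from datetime import date


def matlab_datenum_to_date(datenum):
    """Convert a MATLAB serial date number to a Python date.

    MATLAB counts days from year 0 while Python's ordinal starts at
    year 1, hence the 366-day offset. Invalid (too small) values raise
    ValueError, which the caller can treat as missing metadata.
    """
    return date.fromordinal(int(datenum) - 366)


def age_at_photo(dob_datenum, photo_year):
    """Age in whole years, assuming the photo was taken on July 1st."""
    birth = matlab_datenum_to_date(dob_datenum)
    age = photo_year - birth.year
    # Born after July 1st: the birthday had not happened yet in the photo.
    if (birth.month, birth.day) > (7, 1):
        age -= 1
    return age
```

For example, `age_at_photo(719529, 2000)` corresponds to someone born on 1970-01-01 and photographed in 2000.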

### Filtering images

Some images are missing metadata, while others are corrupted or too small to be useful, so we need to ignore them. I remove the following images:

• Images for which gender or age is not available.
• Images with age below 10 or over 80. These are very rare, so there is not enough data to train age prediction for these groups.
• Images whose face bounding box is larger than the image itself, or has a coordinate lying outside the image.
• Images for which the face confidence score is low (the faces were located automatically by an algorithm, and the dataset includes the confidence score reported by that algorithm).
• Images for which the confidence is missing (if any).
• Images which have multiple faces (the dataset provides a second-face confidence to check for these).
• Images which are smaller than a threshold (in either dimension). Some images are invalid (or corrupted) and contain only some random white pixels. Luckily, all of them are too small, so this check takes care of them too.

Note: Although we take images with just one face, this is purely for ease of training. The way the NN is structured allows it to detect multiple faces, and because of the way we train it (by dividing the input into a grid, same as YOLO), it will detect multiple faces even if we train it on single-face images. This is because each face prediction in our model is independent of every other prediction made by the NN.

An image is removed if it fails any one of these checks. Some of the remaining images might still not be perfect (a few still have multiple faces), but these are a small minority and we can ignore them for now.
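The checks above can be combined into a single predicate. The sketch below is illustrative only: the `meta` dictionary layout and both thresholds are hypothetical, and it assumes a missing value shows up as `None` or NaN and that a non-missing second-face score means a second face was found.

```python
import math

# Hypothetical thresholds; the values used in the project are not given here.
MIN_FACE_SCORE = 1.0
MIN_IMAGE_SIDE = 100


def is_missing(value):
    """True for None or NaN, the assumed markers for missing metadata."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_image(meta, width, height):
    """Return True if an image passes every filter in the list above.

    `meta` is a hypothetical dict with keys gender, age, face_score,
    second_face_score and box = (x1, y1, x2, y2).
    """
    if is_missing(meta["gender"]) or is_missing(meta["age"]):
        return False
    if not 10 <= meta["age"] <= 80:
        return False
    x1, y1, x2, y2 = meta["box"]
    # The box must be non-empty and lie fully inside the image.
    if x1 < 0 or y1 < 0 or x2 > width or y2 > height or x1 >= x2 or y1 >= y2:
        return False
    score = meta["face_score"]
    if is_missing(score) or score < MIN_FACE_SCORE:
        return False
    # A second-face score that is present indicates multiple faces.
    if not is_missing(meta["second_face_score"]):
        return False
    if width < MIN_IMAGE_SIDE or height < MIN_IMAGE_SIDE:
        return False
    return True
```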

### Resizing images

Although I will be using a fully convolutional network, which doesn't care about input size, we still need a fixed size in order to train it with multiple images in a single batch (all images have to be the same size to form a matrix). YOLOv2 groups same-size images into a batch, so that the image size is the same within a batch but differs between batches. We won't be doing that, for simplicity.

I resized all of the images to 250x250. Since the images have different aspect ratios, I padded each image with black pixels as I resized it, so that the aspect ratio of the actual image data is preserved.

We need to process a lot of files, and we can easily speed this up by utilizing the multiple cores that most systems have. One simple way to do that would be to use threads. Unfortunately, although Python does have threads, in CPython the global interpreter lock lets only one thread run Python code at a time, so CPU-heavy work does not spread across cores and we don't really see much improvement. Instead, I use the multiprocessing module, which provides a similar API but uses processes instead of threads.

I divide the program into a few producer processes and a few consumer processes, which interact through shared queues. The main process pushes file paths to the producers, which read the image file and associated metadata. This data is then pushed to the consumers, which process and resize the image and write it to an output file in the required format. The data is written as a tfrecord file.
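The geometry of this resize can be sketched as below. It only computes the scale and padding and maps the face box into the padded image; the actual pixel resizing is left to an image library. Centering the image inside the padding is my assumption for this sketch, and the function names are made up.

```python
def letterbox_params(width, height, target=250):
    """Scale and padding that fit a width x height image into a square.

    Returns (scale, new_w, new_h, pad_left, pad_top). The longer side
    is scaled to `target`; the shorter side is padded (centered here,
    which is an assumption of this sketch).
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def map_box(box, scale, pad_left, pad_top):
    """Move a (x1, y1, x2, y2) face box into the padded image's coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)
```

The face bounding box has to go through the same transformation as the image, otherwise the training targets no longer line up with the pixels.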

### TFRecord File Format

To train our model, our program needs to read a lot of input data, and we don't want file IO to be a bottleneck. If we are using TensorFlow, one way to solve this is to use a TFRecord file. TFRecord is a file format designed by TensorFlow for sequential data. Since we only need to iterate over the data during training and don't need random access, the format suits us well. By storing everything in one file, we avoid multiple file lookups. TensorFlow also provides methods to read TFRecord files for training in an optimized way.

A TFRecord file can store any numerical data in the form of TensorFlow features, each of which is an array of numbers. So we can store 2 features per element, one for the input and one for the expected output. That way we can quickly and efficiently get both input and output during training just by iterating over this file.

To simplify things, each consumer in our code produces a separate TFRecord file. Therefore, instead of 1 big TFRecord file, we have 10 of them. This makes transferring files (if needed) easier and also makes it easy for us to separate out test data: we can simply choose 1 of them as the test file. 10 files is far fewer than the tens of thousands of image files we would otherwise need to read, so it is still efficient.

TensorFlow provides functions to easily create TFRecord files. Check out my code for more details.

# NN Model

As mentioned before, I planned to use a regression technique similar to the one introduced by YOLO. I divided my input image into a grid (not literally, but as an abstract way to interpret the output), and each grid cell predicts 1 bounding box in terms of the box center coordinates and its width and height. This is different from YOLO, which predicts multiple boxes per grid cell. YOLO has to predict boxes of different shapes because it is designed for general object detection. Here we are just detecting faces, which are mostly similar in aspect ratio, so 1 box should be enough for us.

Similar to YOLO, along with the box we also predict a confidence score, which tells us how confident the grid cell is that it contains a face. The grid cell which contains the center of a face is responsible for detecting it with high confidence. All other grid cells should have low confidence. For more details, check out the YOLO paper or my quick overview of YOLO here.

YOLO also predicts conditional probabilities: given that there is an object center in the grid cell, what is the probability that the object is of class O? We will use the same concept for face classification. I am using 2 different classifiers. One gives us the probability that a face is male or female, given that there is a face. The other gives us the probability that the person is in the age group 10-20, 20-30, 30-40, etc., given that there is a face.

Since our outputs are structured in 3D, in the form of a 2D grid with several outputs per grid cell, we can use just convolution layers (with pooling) without any fully connected layer at the end, since convolution directly gives us 3D output. A similar concept was used in YOLOv2. It had a lot of other details as well, but we will just try a simple convolution structure. So our model is a pure CNN.

To get the desired number of grid cells, we use convolutions with SAME padding and a calculated number of max-pool layers in between, such that the final output has exactly as many pixels as we have grid cells. So each output pixel represents the outputs of 1 grid cell.
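To make the "responsible cell" idea concrete, here is a small sketch of how a ground-truth face box could be turned into training targets. The 8x8 grid and the function name are assumptions for illustration, and the box is assumed to be in pixels on the 250x250 input.

```python
def encode_face(box, image_size=250, grid=8):
    """Find the responsible grid cell and the targets for one face.

    `box` is (x1, y1, x2, y2) in pixels. The cell containing the box
    center is responsible for the face; the center target is expressed
    relative to that cell (0 to 1), width and height stay in pixels.
    """
    x1, y1, x2, y2 = box
    cell = image_size / grid
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Clamp so a center on the far edge still falls in the last cell.
    col = min(int(cx // cell), grid - 1)
    row = min(int(cy // cell), grid - 1)
    return {
        "row": row,
        "col": col,
        "center": (cx / cell - col, cy / cell - row),
        "size": (x2 - x1, y2 - y1),
    }
```

Every other cell gets a confidence target of 0, and its box and class outputs are simply not penalized.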

To get the desired number of outputs per grid cell, we will be using 1x1 convolutions to reduce the number of filters to exactly the number of outputs we need.
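As a quick sanity check of this sizing, the sketch below computes the grid size for a 250x250 input with five pooling layers. It assumes each max-pool has stride 2 and uses SAME padding (so each one rounds the size up after halving), and it assumes 7 age groups (10-20 up to 70-80) and 2 gender classes. None of these numbers are taken from the final architecture; they are only my working assumptions.

```python
import math


def grid_size(input_size, num_pools, pool_stride=2):
    """Spatial size after `num_pools` stride-2 poolings with SAME padding."""
    size = input_size
    for _ in range(num_pools):
        size = math.ceil(size / pool_stride)
    return size


# Assumed per-cell outputs; the final 1x1 convolution needs this many filters.
OUTPUTS_PER_CELL = {"confidence": 1, "center": 2, "size": 2, "gender": 2, "age": 7}
num_filters = sum(OUTPUTS_PER_CELL.values())

print(grid_size(250, 5), num_filters)
```

Under these assumptions the size goes 250, 125, 63, 32, 16, 8, so the output would be an 8x8 grid with 14 values per cell.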

I used 5 max-pool layers. My final NN architecture is:

## Activation Functions

All layers of my neural network use a ReLU activation, except for the last 1x1 convolution layer. In the last layer, each filter uses a different activation based on what it is trying to predict. For each output, the activation is as follows:

• Confidence for the box: sigmoid, since the output is a score between 0 and 1.
• Width and height of the box: exponential, since I am trying to regress the absolute width and height, which can be large numbers. I tried others such as ReLU, but exponential just worked better.
• Center of the box: sigmoid, since the center is relative to the grid cell and is between 0 and 1.
• Gender classification probabilities: ReLU followed by softmax.
• Age classification probabilities: ReLU followed by softmax.
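The per-output activations listed above can be sketched for a single grid cell as follows. The channel order (confidence, center, size, 2 gender classes, 7 age classes) is an assumption made for this example, not the layout from my code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def activate_cell(raw):
    """Apply the final-layer activations to one cell's 14 raw outputs.

    Assumed order: [conf, cx, cy, w, h, gender x 2, age x 7].
    """
    return {
        "confidence": sigmoid(raw[0]),
        "center": (sigmoid(raw[1]), sigmoid(raw[2])),
        "size": (math.exp(raw[3]), math.exp(raw[4])),
        "gender": softmax([relu(v) for v in raw[5:7]]),
        "age": softmax([relu(v) for v in raw[7:14]]),
    }
```

One side effect of putting ReLU before softmax is that all negative raw scores collapse to 0, so two classes with negative scores end up with equal probability.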

## Loss function

The original YOLO used a squared-error-based loss function. I tried it, but it requires fine-tuning the weights of the different parts of the loss function (which part should impose more penalty). After a few attempts, I switched to a custom loss function, which is quite simple and worked comparatively well. I took the sum of the average loss of each output, using a popular loss function for each output based on its type. In short, my loss function is the sum of the following independent losses:

• Confidence for the box: average log loss, since it is a confidence score and we want a high penalty if the guess is totally off.
• Width and height of the box: average mean squared error, since it is a regression.
• Center of the box: average mean squared error, since it is a regression.
• Gender classification probabilities: average softmax cross entropy, which is standard for softmax-based probabilities.
• Age classification probabilities: average softmax cross entropy, which is standard for softmax-based probabilities.

The average above refers to the average of the loss over grid cells.

The loss calculated above depends on whether a cell has a face center or not, same as in YOLO. In other words, the width, height, and center losses are considered only for cells which actually have a face center. The same goes for the classification probabilities, since all of these are conditional on the face: a cell can have a face center, width or gender only if there is a face in the first place. So the only penalty we consider for cells which do not have a face is the confidence score, since they are supposed to give 0 confidence if there is no face.

Since we filtered our dataset to have images with only 1 face, all losses other than confidence are ignored for every cell except one (the one with the face center). So when taking the average for these, we are effectively dividing by 1. Averaging does nothing here except for confidence, which is a valid loss for every grid cell and hence is averaged by dividing by the number of cells. I designed a general loss function so that it stays valid even if I decide to train on images with multiple faces. With this particular dataset, we can ignore all averages other than the one for confidence.
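The structure of this loss can be sketched with NumPy as below. It is a minimal sketch and not my training code (which uses TensorFlow): the dictionary keys and array shapes are assumptions, predictions are assumed to be already activated, and the class targets are assumed to be one-hot.

```python
import numpy as np

EPS = 1e-7  # keeps log() away from zero


def masked_mean(per_cell_loss, face_mask):
    """Average a per-cell loss over the cells that contain a face center."""
    n_faces = face_mask.sum()
    if n_faces == 0:
        return 0.0
    return float((per_cell_loss * face_mask).sum() / n_faces)


def total_loss(pred, target):
    """Sum of the independent losses for one image.

    Assumed shapes for a G x G grid: confidence (G, G), center and size
    (G, G, 2), gender (G, G, 2), age (G, G, 7).
    """
    face_mask = target["confidence"]  # 1.0 where a cell holds a face center
    conf = np.clip(pred["confidence"], EPS, 1.0 - EPS)
    # Log loss on confidence, averaged over every grid cell.
    loss = float(np.mean(-(face_mask * np.log(conf)
                           + (1.0 - face_mask) * np.log(1.0 - conf))))
    # Squared error on box outputs, only for cells with a face.
    for key in ("center", "size"):
        squared_error = ((pred[key] - target[key]) ** 2).mean(axis=-1)
        loss += masked_mean(squared_error, face_mask)
    # Cross entropy on class probabilities, only for cells with a face.
    for key in ("gender", "age"):
        log_p = np.log(np.clip(pred[key], EPS, 1.0))
        cross_entropy = -(target[key] * log_p).sum(axis=-1)
        loss += masked_mean(cross_entropy, face_mask)
    return loss
```

Because of the mask, whatever a cell without a face predicts for the box or the classes has no effect on the loss; only its confidence matters.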

# Performance

To evaluate the model, I used the IOU (intersection over union) metric for bounding boxes and accuracy for classification. The model achieved the following metrics:

• IOU: 0.634
• Gender Accuracy: 0.781
• Age Accuracy: 0.341

Note: The accuracy calculation includes the cases where the model doesn't find a face. So it actually measures how many faces the model both detects and classifies correctly. If the model doesn't detect a face in the first place, that also counts against this accuracy. If we only considered the faces it did detect, the accuracy might be higher.

We can see that the model has decent performance for detection and gender classification. It does poorly in age classification, but that is not because of the approach; it is because of an inherent problem with the data. As we will see in the following section, an independent convolutional model trained only to classify faces by age also performs poorly. This is because the data is very imbalanced with respect to age: the majority of the data is in the age group 30-40. This can be seen in my data exploration for this dataset here. So the model gets biased towards this group and predicts 30-40 most of the time. We need to apply techniques for imbalanced datasets to fix this, and it should not be considered a drawback of the NN model or loss function.
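For reference, the IOU between a predicted and a ground-truth box can be computed as in the sketch below. Boxes are assumed to be (x1, y1, x2, y2) tuples with the top-left and bottom-right corners; the function name is just for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap along each axis; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union > 0 else 0.0
```

Two 2x2 boxes offset by 1 pixel in each direction overlap in a 1x1 square, which gives an IOU of 1/7.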
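One common technique for such an imbalance is to weight each class inversely to its frequency in the loss. I did not use this in the project; the sketch below only shows how such weights could be derived from the training labels, with a made-up function name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and common classes below 1, so a
    weighted loss pays more attention to the rare age groups.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}
```

For instance, with three samples of "30-40" and one of "70-80", the rare group gets a weight of 2.0 and the common group a weight of about 0.67.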

Now let’s look at how it compares to normal (non-YOLO style regression) classification models which are trained only to classify a face as compared to detecting as well as classifying a face.

# Model Comparison

I trained independent convolution NN models with similar architecture, activations and corresponding loss to just classify a face instead of trying to do all things at once. Using separate models will be slow as we will need a different model to detect the face so that these models can classify it. In comparison, our YOLO based approach does everything at once and hence is easier to train and is faster. Here we will see how they compare just in terms of classification performance.

For these classification models, I used the same preprocessing script but to get cropped faces instead of the full image. These models will then classify these faces.

These classification models have fewer layers (they are working on a simpler problem and performed better with fewer layers). Even so, our single model will run faster, as it still has fewer layers and filters than 3 different models for the 3 tasks combined.

These independent models achieved the following metrics:

• Gender classification model accuracy: 0.834
• Age classification model accuracy: 0.364

We see here that these models performed better than our single model. However, we should keep in mind that these models were given correct faces, whereas our model had to detect faces before classifying them. Some faces were misclassified by our model only because they were not detected correctly, so if its face detection improves, its classification performance will improve automatically. The reverse applies to the independent models: here we passed them the ground-truth face crops. Had an actual face detection model been in place, not every face would have been detected correctly, and the classification accuracy would drop. So in a real scenario, their accuracy will be lower than this.

Also, we can see here that even the independent model performs poorly for age classification. As explained above, this is due to an imbalanced dataset with respect to age and affects both our models.

In conclusion, even though the independent models performed better, their accuracy will drop if we use a face detection algorithm to first detect faces. Hence, for a complete detection + classification pipeline, the performance of these models is not drastically better than that of our single model, while our model is faster and hence better suited to real-time use cases. This suggests that the approach used in this project works and is applicable in such scenarios.

# Next Steps

We have seen how the YOLO-style regression approach is applied for detection, and the comparison suggests that the approach can be used successfully in detection + classification scenarios. The aim of the project was NOT to get good accuracy, so I neglected a lot of details and techniques that could have been used to get better results. Some of them are mentioned and used in the original YOLO architecture, but we chose to ignore them in this project. A few of the things that could or should have been used are:

• Experimenting with the NN parameters to improve results
• Predict the width and height of the bounding box relative to the image size. This restricts the output range to between 0 and 1 and was done in YOLO.
• Predict $$\sqrt{width}$$ and $$\sqrt{height}$$ instead of width and height, so that the bounding box error for large boxes is not penalized as heavily as for small boxes (an error that is acceptable for a large box may not be acceptable for a small one). This was done in YOLO.
• Use techniques such as weighted loss or oversampling to handle imbalanced dataset with respect to age.
• Use more data (we used only a part of available data).
• Pre-train the network on a related classification task before using it for this task. This approach is suggested and used in YOLO.
• Either use the previous layer's filters (similar to YOLOv2) or use a fully connected layer (similar to YOLO) for the final prediction. Right now the final layer doesn't use context from every pixel of the input image.
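As a small illustration of the square-root suggestion in the list above, the sketch below compares the penalty for the same 5-pixel width error on a large box and on a small box. The numbers are made up for the example.

```python
import math


def squared_error(pred, true):
    return (pred - true) ** 2


def sqrt_squared_error(pred, true):
    """Squared error between square roots, as suggested for width and height."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2


large_box = (205.0, 200.0)  # predicted width, true width
small_box = (25.0, 20.0)

# Plain squared error cannot tell the two cases apart.
print(squared_error(*large_box), squared_error(*small_box))
# With square roots, the small box is penalized more for the same error.
print(sqrt_squared_error(*large_box), sqrt_squared_error(*small_box))
```

The plain squared error is 25 in both cases, while the square-root version gives a noticeably larger penalty for the small box.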