Skip to main content

Building an object detector in TensorFlow using bounding-box regression

Object detection is both classifying and locating objects inside an image. In other words, it is a combination of image classification and object localisation. Building a machine learning model for image classification is simpler and I have described it in one of my posts here. However, an image classifier cannot tell where exactly an object is located inside an image. To achieve this we need to build a neural network which can locate an object inside the image in addition to classifying it. In this post I am going to describe how I built a neural network for object detection by solving both these problems.


The Object to detect

Since my intention was to build the model from scratch without spending too much time on data preparation, I picked the simplest possible object I could think of. My choice was to make a simple red color vision marker using a piece of cardboard. Since the vision marker was just a 2D shape, the different angles from which it can be captured is limited — hence the limited number of images required for the training set. I made the vision marker circular shaped and palm sized for simplicity and ease of handling.


Vision marker made using a piece of cardboard


Architecture of the Model

Before diving into the implementation details, I thought of describing you the architecture of the model. First of all, my objective was to infer two answers:

  • Whether or not the image has the object in it
  • Where exactly is the object located

To solve the first problem, I could use an image classifier — a convolutional neural network with two output neurones. One of those output neurones could represent the presence of the object whereas the other one could represent the absence of it. In other words, the output of the image classifier is a probability distribution among the different object classes — or in this case, the presence (“Circle”) or absence (“No-Circle”) of the object. The class with the highest probability could then be determined by applying the Softmax function at the output layer.

Solving the second problem was completely different. What I wanted there was to locate the object by drawing a bounding box around it. To do that, I had to find the pixel coordinates of the top left corner and the bottom right corner of the bounding box. This means my neural network had to figure out the x, y coordinates of those two points inside the image. I could do this by designing a convolutional neural network having 4 neurones in the output layer — representing the 4 numerical coordinate values.



Then the challenge was to achieve these two objectives using a single neural network. On the one hand, the image classifier part of the solution treats the values of the output neurones as a probability distribution. Then it picks the one with the highest probability and takes its label as the answer. On the other hand, the object localisation part of the solution requires the 4 output neurones to give the actual bounding box coordinates. Therefore, it is difficult for the neural network to train its fully connected dense layers to fulfil both these requirements simultaneously. Because, optimising the weights and biases for the classification would jeopardise the output of the localisation, and vice versa.

The solution to this problem was to design a neural network with two branched outputs. Since the feature extraction has to be performed equally for both problems, I made the convolution layers common and sharable. However, after the convolution layers I branched the network into two — each with their own dense layers and output layer to achieve the two different outcomes.


Architecture of the dual headed neural network


This architecture allowed me to train these two heads or branches with different loss functions and activation functions. Moreover, this allowed me to train them separately using different datasets. I will describe how I did this and why it was important to do so later in this post.


Data preparation

As with any other machine learning project, data preparation was the key here too. First I configured my camera to capture square shaped images by setting the image resolution to 2160 x 2160. Then I took around 126 photos of the vision marker placing it in different positions inside the image frame and placing it at varied distances to the camera. Moreover I took them with different backgrounds. These variations in the training dataset helped the model to achieve more accurate predictions at the end.


Part of the training image set


I took another 62 photos without the vision marker in them. This was done to train the model to identify the absence of the marker. Then I resized all photos to be just 216×216 pixels in resolution as it was unnecessary to use higher resolution images for identifying such a simple object.


Image annotation

The next step was to annotate the vision markers using bounding boxes. This is an important step in building an object localiser unlike in an image classifier. I used the free and open-source annotation tool called VoTT for this purpose.

First I created a project in VoTT called “Vision marker detector”, and created the source and the target connections pointing respectively to the folder locations in my computer for source images and the generated annotations.


Creating a project in VoTT for the annotation task


Moreover, I created a tag (label) called “Circle” for annotating the vision markers in images.


Added a tag named “Circle”


Then it was time for the not so interesting task of annotating the images!


Example annotation in VoTT


The photos taken without the vision markers were just skipped here. I will describe how I used these images for the training later in this post. However, I made sure the following setting in export settings was set to generate the annotation XML files for the unassigned (non labeled) images too.


Set to export unassigned images too


After the tedious task of annotating the images, I exported all annotations to the PASCAL VOC format using the export option in VoTT.


Organising the folder structure

I divided the images and the corresponding annotation xml files into two sets — one to represent the training set and the other to represent the validation set.


Training and validation images + annotation XML files


Then I moved them into a parent folder called “Images” inside the project folder structure.


Project folder structure in VS Code


Creating the datasets

In addition to the images and their labels it is important to use the bounding box coordinates in the process of preparing a dataset for training an object detector. For this purpose I used a python script which reads all the xml files produced by VoTT and generates two CSV files for the training and validation datasets. I have written this script in such a way that I could run it with the configuration variable — SKIP_NEGATIVE set to True or False to exclude or include the negative images (images without the object in them).




Then I wrote the following code to create the training datasets by reading the training_data.csv file which was generated by the above step. Here I created 3 lists — first one for the list of image data arrays, and then the second and third lists for the corresponding bounding box coordinates and the image labels respectively.



I used the same code to load the validation dataset as well. Then I used the following code to covert the lists to numpy arrays.


train_images = np.array(train_images)
train_targets = np.array(train_targets)
train_labels = np.array(train_labels)validation_images = np.array(validation_images)
validation_targets = np.array(validation_targets)
validation_labels = np.array(validation_labels)


When training the model, train_images array was used as the input data parameter (or parameter x) to the fit method of the Model class in Keras API. Then the arrays — train_targets and train_labels were used together in a dictionary as the target parameter (or the parameter y) to the fit method. You will notice this later in this post.


Building the model

Then it was time to build the model!

First I imported the necessary dependencies into the script.


import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tffrom tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import pathlib
import pandas as pd
from PIL import Image 
from PIL.ImageDraw import Draw


I created some configuration parameters in the form of variables and defined the array of classes used for the model.


width = 216
height = 216
num_classes = 2classes = ["Circle", "No-Circle"]


Then I wrote the code for defining the model. Firstly, I defined the input layer followed by a rescaling layer to transform the pixel data onto the numerical value range 0–1. Then I created the convolution layers chaining the output of one layer to the input of the next. I named all these convolution layers with the prefix — “bl_” with the intention of grabbing them later using this prefix.




Secondly, I defined the classification branch layers by inputting the flattened output from the convolution layers as per the architecture discussed before. Here I added only two dense layers — one with 128 neurones and the last one with just 2 neurones corresponding to the two class labels we have to predict. Moreover, I gave the prefix “cl_” for the layers of the classification branch.



Thirdly, I defined the localisation branch layers once again inputting the flattened output from the convolution layers. Here I added 4 separate dense layers with decreasing number of neurones ending with 4 neurones at the last layer corresponding to the 4 bounding box coordinate values used for the prediction. The layers in this branch were named with the prefix — “bb_”.



Finally, it was time to create the model class by passing the input layer and the two output branches. This is where we weld the two output branches into the base model.


model = tf.keras.Model(input_layer,


The summary of the model looked like below



Compiling the model

Since the two output branches were designed to achieve two different outcomes (one to output a probability distribution and the other to predict the actual bounding box values), it was necessary to set the appropriate loss functions to each branch. I used the Sparse Categorical Crossentropy loss function for the classification head and the Mean Squared Error (MSE) for the localiser head. I achieved this by defining the following dictionary.



Then I used that together with the Adam optimisation method and compiled the model.


model.compile(loss=losses, optimizer='Adam', metrics=['accuracy'])


Training the model for localisation — bounding box regression

As I have described in the model architecture section, my plan was to first train the model for object localisation. During this part of the training the localisation branch of the model will perform a bounding box regression and then will adjust its weights and biases optimising for bounding box predictions.

The important thing here is that I used only the annotated images (only the images featuring the vision markers) for training the localisation part of the model. I achieved this by running the CSV generation script with the SKIP_NEGATIVES parameter set to True, before running the dataset generation code as described in the dataset creation section above.

The reason for skipping the negative images (images without the vision marker) when training the localisation branch is that otherwise it affects the accuracy of the bounding box predictions. Because we have to set dummy bounding box coordinate values — such as (0,0) (0,0) for negative images if we use them too in the training dataset. For example the following chart shows how the training performance looks like for the localisation branch, if we train with both positive and negative images in the training set.


Validation accuracy is much lower when trained with negative images


However, the downside of using only positive image for training the localisation branch is that it gives false positives for negative images. But since I had no intention of relying on the localisation branch to determine the presence or absence of the vision marker, this wasn’t a problem to me.

I used the following code which defines two dictionary objects for the training and validation targets of the two named branches — cl_head and bb_head. Here you’ll notice that the labels array was for the classification branch and the bounding box coordinates array was for the localisation branch.


trainTargets = {
    "cl_head": train_labels,
    "bb_head": train_targets
}validationTargets = {
    "cl_head": validation_labels,
    "bb_head": validation_targets


I set the number of epochs to 20 and the batch_size to initially. Then I ran the following code to train the model.


history =, trainTargets,
             validation_data=(validation_images, validationTargets),


The following charts show the training performance of the two branches.




You will notice in the above charts that the localisation branch has performed reasonably well. However, the accuracy achieved by the classification branch is too good to be true. The performance of it was untrustworthy as it had not seen any negative images thus far. This would mean that the model (the classification part of it) will always classify what it sees as a “vision marker” since it has never seen anything without it.

However, the model was still capable of predicting the bounding boxes accurately even at this stage.



But then again, it gave false positives when used against negative images, giving some random bounding box coordinates as expected.



This is what I wanted to solve by training the classification branch next.


Training the model for classification

Then it was then time to train the classification branch of the model.

The difference here is that we need to use both positive and negative images for training the classifier as the model needs to learn both the presence and absence of images to correctly classify the images it sees.

To achieve this I once again ran the CSV generation script, however by setting the SKIP_NEGATIVES parameter to False this time. This generated the CSV files containing the records for both positive and negative images. For the negative images it created records with all zeros for the bounding box coordinates and assigned the tag — “No-Circle” as the label.


CSV file with both positive and negative images


The rest of the steps in creating the new dataset were the same as creating the dataset for training the localisation branch.

The other important thing I did before training the model in the second phase was preserving the already trained weights and biases of the convolution layers and the localisation branch. Because, otherwise a fresh round of training with a different set of images could have jeopardised the already trained weights and biases of these layers resulting in declined performance. The solution for this was to freeze the convolution layers and the bounding box branch before training the classification branch with the new dataset.

I achieved this by grabbing the base layers and the localisation branch layers using their respective prefixes and setting the trainable property of each layer to False.


for layer in model.layers:
        layer.trainable = False
for layer in model.layers:
        layer.trainable = False


After this step there was a significant number of Non-trainable parameters in the model. It was visible in the model’s summary.


6M+ Non-trainable parameters in the model


Then I used the second dataset including both the positive and negative images to train the model. Since the base layers and the localisation branch layers were freezed, it could essentially train the localisation layers only. The training performance of the classification branch wasn’t too bad for just 20 epochs as shown in the following chart.



Using the model for object detection

Then came the fun part of using the model to do some predictions.

I used a new set of photos taken with and without the vision marker to test some predictions. Here the localiser output always gave a bounding box even if there wasn’t a vision marker in it. But this is not a problem in an application as we can always rely on the classification output first, to know whether the object is present or absent in the image, and then avoid drawing the bounding box if its not present. I used the same technique when displaying the below images and their bounding boxes in Jupyter notebook.


Detections by the vision marker detector



This exercise proved to me that we could build a simple single-class object detector from scratch without relying on large pre-trained models. Secondly, the dual headed architecture and the two phase training with different datasets was the key in achieving this result. The model gave a couple of false positives and false negatives, but this was mainly due to the limited training dataset and the non-optimum hyper parameters. We will be able to achieve better results by tuning the hyper parameters and using a larger training dataset.



© 2021 Creative Software. All Rights Reserved | Privacy | Terms of Use