Street Sign Classifier

Overview: 1-2 minute read

Due to the rise of autonomous driving, automatic traffic sign detection and recognition has become an important field of computer vision. Traffic sign recognition plays a major role in guiding traffic regulation, covering warnings, road condition information, traffic rules, and destination information.

In my system, the classification stage relies completely on the CNN model. Because of the complexity of the CNN structure, as much labeled data as possible is needed to train a reliable model. For this purpose, I used synthetic traffic signs. To synthesize each sign, I took the meta sign template, transformed it geometrically, added a background, changed its lighting, and adjusted its blur.

I used a CNN with a 60x60x3 input and a total of 7 hidden layers, combining convolutional, strided, dropout, and max-pooling layers. I trained the model for 6 epochs and reached an accuracy of 92%.

In depth: 15-20 minute read

Abstract

Due to increased interest in autonomous driving, automatic traffic sign detection and recognition has become an important field of computer vision. In this project, I used a Convolutional Neural Network to classify street signs. The network consisted of 7 convolutional layers, some of them followed by stride-2 downsampling. To prevent the model from overfitting, I took the average of the feature maps by adding a Global Average Pooling layer. I performed classification on the GTSRB dataset, which contains 39209 training images and 12630 testing images. The program achieved an accuracy of 92%.

Index Terms—Convolutional Neural Network, Computer Vision, Object Recognition, Object Detection

Introduction

Traffic sign recognition plays a major role in guiding traffic regulation, covering warnings, road condition information, traffic rules, and destination information. To recognize traffic signs in an image, almost all methods follow two main steps: detection and classification. These steps were difficult to perform earlier because no publicly available dataset existed until the release of the GTSRB (German Traffic Sign Recognition Benchmark) and the GTSDB (German Traffic Sign Detection Benchmark) in 2011 and 2013 respectively. After the release of these datasets, research on traffic signs took a big leap.

Traffic Sign Classification

There are many methods that have previously been used to classify traffic signs, such as LDA, SVM, ANN, and many more. Long Cheng et al., in “Traffic Sign Detection and Recognition for Intelligent Vehicle”, use OCR systems and perform pictogram-based classification. Authors have also made use of LDA to distinguish between traffic signs. Another widely used approach is classification with a Multi-Layer Perceptron. SVMs (Support Vector Machines) are also widely adopted to classify the inner parts of road signs. Ensemble techniques such as Random Forest are used to classify signs as well. Neural networks are the most used of all. In this project, I use a CNN with an extra layer called the GAP (Global Average Pooling) layer, which averages each feature map and helps reduce overfitting.

Properties Of Traffic Signs

Traffic signs have unique features that make them look different from other objects:

  • They have a simple 2-D shape: squares, triangles, circles, rectangles, etc.

  • The colors used are generally simple primary colors (red, green, blue), plus yellow, as these are easily identifiable by drivers.

  • Each traffic sign has a distinctive color, and so does the writing on it.

  • A sign can hold a figure, a character, or both.

Dataset

  • Contained 39209 training images

  • Contained 12630 testing images

  • There were 43 different classes of signs

  • There was a meta (template) image for each class

Discovering Dataset Balance

It is important that the proportion of each street sign is roughly equal in the training and testing data for maximum accuracy. A simple histogram of the counts of each sign in both datasets (sketched below) tells us whether they are balanced. As the resulting histogram (Figure 1) shows, the dataset is balanced.
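
A minimal sketch of this check, assuming the Kaggle-style GTSRB layout where Train.csv and Test.csv list a ClassId per image (the file paths and column name are assumptions):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Count how many images belong to each of the 43 classes in each split.
    train = pd.read_csv("Train.csv")   # assumed annotation files
    test = pd.read_csv("Test.csv")

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    train["ClassId"].value_counts().sort_index().plot.bar(ax=axes[0], title="Training")
    test["ClassId"].value_counts().sort_index().plot.bar(ax=axes[1], title="Testing")
    for ax in axes:
        ax.set_xlabel("class id")
        ax.set_ylabel("image count")
    plt.tight_layout()
    plt.show()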

Synthetic Data Collection

In my system, the classification stage relies completely on the CNN model. Because of the complexity of the CNN structure, as much labeled data as possible is needed to train a reliable model. For this purpose, I used synthetic traffic signs. Street view images also contain many symbol-based traffic signs, but their distribution is not uniform. For example, speed limit signs are very common in real traffic scenes, while signs such as weight limit signs exist only in some special scenes. For the same reason, there are many subclasses of speed limit signs with different speed values, and signs with very low or very high values are also uncommon. To acquire more data at low cost and balance the quantity of data in each class at the same time, I also synthesized traffic sign images from standard sign templates. The synthesis pipeline works as follows. First, I gathered all the standard signs that need to be classified. The standard signs are manually processed to add a mask channel, which is important for compositing the sign images onto a background. To simulate viewpoint changes in real scenes, a random planar affine transformation is applied to each standard sign. A planar affine transformation maps a point (x, y) to

    x' = a1·x + a2·y + a3
    y' = a4·x + a5·y + a6

so it is fully described by six parameters a1 through a6.
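
A sketch of the random affine step using OpenCV (the jitter range and the RGBA template format are illustrative assumptions, not values from the original pipeline):

    import cv2
    import numpy as np

    def random_affine(template):
        """Apply a random planar affine transform to an RGBA sign template."""
        h, w = template.shape[:2]
        # Jittering three corner points fixes all six affine parameters.
        src = np.float32([[0, 0], [w, 0], [0, h]])
        jitter = np.random.uniform(-0.15, 0.15, src.shape).astype(np.float32)
        dst = src + jitter * np.float32([w, h])
        M = cv2.getAffineTransform(src, dst)   # 2x3 matrix [a1 a2 a3; a4 a5 a6]
        return cv2.warpAffine(template, M, (w, h))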

The bounding boxes of detected traffic signs in real scenarios include both the sign and the background, especially for non-rectangular signs. To model this property, I add a background to the transformed sign generated in the previous stage. The background images are collected from real traffic scenes without any signs. First, a patch of the same size as the sign is randomly cropped from the background image set. Then the sign image is composited with the background patch. Similarly, a lighting image is randomly cropped from a lighting image set containing images of varying luminance. Finally, the synthetic images are blurred with a Gaussian kernel of random size. The resulting images have varied appearances, which is crucial for training a network to recognize traffic signs with many variations.
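
Continuing the sketch, one way the compositing, lighting, and blur stages could look (the blend weights, kernel sizes, and crop helper are illustrative assumptions):

    import random
    import cv2
    import numpy as np

    def random_crop(img, h, w):
        """Crop a random h x w patch from a larger image."""
        y = random.randint(0, img.shape[0] - h)
        x = random.randint(0, img.shape[1] - w)
        return img[y:y + h, x:x + w].astype(np.float32)

    def synthesize(sign_rgba, backgrounds, lightings):
        """Composite a warped RGBA sign over a background, relight, and blur."""
        h, w = sign_rgba.shape[:2]
        bg = random_crop(random.choice(backgrounds), h, w)
        light = random_crop(random.choice(lightings), h, w)

        alpha = sign_rgba[:, :, 3:4].astype(np.float32) / 255.0   # mask channel
        rgb = sign_rgba[:, :, :3].astype(np.float32)
        out = alpha * rgb + (1.0 - alpha) * bg                    # paste over background
        out = cv2.addWeighted(out, 0.7, light, 0.3, 0.0)          # vary the lighting
        k = random.choice([1, 3, 5])                              # random odd kernel size
        return cv2.GaussianBlur(out, (k, k), 0).astype(np.uint8)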

Methodology

Convolutional Neural Networks (CNN)

  • It is a class of deep learning methods which has become dominant in various computer vision tasks and is attracting interest across a variety of domains.

  • Inspired by the organization of the animal visual cortex, it is used for data with a grid pattern, such as images.

  • Composed of multiple building blocks, such as convolutional layers, pooling layers, and fully connected layers, and designed to automatically learn spatial hierarchies of features through backpropagation.

Implementation

The input to the convolutional network is a three-dimensional array with dimensions height, width, and color, where color has three channels (R, G, B). Originally, the images were of varying sizes, with heights and widths ranging from 60 to 90 pixels. All images were reshaped to 60 × 60 so that the input was consistent, and this input was fed to the convolutional layers.
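
A minimal loading and resizing sketch (the file-path handling and the [0, 1] normalization are assumptions; only the 60 × 60 resize comes from the text above):

    import cv2
    import numpy as np

    def load_images(paths, size=(60, 60)):
        """Read images from disk and resize them to a consistent 60x60x3 input."""
        images = [cv2.resize(cv2.imread(p), size) for p in paths]
        return np.stack(images).astype(np.float32) / 255.0   # scale to [0, 1]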

The convolutional layer extracts feature maps from the input image with linear convolutional filters, which are then followed by a nonlinear activation function (sigmoid, ReLU, tanh, etc.). The extracted feature maps are passed through further convolutional layers, two of which use stride 2. In a CNN, the stride is the amount by which the filter shifts as it slides over the input; with stride 2, the filter moves two positions at a time, so the layer down-samples its input and halves the spatial resolution. A new set of feature maps is then created by passing filters over the output of the first down-sampling.

The Rectified Linear Unit (ReLU) is one of the most widely used activation functions for hidden layers in deep learning. ReLU outputs 0 if the input is negative and passes the input through unchanged if it is positive, so it does not reduce the size of the network. As a result, large networks train much faster and the model's nonlinearity increases. Global Average Pooling (GAP) is used to minimize the number of parameters and protect the model from overfitting. The idea is to reduce each feature map to a single value by taking its spatial average. GAP is similar to a max-pooling layer, but it performs a more extreme form of dimension reduction: a volume of dimension h × w × d is reduced to dimension 1 × 1 × d, with each feature map replaced by its average, as the small example below illustrates. As the last layer, also called the classification layer, I used a softmax layer, which can predict hundreds or even thousands of classes; here its output dimension equals the number of traffic sign classes (43).
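
A tiny NumPy illustration of that reduction (the shapes are arbitrary):

    import numpy as np

    feature_maps = np.random.rand(15, 15, 64)   # an h x w x d activation volume
    gap = feature_maps.mean(axis=(0, 1))        # average each map: shape (64,)
    print(feature_maps.shape, "->", gap.shape)  # (15, 15, 64) -> (64,)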

The network contains a total of seven convolutional layers, two of which use stride 2 for down-sampling, and the final layer is followed by a GAP layer. The ReLU activation function is applied at the output of each convolutional layer. For faster processing, I batched the input with 128 images per batch, each image a 60 × 60 grid. The input first passes through a convolutional layer of 16 filters of size 3 × 3; for dimension reduction, the second convolutional layer uses stride 2. The input then passes through the same series of layers a second time, now with 32 kernels of size 3 × 3 (layers 3 and 4). The output is passed into one more convolutional layer with 64 kernels of size 3 × 3 (layer 5). To simplify the network further, this output is fed into a basic 1 × 1 convolutional layer (layer 6), followed by another ReLU; the idea is to shrink the filters from 3 × 3 to 1 × 1 before averaging the feature maps. In the final layer, the output is passed through the GAP layer and fed directly into softmax.
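
A minimal Keras sketch of one plausible reading of this architecture (the exact placement of the stride-2 layers and the 43-filter final 1 × 1 layer are my interpretation of the description, not confirmed details):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_CLASSES = 43  # GTSRB sign classes

    model = models.Sequential([
        layers.Input(shape=(60, 60, 3)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),             # layer 1
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),  # layer 2 (down-sample)
        layers.Conv2D(32, 3, padding="same", activation="relu"),             # layer 3
        layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),  # layer 4 (down-sample)
        layers.Conv2D(64, 3, padding="same", activation="relu"),             # layer 5
        layers.Conv2D(64, 1, activation="relu"),                             # layer 6 (1x1 conv)
        layers.Conv2D(NUM_CLASSES, 1),                                       # layer 7: one map per class
        layers.GlobalAveragePooling2D(),                                     # GAP instead of FC layers
        layers.Activation("softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Training as described: batches of 128 images for 6 epochs
    # (x_train / y_train are assumed to come from the loading step above).
    # model.fit(x_train, y_train, batch_size=128, epochs=6)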

Results

The experiments show that a CNN model with FC layers underperforms. One possible reason is that the neurons in the FC layers are connected to all activations in the previous layer. Adding FC layers may improve accuracy for image classification, but the complexity and number of model parameters increase significantly. The network with FC layers reached an accuracy of 89%, with 108 wrongly predicted images. Further, to avoid overfitting, dropout was used in a third network to drop some connections during training. Although dropout reduces overfitting and improves accuracy, the number of parameters remains the same as in the CNN without dropout. The average accuracy increased slightly to 90%, and the number of wrongly predicted images decreased to 101. Overall, the PCNN with GAP achieved the highest accuracy, 92%.
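
A sketch of how the misclassification counts above could be computed, assuming the model from the Implementation sketch and hypothetical test arrays x_test / y_test:

    import numpy as np

    preds = model.predict(x_test).argmax(axis=1)   # predicted class ids
    wrong = int((preds != y_test).sum())           # count of misclassified images
    print(f"accuracy: {(preds == y_test).mean():.2%}, wrong: {wrong} of {len(y_test)}")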

Conclusion

I presented a novel approach to improve street sign classification using a Pure Convolutional Neural Network (PCNN) with Global Average Pooling (GAP). I found that using a GAP layer, rather than a Fully Connected (FC) layer, provides superior performance and helps overcome the overfitting problem. To train a reliable network, street view data and synthetic images were used to generate a larger amount of data at low cost. My system achieves a strong result (92% accuracy) on this challenging dataset, showing that a PCNN can be successfully trained to classify various types of street sign images. This makes it possible to use the same approach for both object recognition and multi-class image classification.