1. ABSTRACT AND MOTIVATION
The CIFAR-10 dataset has been used to implement a LeNet-5 style model architecture. Convolutional Neural Networks (CNNs) have a wide variety of applications in computer vision, such as image classification and recognition. CNNs extract the features from images that are the key elements in those applications. The HW5 model gives an accuracy that can be improved, and the model can be made better.
Many state-of-the-art models have been published, and I tried to implement ideas from some of them to create a new model and achieve higher accuracy.
In most cases, modifying the parameters in the given architecture gives a better result.
Data augmentation is also used, since reading about it showed me that it trains the model better. The use of data augmentation is discussed in more detail further below.
Model size is another factor that has been considered while designing the model, along with the model training time.
The LeNet architecture is the same as the one displayed in the previous section. The main functionality of the architecture is to extract the key features from the data and use them to train the model. The LeNet model consists of 7 layers. The layers are as follows:
1. C1- Convolution Layer
2. S2- Sub-sampling layer
3. C3- Convolution layer
4. S4- Sub-sampling layer
5. C5- Fully Connected layer
6. F6- Fully Connected layer
7. Output layer
Flowchart
Input Image(32*32*3)
Convolution 2D layer: 32 filters (3*3)
Max Pooling Layer (Pool Size: 3*3)
Convolution 2D layer: 64 filters (3*3)
Max Pooling Layer (Pool Size: 3*3)
Fully Connected layer (256 units)
Fully Connected Layer (128 units)
Softmax Layer
Parameters: Algorithm run on CIFAR10 dataset
Input image size= 32*32, color image
C1:
Kernel filter used in convolution: 3*3, 32 filters are used.
Padding used.
Activation- ReLU
Output= 32*32*32
S2:
Max Pooling
Padding used.
Pool size= 3*3
Output= 11*11*32
C3:
Kernel filter used in convolution: 3*3, 64 filters are used.
Padding used.
Activation- ReLU
Output- 11*11*64
S4:
Max Pooling
Padding used.
Pool size= 3*3
Output= 4*4*64
F5:
256 units, Activation- ReLU
F6:
128 units, Activation- ReLU
Output layer: Number of classes in the dataset= 10.
Activation- Softmax
Fig. 3 Parameters in the architecture
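The layer stack in Fig. 3 can be sketched in Keras roughly as follows. This is a minimal illustration of the architecture described above, not the exact training script; the placement of the Dropout layer after the first dense layer is my assumption (the 0.2 rate is discussed below).

```python
# Minimal Keras sketch of the modified LeNet-style architecture described above.
from tensorflow.keras import Input, layers, models

def build_model(num_classes=10):
    model = models.Sequential([
        Input(shape=(32, 32, 3)),                                        # CIFAR-10 color images
        layers.Conv2D(32, (3, 3), padding='same', activation='relu'),    # C1: 32*32*32
        layers.MaxPooling2D(pool_size=(3, 3), padding='same'),           # S2: 11*11*32
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),    # C3: 11*11*64
        layers.MaxPooling2D(pool_size=(3, 3), padding='same'),           # S4: 4*4*64
        layers.Flatten(),
        layers.Dense(256, activation='relu'),                            # F5
        layers.Dropout(0.2),                                             # dropout, assumed placement
        layers.Dense(128, activation='relu'),                            # F6
        layers.Dense(num_classes, activation='softmax'),                 # output layer, 10 classes
    ])
    return model

model = build_model()
```

With these settings the intermediate output shapes match those listed above (32*32*32, 11*11*32, 11*11*64, 4*4*64), and the stack has 315,978 trainable parameters, matching the model size reported in the Results section.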
Reasons for choosing these parameters:
Having done HW5 problem 1, I had a fair idea of what each parameter does and its functionality.
Firstly, I knew data augmentation was necessary; it was not used in homework 5, and adding it here increases my accuracy.
In addition, choosing the number of filters was trial and error. More filters mean more trainable parameters, which have the potential to increase accuracy but at the cost of a larger model size. I have tried to keep the best number of filters and kernel size.
Max pooling was used, and padding was set to 'same' to avoid data loss around the boundary, with a 3*3 pool kernel.
For the second convolutional layer, the number of filters is usually chosen to be higher than in the first convolutional layer, and I did the same here.
The second pooling layer parameters were chosen to be the same as those of the previous pooling layer.
The dense layer sizes were chosen as decreasing powers of 2 (256, then 128).
Dropout also helps in achieving higher performance, which is why I have added it. The dropout rate lies between 0 and 1 and is normally chosen between 0.1 and 0.4.
Finally, the last dense layer uses softmax with 10 units, the same as the number of labels in the data.
Training Mechanism:
1. Data Augmentation: We want a Convolutional Neural Network (CNN) to be robust to shifts, scaling, rotation, illumination changes, etc. Data augmentation exploits this: images are randomly chosen and multiple copies are made from each original image by shifting, scaling, rotating, shearing, zooming, whitening, etc.
In the real world, we have a very limited amount of data to access, so whatever data we have must be utilized fully. The given data can be expanded into a much larger dataset using data augmentation; each image can have multiple copies, each looking slightly different.
The main reason to use data augmentation is to train the model better. Say a flipped test image is presented to the model; it should still be correctly identified. To accommodate all such variations of test images, we build a model that correctly classifies all these varieties.
Parameters used in Data Augmentation:
a. Width Shift range= 0.1
b. Height Shift range= 0.1
c. Fill Mode= Nearest
d. Horizontal Flip= True
Many more parameters can be altered. I found these parameters to work best and obtained higher accuracy using them.
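A minimal sketch of this augmentation setup, assuming Keras' ImageDataGenerator (x_train and y_train are the preprocessed training arrays defined later in the Algorithm section):

```python
# Sketch of the augmentation parameters listed above (Keras ImageDataGenerator assumed).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    width_shift_range=0.1,   # a. shift horizontally by up to 10% of the image width
    height_shift_range=0.1,  # b. shift vertically by up to 10% of the image height
    fill_mode='nearest',     # c. fill newly exposed pixels with the nearest pixel value
    horizontal_flip=True,    # d. randomly mirror images left-right
)

# Augmented mini-batches are drawn on the fly during training:
train_batches = datagen.flow(x_train, y_train, batch_size=128)
```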
2. Loss function: Cross-entropy loss, also called log loss, is used because it measures performance when the output is a probability value between 0 and 1. Since the softmax function in the last layer gives outputs between 0 and 1, I have used cross-entropy as the loss function.
Moreover, cross-entropy is a good loss function for classification problems because it works on the principle of minimizing the distance between two probability distributions: the predicted distribution and the actual (true label) distribution.
Fig. 6 Cross Entropy
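For reference, the categorical cross-entropy shown in Fig. 6 has the standard form below, where y_i is the one-hot true label and p_i is the softmax output for class i (C = 10 classes here):

```latex
% Categorical cross-entropy for a single sample over C classes
L_{CE} = -\sum_{i=1}^{C} y_i \, \log(p_i)
```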
3. Optimizer: I have used the 'Adam' optimizer, which tends to give better results in most cases. SGD is also commonly used, but my accuracy was better with Adam. The parameter updates in Adam are invariant to rescaling of the gradient, unlike the SGD optimizer.
4. Batch Size:
Batch size is the number of training samples processed in one pass before the model parameters are updated. It matters because it determines how many samples are used per update within an epoch. If it is not specified, the entire training set would be processed as a single batch in every epoch, which creates a huge load on memory. Generally, a smaller batch size works better, and it is usually chosen as a power of 2, most commonly 32, 64, 128, or 256. Networks train better with mini-batches because the parameters are updated after every batch. I have chosen a batch size of 128 here, as it gives higher accuracy than the commonly used batch size of 32.
5. Epochs:
An epoch is defined as one forward and one backward pass over all of the training data. The number of epochs is a hyperparameter: the number of times the learning algorithm works through the entire training dataset.
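Putting the loss, optimizer, batch size, and epoch choices together, the compile-and-train call looks roughly like this. This is a sketch: model and datagen refer to the sketches above, and x_train, y_train, x_test, y_test are the preprocessed CIFAR-10 arrays (see the loading sketch under the Algorithm section).

```python
# Sketch: compile and train with categorical cross-entropy, Adam,
# batch size 128, and 50 epochs, as discussed above.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(
    datagen.flow(x_train, y_train, batch_size=128),  # augmented mini-batches of 128
    epochs=50,                                       # 50 full passes over the training data
    validation_data=(x_test, y_test),                # track test accuracy each epoch
)
```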
Algorithm:
1. Loading training and testing images from CIFAR10 dataset.
2. Training and testing labels converted to categorical type.
3. Pre-processing the training data to zero mean and unit variance. This is a necessary step, since the raw pixel values may otherwise be out of range. (A sketch of steps 1-3 appears after this list.)
4. Using a sequential model and adding layers onto it.
5. Model includes 7 layers which are as follows: convolution, max pooling, convolution, max pooling, fully connected, fully connected, softmax layer. (Parameters given as per discussed above.)
6. Model compilation using the categorical cross-entropy loss function.
7. Choosing the best optimizer for the model (here, Adam).
8. Data Augmentation is used to make the model train better.
9. Fitting the model on the training data.
10. Passing the testing data through the model as validation data over the specified number of epochs.
11. Calculating the test loss and the training and testing accuracy.
12. Plotting train and test accuracy graphs.
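A minimal sketch of steps 1-3, assuming the Keras CIFAR-10 loader and one-hot labels:

```python
# Sketch of steps 1-3: load CIFAR-10, one-hot encode labels, and standardize
# the images to zero mean / unit variance using training-set statistics.
import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()   # step 1: 50K train, 10K test

y_train = to_categorical(y_train, 10)   # step 2: labels to categorical (one-hot) type
y_test = to_categorical(y_test, 10)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
mean = x_train.mean(axis=(0, 1, 2))     # per-channel statistics from the training set
std = x_train.std(axis=(0, 1, 2))
x_train = (x_train - mean) / (std + 1e-7)   # step 3: zero mean, unit variance
x_test = (x_test - mean) / (std + 1e-7)     # reuse training statistics for the test set
```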
3. RESULTS
Best Choice of Parameters
Epochs= 50
Batch Size= 128
Dropout=0.2
Optimizer= Adam
Model Parameter: C1- 32 filters, kernel 3*3
S1- Max Pooling, 3*3
C2- 64 filters, kernel 3*3
S2- Max Pooling, 3*3
F1: 256 units
F2: 128 units
F3: 10 units, softmax
Training Accuracy= 0.8239
Training Loss= 0.4987
Testing Accuracy= 0.8041
Testing Loss= 0.6002
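The reported accuracy and loss values correspond to evaluating the trained model on the training and test sets; a minimal sketch (assuming the Keras model and preprocessed arrays from the sketches above):

```python
# Sketch: compute the reported loss and accuracy on the training and test data.
train_loss, train_acc = model.evaluate(x_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print('Train: acc %.4f, loss %.4f' % (train_acc, train_loss))
print('Test:  acc %.4f, loss %.4f' % (test_acc, test_loss))
```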
Model Size:
The model size should be as small as possible. While trying various parameters, I realized that the model size can grow very quickly, so monitoring the parameters while designing the model becomes very important. I have selected values for which I feel the total number of learnable parameters is sufficient. As seen from the figure below, the model summary gives the number of parameters after each layer as well as the total model size.
Total model size: 315,978 parameters
Fig. 9 Highlighted parameter size
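The per-layer parameter counts and the total highlighted in Fig. 9 can be read directly from the Keras model; a minimal sketch:

```python
# Sketch: report per-layer parameter counts and the total number of trainable parameters.
model.summary()                                    # prints parameters after each layer
print('Total parameters:', model.count_params())   # 315,978 for the stack above
```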
Model Time:
The model time can be divided into two parts: training time and inference time. Training time is the time spent training on the training data, while inference time is the time needed to infer predictions on the test data using the trained model. Ideally, the training time should also be small; it depends on many factors such as the CPU, GPU, and memory of the system.
With my model run on a GPU, each epoch takes about 25 seconds. I am running 50 epochs, which means 50*25 = 1250 seconds = 20.83 minutes.
So, my model training time = 20.83 mins
As for inference time, the test data here consists of 10K images, which takes only around 1-2 seconds to predict once the model is trained.
The total time, which includes the model training time, inference time, plotting, saving the model, etc., is 1264.1145 seconds ≈ 21.06 minutes. (Fig. 10)
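Inference time can be measured by wrapping the prediction call with a wall-clock timer; a minimal sketch (the exact timing script may differ):

```python
# Sketch: measure inference time on the 10K test images with a wall-clock timer.
import time

start = time.time()
predictions = model.predict(x_test, batch_size=128)
print('Inference time: %.2f s for %d images' % (time.time() - start, len(x_test)))
```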
Graph:
The model accuracy vs. number of epochs and model loss vs. number of epochs graphs for the modified model are shown below. It can be inferred that the model is neither underfitting nor overfitting.
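The curves are drawn from the Keras training history; a minimal plotting sketch, assuming matplotlib and the history object returned by model.fit above:

```python
# Sketch: plot accuracy vs. epochs and loss vs. epochs from the Keras training history.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history['accuracy'], label='train')      # 'acc' in older Keras versions
ax1.plot(history.history['val_accuracy'], label='test')
ax1.set_xlabel('epoch'); ax1.set_ylabel('accuracy'); ax1.legend()
ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='test')
ax2.set_xlabel('epoch'); ax2.set_ylabel('loss'); ax2.legend()
plt.tight_layout()
plt.show()
```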
Dropping random train samples:
Case 1: 45K training images (5K images dropped)
Fig. 13 Size of new train data after dropping 5K images
Training Accuracy: 82.19
Testing Accuracy: 79.42
Graph:
Fig. 15 Accuracy vs epoch and Loss vs Epoch graph after dropping 5K images
Case 2: 40K training images (10K images dropped)
Fig. 16 Size of new train data after dropping 10K images
Training Accuracy: 82.00
Graph:
Fig. 18 Accuracy vs epoch and Loss vs Epoch graph after dropping 10K images
4. DISCUSSION
a. Performance Improvement from the result of homework 5 problem 1B:
The parameters and hyperparameters in homework 5 problem 1B were fixed by the assignment and did not give as high an accuracy as this model.
Parameter           | Problem 1B (HW5)        | Current Model (Updated)
Batch Size          | 16                      | 128
Pre-Processing      | Yes                     | Yes
Data Augmentation   | No                      | Yes
Optimizer           | SGD                     | Adam
1st Conv layer      | 6 filters, 5*5 kernel   | 32 filters, 3*3 kernel
2nd Conv layer      | 16 filters, 5*5 kernel  | 64 filters, 3*3 kernel
1st Max Pooling     | Pool size: 2*2          | Pool size: 3*3
2nd Max Pooling     | Pool size: 2*2          | Pool size: 3*3
Padding             | Valid (No Pad)          | Same (Pad)
1st Dense layer     | 128                     | 256
2nd Dense layer     | 84                      | 128
Dropout             | 0.3                     | 0.2
Last Dense layer    | 10                      | 10
Epochs              | 20                      | 50
Table 1. Comparison of models
Problem 1B Result:
Training Accuracy: 77.13
Testing Accuracy: 65.16
Current Model:
Training Accuracy: 82.39
Testing Accuracy: 80.41
As already discussed, the current model has more trainable parameters, which results in higher accuracy than the previous one. This is achieved by increasing the number of filters in both convolution layers.
Data augmentation was added to the current model, which also helps performance.
The dense layer sizes have been increased to accommodate more features, and 'same' padding has been used in the convolution and pooling layers.
The batch size has been increased so that more samples are processed in one pass.
All of these changes collectively result in better training and testing accuracy on the CIFAR-10 dataset.
b. Degradation when randomly dropping train data:
We train on all 50K images in the training data when building the original model. But as we drop some random images from the training data, we see a degradation of the training and testing accuracy.
This is because less data is available to the network. It is always better to train the model with as much data as one can provide; the model cannot be trained as well when it is given less training data as input.
I used 45K and 40K images as my trials for randomly dropping images, meaning I dropped 5K and 10K images respectively, trained the model, and noted the results. A sketch of the random drop is shown below.
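Randomly dropping 5K or 10K training images amounts to sampling indices without replacement; a minimal sketch, assuming numpy and the preprocessed x_train/y_train arrays (the helper drop_random is hypothetical):

```python
# Sketch: randomly drop 5K (or 10K) images from the 50K-image CIFAR-10 training set.
import numpy as np

def drop_random(x, y, num_drop, seed=0):
    """Keep a random subset of the data, dropping num_drop samples."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(x), size=len(x) - num_drop, replace=False)
    return x[keep], y[keep]

x_train_45k, y_train_45k = drop_random(x_train, y_train, 5000)    # Case 1: 45K images
x_train_40k, y_train_40k = drop_random(x_train, y_train, 10000)   # Case 2: 40K images
```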
It can be seen in the results section that the training and testing accuracy decreased in both cases, by about 1-2% relative to the original model. If more samples were removed, the accuracy would drop further. (Fig. 14 and 17) The Accuracy vs Epoch and Loss vs Epoch graphs for both cases have also been plotted, so the accuracy over the entire run of epochs can be visualized. (Fig. 15 and 18)