You should prepare your homework by yourself; do not share it with other students, otherwise you will be penalized.
Introduction
In this homework, you will perform experiments on artificial neural network (ANN) training and draw conclusions from the experimental results. You will partially implement and train multi-layer perceptron (MLP) and convolutional neural network (CNN) classifiers on the CIFAR-10 dataset [1]. The implementations will be in Python, using PyTorch [2] and NumPy [3]. You can visit the links provided in the references [2, 3] to understand the usage of these libraries.
Dataset Description
The dataset you will work on is the CIFAR-10 dataset [1]. It is composed of 32×32 RGB images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The classes are completely mutually exclusive; there is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, and things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks. Some samples from the dataset are provided in Fig. 1.
The dataset is provided in the torchvision module of PyTorch. The dataset is split into two subsets: one for training and one for testing. For training, there are 50000 samples corresponding to 5000 samples for each class, and for testing, there are 10000 samples corresponding to 1000 samples for each class.
The images are stored as 3-channel RGB 32×32 images. For simplicity, they will be converted into 1-channel grayscale images, so they can be used directly as 2-D inputs to a convolutional neural network (CNN), or they can be flattened to 1-D arrays for multi-layer perceptron (MLP) structures.
Figure 1: Examples from the CIFAR-10 dataset. Taken from [1].
Each pixel of the image is an integer between 0 and 255. The labels are integers between 0 and 9: airplane (0), automobile (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8) and truck (9).
In order to download and load the CIFAR-10 dataset, use the following commands at the beginning of your code.

import torchvision

transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
    torchvision.transforms.Grayscale()
])
# training set
train_data = torchvision.datasets.CIFAR10('./data', train = True, download = True, transform = transform)
# test set
test_data = torchvision.datasets.CIFAR10('./data', train = False, transform = transform)
You also need to define dataloaders in order to train on batches and control your GPU/RAM usage. Select a batch size according to your memory.
train_generator = torch.utils.data.DataLoader(train_data, batch_size = 96, shuffle = True)
test_generator = torch.utils.data.DataLoader(test_data, batch_size = 96, shuffle = False)
PyTorch Library
PyTorch contains many modules and classes for ANN design and training. Within the scope of this homework, a very limited subset of those modules will suffice. You can simply use the base model class (torch.nn.Module) together with layers (torch.nn.Linear, torch.nn.Conv2d, etc.) to create ANN models.
import torch
# example mlp classifier
class FullyConnected(torch.nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(FullyConnected, self).__init__()
        self.input_size = input_size
        self.fc1 = torch.nn.Linear(input_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, num_classes)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = x.view(-1, self.input_size)
        hidden = self.fc1(x)
        relu = self.relu(hidden)
        output = self.fc2(relu)
        return output

# initialize your model
model_mlp = FullyConnected(1024, 128, 10)
Once you create a model, you can obtain the parameters of desired layer from class attributes.
# get the parameters of the 1024x128 layer as a numpy array
params_1024x128 = model_mlp.fc1.weight.data.numpy()
Once you create your classification model, you need to create a loss function and an optimizer so that the model can be trained for the classification task.
# create loss: use cross entropy loss
loss = torch.nn.CrossEntropyLoss()

# create optimizer
optimizer = torch.optim.SGD(model_mlp.parameters(), lr = 0.01, momentum = 0.0)

# transfer your model to train mode
model_mlp.train()

# transfer your model to eval mode
model_mlp.eval()
Once you have set up a torch.nn.Module class for your task, the training and testing operations are simple calls to the class methods. Please refer to the explanations of those classes on the web page of the module [2], where you can find examples as well. During training, do not forget the optimization steps: clearing the gradients, computing the loss, computing the gradients from the loss, and taking a gradient descent step with the optimizer. Hint: each of these corresponds to one line of code in the training loop; a minimal sketch is given below. Also, if you want to work on a GPU, do not forget to transfer your model and data to CUDA.
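A minimal sketch of one training epoch, assuming the model_mlp, loss, optimizer and train_generator objects defined in the snippets above; each part of the homework asks you to extend a loop like this with the required recording.

# minimal training-loop sketch (assumes model_mlp, loss, optimizer and train_generator above)
model_mlp.train()
for images, labels in train_generator:
    optimizer.zero_grad()                  # clear gradients
    outputs = model_mlp(images)            # forward pass
    batch_loss = loss(outputs, labels)     # calculate loss
    batch_loss.backward()                  # compute gradients
    optimizer.step()                       # take a gradient descent step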
Homework Task and Deliverables
The homework is composed of 5 parts. In the first part, you are to answer general questions on ANNs and their training. In the second part, you will implement a convolutional layer using NumPy. For the other parts, you will write code to perform experiments on classification performance under several settings. You will present the results of the experiments with some visuals and interpret the results with your own conclusions.
You should submit a single report in which your answers to the questions, the required experimental results (performance curve plots, visualizations, etc.) and your deductions are presented for each part of the homework. Moreover, for parts 2-5, you should append to the end of the report, as text, the Python code of each part used to generate the results and visualizations of the experiments. That is, all the required tasks for a part should be reproducible by running the related code. The code should be well structured and well commented. Non-text submissions (e.g. images of code) or submissions lacking comments will not be evaluated. Similarly, answers/results/conclusions written as comments in code will not be graded. ANNs implemented in an environment other than PyTorch will get a zero grade.
The report should be in portable document format (pdf) and named as hw1 name surname eXXXXXX where name, surname and Xs are to be replaced by your name, surname and digits of your user ID, respectively. You do not need to send any code files to your course assistant(s), since everything should be in your one pdf file.
Do not include the codes in utils.py to the end of your pdf file.
1 Basic Concepts
Answer the following questions.
1.1 Which Function?
ANNs are actually parametric functions which can be used to approximate other functions. What function does an ANN classifier trained with cross-entropy loss approximate? How is the loss defined to approximate that function? Why?
1.2 Gradient Computation
You are training an ANN by using stochastic gradient descent (SGD) approach with a learning rate γ. You introduce no weight regularization to the loss and no momentum is used in the gradient updates. Under these settings, you are given weights, (wk,wk+1), at iterations k and k+1, respectively. How can you compute the gradient of the loss, L, with respect to w at step k? Express the gradient, ∇wL |w=wk, in terms of (γ,wk,wk+1).
1.3 Some Training Parameters and Basic Parameter Calculations
1. What are batch and epoch in the context of MLP training?
2. Given that the dataset has N samples, what is the number of batches per epoch if the batch size is B?
3. Given that the dataset has N samples, what is the number of SGD iterations if you want to train your ANN for E epochs with the batch size of B?
1.4 Computing Number of Parameters of ANN Classifiers
1. Consider an MLP classifier of K hidden layers where the size of each hidden layer is Hk for k=1,...,K. Derive a formula to compute the number of parameters that the MLP has if the input and output dimensions are Din and Dout, respectively.
2. Consider a CNN classifier of K convolutional layers where the spatial size of each layer is Hk×Wk and the number of convolutional filters (kernels) of each layer is Ck for k=1,...,K. Derive a formula to compute the number of parameters that the CNN has if the input dimension is Hin×Win×Cin.
2 Implementing a Convolutional Layer with NumPy
In this part, you will implement your own version of the torch.conv2d function, my_conv2d, with default settings (no padding, stride 1, etc.) using NumPy. You are only required to implement the forward propagation. You will check whether your implementation is correct by running it on a small batch of the MNIST [4] dataset.
2.1 Experimental Work
Download samples_X.npy, where X is the last digit of your 7-digit student ID, as your input, and kernel.npy as your kernel file. Load them via the following code:

import numpy as np

# input shape: [batch_size, input_channels, input_height, input_width]
input = np.load('samples_X.npy')

# kernel shape: [output_channels, input_channels, filter_height, filter_width]
kernel = np.load('kernel.npy')

out = my_conv2d(input, kernel)
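For orientation, a naive sketch of what my_conv2d could look like (valid padding, stride 1, plain nested loops); your own implementation may differ, e.g. in how the patches are extracted or vectorized:

import numpy as np

def my_conv2d(x, kernel):
    # x: [batch_size, input_channels, input_height, input_width]
    # kernel: [output_channels, input_channels, filter_height, filter_width]
    batch_size, in_ch, in_h, in_w = x.shape
    out_ch, _, k_h, k_w = kernel.shape
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1      # valid padding, stride 1
    out = np.zeros((batch_size, out_ch, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, :, i:i + k_h, j:j + k_w]      # [batch_size, in_ch, k_h, k_w]
            # cross-correlate every patch with every filter (as torch.conv2d does)
            out[:, :, i, j] = np.tensordot(patch, kernel, axes=([1, 2, 3], [1, 2, 3]))
    return out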
Your my_conv2d function should generate an output, out. Once the aforementioned tasks are performed, create the output image of your function by using the provided part2Plots function in the utils.py file under the HW1 folder on the ODTUClass course page. Put this output image in your report. Note that you should already have installed PyTorch and torchvision at this point, since the part2Plots function needs these libraries for plotting.
2.2 Discussions
After creating the output image, answer the following questions.
1. Why are Convolutional Neural Networks important? Why are they used in image processing?
2. What is a kernel of a Convolutional Layer? What do the sizes of a kernel correspond to?
3. Briefly explain the output image. What happened here?
4. Why do the numbers in the same column look alike, even though they belong to different images?
5. Why do the numbers in the same row not look alike, even though they belong to the same image?
6. What can be deduced about Convolutional Layers from your answers to Questions 4 and 5?
Put your discussions together with the output image in your report.
3 Experimenting ANN Architectures
In this part, you will experiment on several ANN architectures for the classification task. Use strides of 1 for convolutions and 2 for max-pooling operations. Use valid padding for both convolutions and pooling operations. Use adaptive moment estimation (Adam) with default parameters for the optimizer. If your computation power allows, use a batch size of 50 samples; otherwise, reduce the batch size accordingly. Use no weight regularization throughout all the experiments.
Split 10% of the training data off as the validation set by randomly taking an equal number of samples from each class. Hence, you should have three sets: training, validation and testing.
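One possible way to obtain such a split, as a sketch; it assumes the train_data object from the Dataset Description section, and the 10% ratio and batch size of 50 match the settings above:

import numpy as np
import torch

# stratified 90/10 split of the training set (sketch)
targets = np.array(train_data.targets)                 # integer labels 0..9
train_idx, val_idx = [], []
for c in range(10):
    cls_idx = np.where(targets == c)[0]
    np.random.shuffle(cls_idx)
    n_val = len(cls_idx) // 10                         # 10% of each class
    val_idx.extend(cls_idx[:n_val].tolist())
    train_idx.extend(cls_idx[n_val:].tolist())

train_generator = torch.utils.data.DataLoader(
    torch.utils.data.Subset(train_data, train_idx), batch_size = 50, shuffle = True)
val_generator = torch.utils.data.DataLoader(
    torch.utils.data.Subset(train_data, val_idx), batch_size = 50, shuffle = False)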
3.1 Experimental Work
In the following, FC-N denotes fully connected (dense) layer of size N, Conv-W×H×N denotes N many 2-D convolution filters of spatial size W×H and MaxPool-2×2 denotes max-pooling operation of spatial pool size 2×2.
The name of the ANN architectures to be experimented and their layers are:
‘mlp 1’ : [FC-32, ReLU] + PredictionLayer
‘mlp 2’ : [FC-32, ReLU, FC-64(no bias)] + PredictionLayer
‘cnn 3’ : [Conv-3×3×16, ReLU, Conv-5×5×8, ReLU, MaxPool-2×2,
Conv-7×7×16, MaxPool-2×2] + PredictionLayer
‘cnn 4’ : [Conv-3×3×16, ReLU,
Conv-3×3×8, ReLU, Conv-5×5×16, ReLU, MaxPool-2×2, Conv-5×5×16, ReLU, MaxPool-2×2] + PredictionLayer
‘cnn 5’ : [Conv-3×3×8, ReLU,
Conv-3×3×16, ReLU, Conv-3×3×8, ReLU, Conv-3×3×16, ReLU, MaxPool-2×2,
Conv-3×3×16, ReLU, Conv-3×3×8, ReLU, MaxPool-2×2] + PredictionLayer
where PredictionLayer = [FC-10].
Now, for each architecture, you will perform the following tasks:
1. Using the training set, train the ANN for 15 epochs by creating a training loop using your dataloader train_generator. Do not forget the necessary optimization steps.
During training,
• Record the training loss, training accuracy and validation accuracy every 10 steps to form loss and accuracy curves (Hint: you can save the training loss with loss.item(); you need to write simple code for calculating the training and validation accuracy in model.eval() mode, as in the sketch given after this list);
• Shuffle training set after each epoch (Hint: Your defined dataloader has a shuffle flag).
After training,
• Compute test accuracy (Hint: Use same code written for calculating validation accuracy);
• Record the weights of the first layer as a numpy array (Hint: call the numpy() method of the model.xx.weight.data attribute, where xx corresponds to the first layer).
2. Repeat 1 at least 10 times and
• Take the average of the resultant loss and accuracy curves;
• Record the best test accuracy among all the runs;
• Record the weights of the first layer of the trained ANN that has the best test performance.
3. Now, form a dictionary object with the following key-value pairs as the result of the training experiment for the given architecture:
• ‘name’: name of the architecture
• ‘loss curve’: average of the training loss curves from all runs
• ‘train acc curve’: average of the training accuracy curves from all runs
• ‘val acc curve’: average of the validation accuracy curves from all runs
• ‘test acc’: the best test accuracy value from all runs
• ‘weights’: the weights of the first layer of the trained ANN with the best test performance
4. Save the dictionary object with the architecture name, prefixed with ‘part3’, as the filename (Hint: use pickle or json to save dictionary objects to a file and load dictionary objects from a file).
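As referenced in the hint of item 1, a minimal sketch of accuracy computation over a dataloader (the function name compute_accuracy and its structure are only an illustration, not a required implementation):

import torch

def compute_accuracy(model, loader):
    # fraction of correctly classified samples in the given dataloader
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    model.train()                                      # switch back to training mode
    return correct / total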
Once the aforementioned tasks are performed for each architecture, create the performance comparison plots by using the provided part3Plots function in the utils.py file under the HW1 folder on the ODTUClass course page. Note that you should pass all the dictionary objects corresponding to the results of the experiments as a list to create the performance comparison plots (Hint: you can load previously saved results and form a list to be passed to the plot function; a sketch is given below). Add this plot to your report.
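A sketch of saving and loading such a result dictionary with pickle; the file naming and the placeholder values are only illustrative, while the key names are those listed above:

import pickle

result = {'name': 'cnn 3',
          'loss curve': [], 'train acc curve': [], 'val acc curve': [],
          'test acc': 0.0, 'weights': None}            # placeholder values

# save with the 'part3' prefix
with open('part3_' + result['name'].replace(' ', '_') + '.pkl', 'wb') as f:
    pickle.dump(result, f)

# later: load the saved results and collect them in a list for part3Plots
with open('part3_cnn_3.pkl', 'rb') as f:
    results = [pickle.load(f)]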
Additionally, for all architectures, visualize the weights of the first layer by using the provided visualizeWeights function in the utils.py file under the HW1 folder on the ODTUClass course page. Add these visualizations to your report.
3.2 Discussions
Compare the architectures by considering the performances, the number of parameters, architecture structures and the weight visualizations.
1. What is the generalization performance of a classifier?
2. Which plots are informative to inspect generalization performance?
3. Compare the generalization performance of the architectures.
4. How does the number of parameters affect the classification and generalization performance?
5. How does the depth of the architecture affect the classification and generalization performance?
6. Considering the visualizations of the weights, are they interpretable?
7. Can you say whether the units are specialized to specific classes?
8. Weights of which architecture are more interpretable?
9. Considering the architectures, comment on their structures (how they are designed). Can you say that some architectures are akin to each other? Compare the performance of similarly structured architectures and of architectures with different structures.
10. Which architecture would you pick for this classification task? Why?
Put your discussions together with the performance plots and weight visualizations in your report.
4 Experimenting Activation Functions
In this part, you will compare the rectified linear unit (ReLU) function and the logistic sigmoid function. Use SGD as the training method, a constant learning rate of 0.01, 0.0 momentum (no momentum), a batch size of 50 samples, and no weight regularization throughout all the experiments.
4.1 Experimental Work
Consider the architectures in 3.1. For each architecture, create two torch.nn.Module objects: one with the ReLU activation function (the original architectures in 3.1) and one with the logistic sigmoid activation function (replacing the ReLUs of the architectures in 3.1). Then, perform the following tasks for the two classifiers:
1. Using the training set, train the ANN for 15 epochs by creating a training loop using train_generator. Do not forget the necessary optimization steps.
During training,
• Record the training loss and the magnitude of the loss gradient with respect to the weights of the first layer every 10 steps to form loss and gradient-magnitude curves (Hint: you can save the training loss with loss.item() and call the numpy() method of the first layer's weight.data attribute to obtain copies of the first-layer weights at those time steps; see also the sketch given after this list);
• Shuffle training set after each epoch (Hint: Your defined dataloader has a shuffle flag).
After training, form a dictionary object with the following key-value pairs as the result of the training experiment for the given architecture:
• ‘name’: name of the architecture
• ‘relu loss curve’: the training loss curve of the ANN with ReLU
• ‘sigmoid loss curve’: the training loss curve of the ANN with logistic sigmoid
• ‘relu grad curve’: the curve of the magnitude of the loss gradient of the ANN with ReLU
• ‘sigmoid grad curve’: the curve of the magnitude of the loss gradient of the ANN with logistic sigmoid
2. Save the dictionary object with the architecture name, prefixed with ‘part4’, as the filename (Hint: use pickle or json to save dictionary objects to a file and load dictionary objects from a file).
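A small runnable sketch of recording the first-layer gradient magnitude; it reads weight.grad directly after loss.backward(), which is one convenient alternative to keeping copies of the weights as in the hint above. The dummy model and data below only stand in for your real architecture and training loop:

import numpy as np
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))   # dummy stand-in model
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01, momentum = 0.0)
x = torch.randn(50, 1024)                              # dummy batch of flattened images
y = torch.randint(0, 10, (50,))                        # dummy labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# magnitude (Frobenius norm) of the loss gradient w.r.t. the first-layer weights
grad_mag = np.linalg.norm(model[0].weight.grad.data.numpy())
optimizer.step()
print(loss.item(), grad_mag)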
Once the aforementioned tasks are performed for each architecture, create the performance comparison plots by using the provided part4Plots function in the utils.py file under the HW1 folder on the ODTUClass course page. Note that you should pass all the dictionary objects corresponding to the results of the experiments as a list to create the performance comparison plots. Add this plot to your report.
4.2 Discussions
Compare the architectures by considering the training performances:
1. How is the gradient behavior in different architectures? What happens when depth increases?
2. Why do you think that happens?
3. Bonus: What might happen if we use inputs in the range [0,255] instead of [0.0,1.0]?
Put your discussions together with the performance plots in your report.
5 Experimenting Learning Rate
In this part, you will examine the effect of the learning rate in the SGD method. Use SGD for the optimizer, a constant learning rate, the ReLU activation function, 0.0 momentum (no momentum), a batch size of 50 samples, and no weight regularization throughout all the experiments. You will vary the initial learning rate across the experiments so that each training run is performed with a different learning rate.
Split 10% of the training data off as the validation set by randomly taking an equal number of samples from each class. Hence, you should have three sets: training, validation and testing.
5.1 Experimental Work
Pick your favorite architecture from 3.1, excluding ‘mlp 1’. Create three torch.nn.Module objects to be trained with initial learning rates of 0.1, 0.01 and 0.001, respectively. Then, perform the following tasks for the three classifiers:
• Using the training set, train the three ANNs for 20 epochs by creating a training loop using train_generator. Do not forget the necessary optimization steps.
During training,
• Record the training loss and the validation accuracy every 10 steps to form loss and accuracy curves (Hint: you can save the training loss with loss.item() and use your previous code to compute the validation accuracy);
• Shuffle training set after each epoch (Hint: Your defined dataloader has a shuffle flag).
After training, form a dictionary object with the following key-value pairs as the result of the training experiment for the given architecture:
• ‘name’: name of the architecture
• ‘losscurve 1’: the training loss curve of the ANN trained with 0.1 learning rate
• ‘losscurve 01’: the training loss curve of the ANN trained with 0.01 learning rate
• ‘losscurve 001’: the training loss curve of the ANN trained with 0.001 learning rate
• ‘val acc curve 1’: the validation accuracy curve of the ANN trained with 0.1 learning rate
• ‘val acc curve 01’: the validation accuracy curve of the ANN trained with 0.01 learning rate
• ‘val acc curve 001’: the validation accuracy curve of the ANN trained with 0.001 learning rate
Once the aforementioned tasks are performed, create the performance comparison plots by using the provided part5Plots function in the utils.py file under the HW1 folder on the ODTUClass course page. Note that you should pass a single dictionary object corresponding to the result of the experiment to create the performance comparison plots. Add this plot to your report.
Now, you will try a scheduled learning rate to improve the SGD-based training.
1. Examine the validation accuracy curve of the ANN trained with 0.1 learning rate. Approximately determine the epoch step where the accuracy stops increasing.
2. Create a torch.nn.Module object with the same parameters as above and with initial learning rate of 0.1.
3. Train that classifier until the epoch step that you determined in 1. Then, set the learning rate to 0.01 and continue training until 30 epochs (Hint: create an optimizer with the new learning rate and continue training the model with the new optimizer, without reinitializing the model; a sketch is given after the note below).
4. Record only the validation accuracy during this training.
5. Now, plot the validation accuracy curve and determine the epoch step where the accuracy stops increasing.
6. Repeat 2 and 3; however, in 3, continue training with 0.01 until the epoch step that you determined in 5. Then, set the learning rate to 0.001 and continue training until 30 epochs.
7. Repeat 4 and, once the training ends, record the test accuracy of the trained model and compare it to the same model trained with Adam in 3.1.
Note: You can increase the number of epochs if you cannot observe the steps where training stops improving.
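A minimal runnable sketch of the scheduled-learning-rate idea referenced in step 3; the switch epoch, the dummy model, the dummy data and the train_one_epoch helper are placeholders for your real architecture and training loop:

import torch

model = torch.nn.Linear(1024, 10)                          # stand-in for the chosen architecture
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(50, 1024), torch.randint(0, 10, (50,))  # dummy batch

def train_one_epoch(model, optimizer):
    # stand-in for one full pass over train_generator
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

switch_epoch = 10                                          # read off the 0.1 validation curve
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1, momentum = 0.0)
for epoch in range(switch_epoch):
    train_one_epoch(model, optimizer)

# same (partially trained) model; only the optimizer is recreated with the smaller rate
optimizer = torch.optim.SGD(model.parameters(), lr = 0.01, momentum = 0.0)
for epoch in range(switch_epoch, 30):
    train_one_epoch(model, optimizer)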
5.2 Discussions
Compare the effect of learning rate by considering the training performances:
1. How does the learning rate affect the convergence speed?
2. How does the learning rate affect the convergence to a better point?
3. Does your scheduled learning rate method work? In what sense?
4. Compare the accuracy and convergence performance of your scheduled learning rate method with Adam.
Put your discussions together with the performance plots in your report.
References
[1] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
[2] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019.
[3] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, et al., “Array programming with NumPy,” Nature, vol. 585, pp. 357–362, 2020.
[4] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.