CS7641-Assignment 4 Solved

1. Two Layer Neural Network **[P]** **[W]**
Perceptron
 
 
 
A single layer perceptron can be thought of as a linear hyperplane as in logistic regression followed by a nonlinear activation function.
$$u_i = \sum_{j=1}^{d} \theta_{ij} x_j + b_i$$
$$o_i = \phi\left(\sum_{j=1}^{d} \theta_{ij} x_j + b_i\right) = \phi\left(\theta_i^T x + b_i\right)$$
where π‘₯ is a d-dimensional vector i.e. π‘₯ ∈ 𝑅𝑑 . It is one datapoint with 𝑑 features. πœƒπ‘– ∈ 𝑅𝑑 is the weight vector for the π‘–π‘‘β„Ž hidden unit, 𝑏𝑖 ∈ 𝑅 is the bias element for the π‘–π‘‘β„Ž hidden unit and πœ™(. ) is a non-linear activation function that has been described below. 𝑒𝑖 is a linear combination of the features in π‘₯𝑗 weighted by πœƒπ‘– whereas π‘œπ‘– is the π‘–π‘‘β„Ž output unit from the activation layer.
Fully connected Layer
Typically, a modern neural network contains millions of perceptrons like the one described above. Perceptrons interact in different configurations, such as cascaded or parallel. In this part, we describe a fully connected layer configuration in a neural network, which comprises multiple parallel perceptrons forming one layer.
We extend the previous notation to describe a fully connected layer. Each layer in a fully connected network has a number of input/hidden/output units cascaded in parallel. Let us define a single layer of the neural net as follows:
π‘š denotes the number of hidden units in a single layer 𝑙 whereas 𝑛 denotes the number of units in the previous layer 𝑙 − 1.
𝑒[𝑙] = πœƒ[𝑙]π‘œ[𝑙−1] + 𝑏[𝑙]
where 𝑒[𝑙] ∈ π‘…π‘š is a m-dimensional vector pertaining to the hidden units of the π‘™π‘‘β„Ž layer of the neural network after applying linear operations. Similarly, π‘œ[𝑙−1] is the n-dimensional output vector corresponding to the hidden units of the (𝑙 − 1)π‘‘β„Ž activation layer. πœƒ[𝑙] ∈ π‘…π‘š×𝑛 is the weight matrix of the π‘™π‘‘β„Ž layer where each row of πœƒ[𝑙] is analogous to πœƒπ‘– described in the previous section i.e. each row corresponds to one hidden unit of the π‘™π‘‘β„Ž layer. 
𝑏[𝑙] ∈ π‘…π‘š is the bias vector of the layer where each element of b pertains to one hidden unit of the π‘™π‘‘β„Ž layer. This is followed by element wise non-linear activation function π‘œ[𝑙] = πœ™(𝑒[𝑙]). The whole operation can be summarized as,
π‘œ[𝑙] = πœ™(πœƒ[𝑙]π‘œ[𝑙−1] + 𝑏[𝑙])
where π‘œ[𝑙−1] is the output of the previous layer.
Activation Function
There are many activation functions in the literature but for this question we are going to use Relu and Tanh only.
Relu
The rectified linear unit (Relu) is one of the most commonly used activation functions in deep learning models. The mathematical form is
π‘œ = πœ™(𝑒) = π‘šπ‘Žπ‘₯(0, 𝑒)
 
The derivative of the Relu function is given as
$$o' = \phi'(u) = \begin{cases} 0 & u \le 0 \\ 1 & u > 0 \end{cases}$$
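A short NumPy sketch of Relu and its derivative (illustrative helper names, not the assignment's API):

import numpy as np

def relu(u):
    # phi(u) = max(0, u), applied element-wise
    return np.maximum(0, u)

def relu_derivative(u):
    # phi'(u) = 1 where u > 0, and 0 otherwise
    return (u > 0).astype(float)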
 
Tanh
Tanh, also known as the hyperbolic tangent, is like a shifted version of the sigmoid activation function, with its range going from -1 to 1. Tanh almost always proves to be better than the sigmoid function since the mean of the activations is closer to zero. Tanh has the effect of centering the data, which makes learning for the next layer a bit easier. The mathematical form of tanh is given as
$$o = \phi(u) = \tanh(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}$$
The derivative of tanh is given as
π‘œ′ = πœ™′(𝑒) = 1 − (  π‘’𝑒 − 𝑒−−𝑒 )2 = 1 − π‘œ2
𝑒𝑒 + 𝑒 𝑒
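And the corresponding NumPy sketch for Tanh (again, illustrative helper names only):

import numpy as np

def tanh(u):
    return np.tanh(u)

def tanh_derivative(u):
    # phi'(u) = 1 - tanh(u)^2 = 1 - o^2
    return 1.0 - np.tanh(u) ** 2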
 
Sigmoid
The sigmoid function is another non-linear function with S-shaped curve. This function is useful in the case of binary classification as its output is between 0 and 1. The mathematical form of the function is
$$o = \phi(u) = \frac{1}{1 + e^{-u}}$$
 
The derivative of the sigmoid function has a nice form and is given as
$$o' = \phi'(u) = \frac{1}{1 + e^{-u}}\left(1 - \frac{1}{1 + e^{-u}}\right) = \phi(u)\left(1 - \phi(u)\right)$$
 
 
Note: We will not be using sigmoid activation function for this assignment. This is included only for the sake of completeness.
 
Mean Squared Error
It is an estimator that measures the average of the squares of the errors, i.e. the average squared difference between the actual and the estimated values. It estimates the quality of the learnt hypothesis from the actual and the predicted values. It is non-negative, and the closer it is to zero, the better the learnt function is.
Implementation details
For regression problems as in this exercise, we compute the loss as follows:
$$MSE = \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the true label and $\hat{y}_i$ is the estimated label. We use a factor of $\frac{1}{2N}$ instead of $\frac{1}{N}$ to simplify the derivative of the loss function.
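A quick NumPy check of this loss and of the effect of the 1/(2N) factor on its derivative (function names are illustrative):

import numpy as np

def mse_loss(y, y_hat):
    # 1/(2N) * sum of squared differences; y and y_hat have shape (1, N)
    N = y.shape[1]
    return np.sum((y - y_hat) ** 2) / (2 * N)

def mse_loss_grad(y, y_hat):
    # d loss / d y_hat = (y_hat - y) / N  -- the 1/2 cancels the 2 from the square
    N = y.shape[1]
    return (y_hat - y) / N

y = np.array([[1.0, 2.0, 3.0]])
y_hat = np.array([[1.5, 1.0, 3.0]])
print(mse_loss(y, y_hat))        # (0.25 + 1.0 + 0.0) / 6 = 0.2083...
print(mse_loss_grad(y, y_hat))   # [[ 0.1667 -0.3333  0. ]]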
Forward Propagation
We start by initializing the weights of the fully connected layer using Xavier initialization (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf). During training, we pass all the data points through the network layer by layer using forward propagation. The main equations for forward prop are described below.
$$u^{[0]} = x$$
$$u^{[1]} = \theta^{[1]} u^{[0]} + b^{[1]}$$
$$o^{[1]} = Tanh(u^{[1]})$$
$$u^{[2]} = \theta^{[2]} o^{[1]} + b^{[2]}$$
$$\hat{y} = o^{[2]} = Relu(u^{[2]})$$
Then we get the output and compute the loss
$$l = \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$
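A minimal NumPy sketch of this two-layer forward pass, including the cache discussed in the backward-propagation section below (the layer sizes, variable names, and Xavier-style scaling shown here are illustrative, not the exact dlnet implementation):

import numpy as np

def forward(x, params):
    # x: (d, N) batch of N points; params: theta1 (m, d), b1 (m, 1), theta2 (1, m), b2 (1, 1)
    cache = {}
    cache['u0'] = x
    cache['u1'] = params['theta1'] @ x + params['b1']             # linear part of layer 1
    cache['o1'] = np.tanh(cache['u1'])                            # Tanh activation
    cache['u2'] = params['theta2'] @ cache['o1'] + params['b2']   # linear part of layer 2
    cache['o2'] = np.maximum(0, cache['u2'])                      # Relu activation -> y_hat
    return cache['o2'], cache

def loss(y, y_hat):
    return np.sum((y - y_hat) ** 2) / (2 * y.shape[1])

# illustrative sizes: d=13 features, m=20 hidden units, N=379 training points
rng = np.random.default_rng(0)
d, m, N = 13, 20, 379
params = {'theta1': rng.normal(size=(m, d)) * np.sqrt(1.0 / d),   # Xavier-style scaling
          'b1': np.zeros((m, 1)),
          'theta2': rng.normal(size=(1, m)) * np.sqrt(1.0 / m),
          'b2': np.zeros((1, 1))}
x, y = rng.normal(size=(d, N)), rng.normal(size=(1, N))
y_hat, cache = forward(x, params)
print(y_hat.shape, loss(y, y_hat))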
Backward propagation
After the forward pass, we do back propagation to update the weights and biases in the direction of the negative gradient of the loss function. So, we update the weights and biases using the following formulas
$$\theta^{[2]} := \theta^{[2]} - lr \times \frac{\partial l}{\partial \theta^{[2]}}$$
$$b^{[2]} := b^{[2]} - lr \times \frac{\partial l}{\partial b^{[2]}}$$
$$\theta^{[1]} := \theta^{[1]} - lr \times \frac{\partial l}{\partial \theta^{[1]}}$$
$$b^{[1]} := b^{[1]} - lr \times \frac{\partial l}{\partial b^{[1]}}$$
where $lr$ is the learning rate. It decides the step size we want to take in the direction of the negative gradient.
To compute the terms $\frac{\partial l}{\partial \theta^{[i]}}$ and $\frac{\partial l}{\partial b^{[i]}}$ we use the chain rule for differentiation as follows:

$$\frac{\partial l}{\partial \theta^{[2]}} = \frac{\partial l}{\partial o^{[2]}} \cdot \frac{\partial o^{[2]}}{\partial u^{[2]}} \cdot \frac{\partial u^{[2]}}{\partial \theta^{[2]}}, \qquad \frac{\partial l}{\partial b^{[2]}} = \frac{\partial l}{\partial o^{[2]}} \cdot \frac{\partial o^{[2]}}{\partial u^{[2]}} \cdot \frac{\partial u^{[2]}}{\partial b^{[2]}}$$

Here, $\frac{\partial l}{\partial o^{[2]}}$ is the derivative of the loss function at the point $o^{[2]}$,

$\frac{\partial o^{[2]}}{\partial u^{[2]}}$ is the derivative of the Relu function at the point $u^{[2]}$,

$\frac{\partial u^{[2]}}{\partial \theta^{[2]}}$ is equal to $o^{[1]}$, and

$\frac{\partial u^{[2]}}{\partial b^{[2]}}$ is equal to 1.
 
To compute $\frac{\partial l}{\partial \theta^{[2]}}$, we need $o^{[2]}$, $u^{[2]}$ and $o^{[1]}$, which are calculated during forward propagation. So we need to store these values in cache variables during forward propagation to be able to access them during backward propagation. Similarly, for calculating other partial derivatives, we store the values we'll be needing for the chain rule in cache. These values are obtained from the forward propagation and used in backward propagation. The cache is implemented as a dictionary here where the keys are the variable names and the values are the variables' values.
 
Also, the functional forms of the MSE differentiation and Relu differentiation are given by

$$\frac{\partial l}{\partial o^{[2]}} = \left(o^{[2]} - y\right)$$

$$\frac{\partial l}{\partial u^{[2]}} = \frac{\partial l}{\partial o^{[2]}} \odot \mathbb{1}\left(u^{[2]} > 0\right)$$
On vectorization over the $n$ training examples, $\frac{\partial l}{\partial o^{[2]}}$ and $\frac{\partial l}{\partial u^{[2]}}$ keep the same element-wise form, and the parameter gradients become:

$$\frac{\partial l}{\partial \theta^{[2]}} = \frac{1}{n}\,\frac{\partial l}{\partial u^{[2]}}\left(o^{[1]}\right)^T$$

$$\frac{\partial l}{\partial b^{[2]}} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial l}{\partial u^{[2]}}\right)_i$$
This completes the differentiation of the loss function w.r.t. the parameters in the second layer. We now move on to the first layer, the equations for which are given as follows:

$$\frac{\partial l}{\partial o^{[1]}} = \left(\theta^{[2]}\right)^T \frac{\partial l}{\partial u^{[2]}}$$

$$\frac{\partial l}{\partial u^{[1]}} = \frac{\partial l}{\partial o^{[1]}} \odot \frac{\partial o^{[1]}}{\partial u^{[1]}}$$

$$\frac{\partial l}{\partial \theta^{[1]}} = \frac{1}{n}\,\frac{\partial l}{\partial u^{[1]}}\,x^T$$

$$\frac{\partial l}{\partial b^{[1]}} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial l}{\partial u^{[1]}}\right)_i$$

where $\frac{\partial u^{[1]}}{\partial \theta^{[1]}} = x$.

Note that $\frac{\partial o^{[1]}}{\partial u^{[1]}}$ is the derivative of the Tanh function at $u^{[1]}$.
The above equations outline the forward and backward propagation process for a 2-layer fully connected neural net with Tanh as the first activation layer and Relu as the second one. The same process can be extended to other neural networks with different activation layers.
Code Implementation:
dLoss_o2 = $\partial l / \partial o^{[2]}$ ⟹ dim = (1, 379)
dLoss_u2 = $\partial l / \partial u^{[2]}$ ⟹ dim = (1, 379)
dLoss_theta2 = $\partial l / \partial \theta^{[2]}$ ⟹ dim = (1, 20)
dLoss_b2 = $\partial l / \partial b^{[2]}$ ⟹ dim = (1, 1)
dLoss_o1 = $\partial l / \partial o^{[1]}$ ⟹ dim = (20, 379)
dLoss_u1 = $\partial l / \partial u^{[1]}$ ⟹ dim = (20, 379)
dLoss_theta1 = $\partial l / \partial \theta^{[1]}$ ⟹ dim = (20, 13)
dLoss_b1 = $\partial l / \partial b^{[1]}$ ⟹ dim = (20, 1)

Note: The training set has 379 examples.
 
Question
In this question, you will implement a two-layer fully connected neural network. You will also experiment with different activation functions and optimization techniques. Functions with comments "TODO: implement this" are for you to implement. We provide two activation functions here - Relu and Tanh. You will implement a neural network that has a tanh activation followed by a relu layer.
You'll also implement Gradient Descent (GD) and Batch Gradient Descent (BGD) algorithms for training these neural nets. GD is mandatory for all. BGD is bonus for undergraduate students but mandatory for graduate students.
We'll train this neural net on the Boston house-prices dataset. Graduate students have to use both GD and BGD to optimize their neural net. Undergraduate students have to implement GD while BGD is bonus for them. Note: it is possible you'll run into nan or negative values for loss. This happens because of the small dataset we're using and some numerical stability issues that arise due to division by zero, natural log of zeros, etc. You can experiment with the total number of iterations to mitigate this.
You're free to tune hyperparameters like the batch size, number of hidden units in each layer etc. if that helps you in achieving the desired MSE values to pass the autograder tests. However, you're advised to try out the default values first.
Deliverables for this question:
1.    Loss plot and MSE value for neural net with gradient descent
2.    Loss plot and MSE value for neural net with batch gradient descent (mandatory for graduate students, bonus for undergraduate students)
In [3]: ''' 
Training the Neural Network with Gradient Descent, you do not need to modify this cell. 
''' 
# load dataset 
from NN import dlnet 

dataset = load_boston() # load the dataset 
x, y = dataset.data, dataset.target 
y = y.reshape(-1,1) 

x = MinMaxScaler().fit_transform(x) # normalize data 
y = MinMaxScaler().fit_transform(y) 

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1) # split data 
x_train, x_test, y_train, y_test = x_train.T, x_test.T, y_train.reshape(1,-1), y_test # condition data 

nn = dlnet(x_train, y_train, lr=0.001) # initialize neural net class 
nn.gradient_descent(x_train, y_train, iter=60000) # train 

# create figure 
fig = plt.plot(np.array(nn.loss).squeeze()) 
plt.title(f'Training: {nn.neural_net_type}') 
plt.xlabel("Epoch") 
plt.ylabel("Loss") 
Loss after iteration 0 : 0.08998720790406557 
Loss after iteration 2000 : 0.03447045570792528 
Loss after iteration 4000 : 0.026082025875585717 
Loss after iteration 6000 : 0.021985761323744975 
Loss after iteration 8000 : 0.019221572045989344 
Loss after iteration 10000 : 0.0171312387149271 
Loss after iteration 12000 : 0.015481615134847618 
Loss after iteration 14000 : 0.014150468155408014 
Loss after iteration 16000 : 0.013056265947055324 
Loss after iteration 18000 : 0.012144233428924957 
Loss after iteration 20000 : 0.011375969884101582 
Loss after iteration 22000 : 0.010723556429773134 
Loss after iteration 24000 : 0.010166039039175113 
Loss after iteration 26000 : 0.00968724869375962 
Loss after iteration 28000 : 0.009274407828538368 
Loss after iteration 30000 : 0.008917215011552453 
Loss after iteration 32000 : 0.00860479528994069 
Loss after iteration 34000 : 0.008327716324795615 
Loss after iteration 36000 : 0.00808548436439606 
Loss after iteration 38000 : 0.007873794902733459 
Loss after iteration 40000 : 0.007687575073014062 
Loss after iteration 42000 : 0.007524085749641826 
Loss after iteration 44000 : 0.0073800770899116325 
Loss after iteration 46000 : 0.007252799959909596 
Loss after iteration 48000 : 0.00713989928234579 
Loss after iteration 50000 : 0.007039353657066983 
Loss after iteration 52000 : 0.006947090828255051 
Loss after iteration 54000 : 0.0068614629996172705 
Loss after iteration 56000 : 0.0067844664658564206 
Loss after iteration 58000 : 0.0067148547698228644 
Out[3]: Text(0,0.5,'Loss')
 
In [4]: ''' 
Testing Neural Network with Gradient Descent, you do not need to modify this cell. 
''' 
from NN import dlnet 

y_predicted = nn.predict(x_test) # predict 
y_test = y_test.reshape(1,-1) 
print("Mean Squared Error (MSE)", (np.sum((y_predicted-y_test)**2)/y_test.shape[1])) 
Mean Squared Error (MSE) 0.013986255118317744  
In [5]: ''' 
Training the Neural Network with Batch Gradient Descent, you do not need to modify this cell. 
''' 
# load dataset 
dataset = load_boston() # load the dataset 
x, y = dataset.data, dataset.target 
y = y.reshape(-1,1) 

x = MinMaxScaler().fit_transform(x) # normalize data 
y = MinMaxScaler().fit_transform(y) 

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1) # split data 
x_train, x_test, y_train, y_test = x_train.T, x_test.T, y_train.reshape(1,-1), y_test # condition data 

nn = dlnet(x_train, y_train, lr=0.001) # initialize neural net class 
nn.batch_gradient_descent(x_train, y_train, iter=60000) # train 

# create figure 
fig = plt.plot(np.array(nn.loss).squeeze()) 
plt.title(f'Training: {nn.neural_net_type}') 
plt.xlabel("Epoch") 
plt.ylabel("Loss") 
Loss after iteration 0 : 0.08928835763543144 
Loss after iteration 1000 : 0.043562868973035324 
Loss after iteration 2000 : 0.05029091064924741 
Loss after iteration 3000 : 0.03916083817916013 
Loss after iteration 4000 : 0.02096771284249359 
Loss after iteration 5000 : 0.020357253297347166 
Loss after iteration 6000 : 0.012577946044952012 
Loss after iteration 7000 : 0.01927458308777301 
Loss after iteration 8000 : 0.017217220738397848 
Loss after iteration 9000 : 0.020138901693840038 
Loss after iteration 10000 : 0.022686181787599513 
Loss after iteration 11000 : 0.017128729598504245 
Loss after iteration 12000 : 0.008560348147467324 
Loss after iteration 13000 : 0.013064152161595661 
Loss after iteration 14000 : 0.012715460107326899 
Loss after iteration 15000 : 0.013238575056667395 
Loss after iteration 16000 : 0.008855830720025548 
Loss after iteration 17000 : 0.020110119616157142 
Loss after iteration 18000 : 0.016840256002458025 
Loss after iteration 19000 : 0.006636662673024662 
Loss after iteration 20000 : 0.007889095546028117 
Loss after iteration 21000 : 0.010807881716109621 
Loss after iteration 22000 : 0.01204456005399052 
Loss after iteration 23000 : 0.007856637641787767 
Loss after iteration 24000 : 0.012763599711750881 
Loss after iteration 25000 : 0.013009945137541973 
Loss after iteration 26000 : 0.010229012465629428 
Loss after iteration 27000 : 0.005461836687757919 
Loss after iteration 28000 : 0.007639941036023259 
Loss after iteration 29000 : 0.00921458312857956 
Loss after iteration 30000 : 0.007159351303519348 
Loss after iteration 31000 : 0.006083718174101598 
Loss after iteration 32000 : 0.014918882328224468 
Loss after iteration 33000 : 0.011152743079568974 
Loss after iteration 34000 : 0.00419858926328067 
Loss after iteration 35000 : 0.006879245165985816 
Loss after iteration 36000 : 0.009014583473643279 
Loss after iteration 37000 : 0.008980986197331977 
Loss after iteration 38000 : 0.005506747269612967 
Loss after iteration 39000 : 0.010374096323579092 
Loss after iteration 40000 : 0.011347164275415973 
Loss after iteration 41000 : 0.008160226431348698 
Loss after iteration 42000 : 0.005139082485441008 
Loss after iteration 43000 : 0.005621330218206484 
Loss after iteration 44000 : 0.008935402946941664 
Loss after iteration 45000 : 0.006075653728174685 
Loss after iteration 46000 : 0.006480208787365516 
Loss after iteration 47000 : 0.01178145307689564 
Loss after iteration 48000 : 0.008979423123285434 
Loss after iteration 49000 : 0.004017361226679237 
Loss after iteration 50000 : 0.005599094048537514 
Loss after iteration 51000 : 0.008275053941251236 
Loss after iteration 52000 : 0.005635957802622802 
Loss after iteration 53000 : 0.004602045536013525 
Loss after iteration 54000 : 0.009003824206229305 
Loss after iteration 55000 : 0.009251667734470538 
Loss after iteration 56000 : 0.006599624227931043 
Loss after iteration 57000 : 0.004699796177210642 
Loss after iteration 58000 : 0.004540675707934483 
Loss after iteration 59000 : 0.00900445247084298 
Out[5]: Text(0,0.5,'Loss')
 
Mean Squared Error (MSE) 0.013985397524602275 
2: Image Classification based on Convolutional Neural Networks **[W]**
Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research. In this part, you will build a convolutional neural network based on Keras to solve the image classification task for CIFAR10. If you haven't installed TensorFlow, you can install the package by pip command or train your model by uploading HW4 notebook to Colab (https://colab.research.google.com/) directly. Colab contains all packages you need for this section.
Hint1: First contact with Keras (https://keras.io/about/)
Hint2: How to Install Keras (https://www.pyimagesearch.com/2016/07/18/installing-keras-for-deep-learning/)
Hint3: CS231n Tutorial (Layers used to build ConvNets) (https://cs231n.github.io/convolutional-networks/)
Environment Setup
In [7]: from __future__ import print_function 
import tensorflow as tf 
from tensorflow.keras.datasets import cifar10 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Activation, Dropout 
from tensorflow.keras.layers import LeakyReLU 
from sklearn.utils import shuffle 
import numpy as np 
import matplotlib.pyplot as plt 

D:\Tools\Python\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.  from ._conv import register_converters as _register_converters 
Load CIFAR10 dataset
We use the CIFAR10 dataset to train our model. This is a dataset of 50,000 32x32 color training images and 10,000 test images, labeled over 10 categories. Each example is a 32 × 32 pixel color image of various objects.
In [8]: # Helper function, You don't need to modify it 
# split data between train and test sets 
(x_train, y_train), (x_test, y_test) = cifar10.load_data() 

# input image dimensions 
img_rows, img_cols = 32, 32 
number_channels = 3 
# set num of classes 
num_classes = 10 

if tf.keras.backend.image_data_format() == 'channels_first': 
    x_train = x_train.reshape(x_train.shape[0], number_channels, img_rows, img_cols) 
    x_test = x_test.reshape(x_test.shape[0], number_channels, img_rows, img_cols) 
    input_shape = (number_channels, img_rows, img_cols) 
else: 
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, number_channels) 
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, number_channels) 
    input_shape = (img_rows, img_cols, number_channels) 

x_train = x_train.astype('float32') 
x_test = x_test.astype('float32') 
x_train /= 255 
x_test /= 255 
print('x_train shape:', x_train.shape) 
print('x_test shape:', x_test.shape) 
print(x_train.shape[0], 'train samples') 
print(x_test.shape[0], 'test samples') 

cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"] 
# convert class vectors to binary class matrices 
y_train = tf.keras.utils.to_categorical(y_train, num_classes) 
y_test = tf.keras.utils.to_categorical(y_test, num_classes) 
x_train shape: (50000, 32, 32, 3) 
x_test shape: (10000, 32, 32, 3) 
50000 train samples 
10000 test samples 
Load some images from CIFAR10
 
As you can see from above, the CIFAR10 dataset contains a selection of objects. The images have been size-normalized and objects remain centered in fixed-size images.
Build convolutional neural network model
In this part, you need to build a convolutional neural network as described below. The architecture of the model is:
[INPUT - CONV - CONV - MAXPOOL - DROPOUT - CONV - CONV - MAXPOOL - DROPOUT - FC1 - DROPOUT - FC2]
INPUT: [32 × 32 × 3] will hold the raw pixel values of the image, in this case, an image of width 32, height 32, and with 3 color channels. This layer should give 16 filters and have appropriate padding to maintain shape.
CONV: The Conv. layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the small region of the input volume they are connected to. We set the kernel_size to 3 × 3 for both Conv. layers.
For example, the output of the Conv. layer may look like [32 × 32 × 32] if we use 32 filters. Again, we use padding to maintain shape.
MAXPOOL: MAXPOOL layer will perform a downsampling operation along the spatial dimensions (width, height). With pool size of 2 × 2, resulting shape takes form 16 × 16.
DROPOUT: DROPOUT layer with the dropout rate of 0.25, to prevent overfitting.
CONV: An additional Conv. layer takes outputs from the above layers and applies more filters. The Conv. layer may look like [16 × 16 × 32]. We set the kernel_size to 3 × 3 and use padding to maintain shape for both Conv. layers.
CONV: An additional Conv. layer takes outputs from the above layers and applies more filters. The Conv. layer may look like [16 × 16 × 64].
MAXPOOL: MAXPOOL layer will perform a downsampling operation along the spatial dimensions (width, height).
DROPOUT: Dropout layer with the dropout rate of 0.25, to prevent overfitting.
FC1: Dense layer which takes input from the above layers and has 256 neurons. A Flatten operation may be useful.
DROPOUT: Dropout layer with the dropout rate of 0.5, to prevent overfitting.
FC2: Dense layer with 10 neurons, and softmax activation, is the final layer. The dimension of the output space is the number of classes.
Activation function: Use LeakyReLU unless otherwise indicated to build your model architecture.
Note that while this is a suggested model design, you may use other architectures and experiment with different layers for better results.
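For reference, one way the suggested stack could be written with Keras is sketched below. The filter counts simply follow the example shapes in the description above; the graded create_net() in cnn.py may use different values (the summary shown later in this notebook, for instance, uses larger filter counts and an extra conv block).

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense, LeakyReLU

def build_suggested_model(input_shape=(32, 32, 3), num_classes=10):
    model = Sequential([
        Conv2D(16, (3, 3), padding='same', input_shape=input_shape),  # INPUT/CONV, keeps 32x32
        LeakyReLU(),
        Conv2D(32, (3, 3), padding='same'),                            # CONV, e.g. 32x32x32
        LeakyReLU(),
        MaxPooling2D(pool_size=(2, 2)),                                 # MAXPOOL -> 16x16
        Dropout(0.25),
        Conv2D(32, (3, 3), padding='same'),                             # CONV, e.g. 16x16x32
        LeakyReLU(),
        Conv2D(64, (3, 3), padding='same'),                             # CONV, e.g. 16x16x64
        LeakyReLU(),
        MaxPooling2D(pool_size=(2, 2)),                                  # MAXPOOL -> 8x8
        Dropout(0.25),
        Flatten(),
        Dense(256),                                                      # FC1
        LeakyReLU(),
        Dropout(0.5),
        Dense(num_classes, activation='softmax'),                        # FC2
    ])
    return model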
In [11]: # Helper function, You don't need to modify it 
# Show the architecture of the model 
achi = plt.imread('./images/Architecture.png') 
fig = plt.figure(figsize=(10,10)) 
plt.imshow(achi) 

Out[11]: <matplotlib.image.AxesImage at 0x22536c8ac50>
 
Defining Variables
You now need to set the training variables in the __init__() function in cnn.py. Once you have defined the variables you may use the cell below to see them.
In [13]: # Helper function, You don't need to modify it 
# You can adjust parameters to train your model in __init__() in cnn.py 
 
from cnn import CNN 

net = CNN() 
batch_size, epochs, init_lr = net.get_vars() 
print(f'Batch Size\t: {batch_size} \nEpochs\t\t: {epochs} \nLearning Rate\t: {init_lr} \n') 
Batch Size     : 64  
Epochs      : 10  
Learning Rate     : 0.001  
 
Defining model
You now need to complete the create_net() function in cnn.py to define your model structure. Once you have defined a model structure you may use the cell below to examine your architecture. 
In [15]: # Helper function, You don't need to modify it 
# model.summary() gives you details of your architecture. 
# You can compare your architecture with the 'Architecture.png' 
net = CNN() 

s = tf.keras.backend.clear_session() 
model = net.create_net() 
model.summary() 
WARNING:tensorflow:From D:\Tools\Python\Anaconda3\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. 
Instructions for updating: 
Colocations handled automatically by placer. 
WARNING:tensorflow:From D:\Tools\Python\Anaconda3\lib\site-packages\tensorflow\python\keras\layers\core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. 
Instructions for updating: 
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`. 
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
conv2d (Conv2D)              (None, 32, 32, 64)        1792       
_________________________________________________________________ 
conv2d_1 (Conv2D)            (None, 32, 32, 64)        36928      
_________________________________________________________________ 
max_pooling2d (MaxPooling2D) (None, 16, 16, 64)        0          
_________________________________________________________________ 
dropout (Dropout)            (None, 16, 16, 64)        0          
_________________________________________________________________ 
conv2d_2 (Conv2D)            (None, 16, 16, 128)       73856      
_________________________________________________________________ 
conv2d_3 (Conv2D)            (None, 16, 16, 128)       147584     
_________________________________________________________________ 
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 128)         0          
_________________________________________________________________ 
conv2d_4 (Conv2D)            (None, 8, 8, 256)         295168     
_________________________________________________________________ 
conv2d_5 (Conv2D)            (None, 8, 8, 256)         590080     
_________________________________________________________________ 
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 256)         0          
_________________________________________________________________ 
flatten (Flatten)            (None, 4096)              0          
_________________________________________________________________ 
dropout_1 (Dropout)          (None, 4096)              0          
_________________________________________________________________ 
dense (Dense)                (None, 256)               1048832    
_________________________________________________________________ 
dense_1 (Dense)              (None, 10)                2570       
================================================================= 
Total params: 2,196,810 
Trainable params: 2,196,810 
Non-trainable params: 0 
_________________________________________________________________ 
Compiling model
Next, prepare the model for training by completing compile_model() in cnn.py. Remember we are performing 10-way classification when selecting a loss function.
 
<tensorflow.python.keras.engine.sequential.Sequential object at 0x0000022528C576A0>
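A hedged sketch of what compile_model() might look like, assuming categorical cross-entropy (the labels were one-hot encoded above, and this is 10-way classification) and an Adam optimizer at the initial learning rate; the actual signature in cnn.py may differ:

import tensorflow as tf

def compile_model(model, init_lr=0.001):
    # 10-way classification with one-hot labels -> categorical cross-entropy
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(init_lr),
                  metrics=['accuracy'])
    return model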
Train the network
Tuning: Training the network is the next thing to try. You can set your parameters in the Defining Variables section. If your parameters are set properly, you should see the loss on the validation set decrease and the accuracy increase. It may take more than 30 minutes to train your model.
Expected Result: You should be able to achieve more than 80% accuracy on the test set to get full 15 points. If you achieve accuracy between 75% to 79%, you will only get half points of this part.
Train your own CNN model 
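The training cell itself is not reproduced here. A sketch of what it might look like is given below: the printed learning rates in the log decay by a factor of 0.9 per epoch, which can be reproduced with a LearningRateScheduler callback (the schedule, the callback choice, and the variable names are assumptions; model, x_train, y_train, x_test, y_test come from the earlier cells):

import tensorflow as tf

def lr_schedule(epoch):
    # decay the initial learning rate of 0.001 by a factor of 0.9 each epoch
    return 0.001 * (0.9 ** epoch)

callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)]
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=10,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks)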
  
Train on 50000 samples, validate on 10000 samples 
WARNING:tensorflow:From D:\Tools\Python\Anaconda3\lib\site-packages\ten sorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.o ps.math_ops) is deprecated and will be removed in a future version. 
Instructions for updating: Use tf.cast instead. 
Learning rate: 0.001 
Epoch 1/10 
50000/50000 [==============================] - 545s 11ms/sample - loss: 1.5672 - acc: 0.4195 - val_loss: 1.1528 - val_acc: 0.5833 
Learning rate: 0.0009000000000000001 
Epoch 2/10 
50000/50000 [==============================] - 540s 11ms/sample - loss: 1.0561 - acc: 0.6255 - val_loss: 0.9274 - val_acc: 0.6753 
Learning rate: 0.0008100000000000001 
Epoch 3/10 
50000/50000 [==============================] - 554s 11ms/sample - loss: 0.8499 - acc: 0.6996 - val_loss: 0.7930 - val_acc: 0.7235 
Learning rate: 0.0007290000000000002 
Epoch 4/10 
50000/50000 [==============================] - 548s 11ms/sample - loss: 0.7428 - acc: 0.7402 - val_loss: 0.6829 - val_acc: 0.7627 
Learning rate: 0.0006561000000000001 
Epoch 5/10 
50000/50000 [==============================] - 542s 11ms/sample - loss: 0.6561 - acc: 0.7691 - val_loss: 0.6374 - val_acc: 0.7786 
Learning rate: 0.00059049 
Epoch 6/10 
50000/50000 [==============================] - 566s 11ms/sample - loss: 0.5815 - acc: 0.7951 - val_loss: 0.6103 - val_acc: 0.7926 
Learning rate: 0.000531441 
Epoch 7/10 
50000/50000 [==============================] - 563s 11ms/sample - loss: 0.5318 - acc: 0.8136 - val_loss: 0.5765 - val_acc: 0.8024 
Learning rate: 0.0004782969000000001 
Epoch 8/10 
50000/50000 [==============================] - 568s 11ms/sample - loss: 0.4811 - acc: 0.8316 - val_loss: 0.5749 - val_acc: 0.8046 
Learning rate: 0.0004304672100000001 
Epoch 9/10 
50000/50000 [==============================] - 562s 11ms/sample - loss: 0.4377 - acc: 0.8446 - val_loss: 0.5859 - val_acc: 0.8052 
Learning rate: 0.0003874204890000001 
Epoch 10/10 
50000/50000 [==============================] - 540s 11ms/sample - loss: 0.3950 - acc: 0.8589 - val_loss: 0.5595 - val_acc: 0.8138 
Test loss: 0.5595157977819443 
Test accuracy: 0.8138 
  
dict_keys(['loss', 'acc', 'val_loss', 'val_acc', 'lr']) 
 
 
3: Random Forests **[P]** **[W]**
NOTE: Please use sklearn's DecisionTreeClassifier in your Random Forest implementation. You can find more details about this classifier here (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).
3.1 Random Forest Implementation **[P]**
The decision boundaries drawn by decision trees are very sharp, and fitting a decision tree of unbounded depth to a list of examples almost inevitably leads to overfitting. In an attempt to decrease the variance of a decision tree, we're going to use a technique called 'Bootstrap Aggregating' (often abbreviated 'bagging'). This stems from the idea that a collection of weak learners can learn decision boundaries as well as a strong learner. This is commonly called a Random Forest.
We can build a Random Forest as a collection of decision trees, as follows:
1. For every tree in the random forest, we're going to
a)    Subsample the examples with replacement. Note that in this question, the size of the subsampled data is equal to that of the original dataset.
b)    From the subsample in a), choose attributes at random to learn on, in accordance with a provided attribute subsampling rate. Based on what was mentioned in class, we randomly pick features at each split. We use a more general approach here to make the programming part easier: we randomly pick some features (70% of the features) and grow the tree based on the pre-determined, randomly selected features. Therefore, there is no need to find random features at each split.
c)    Fit a decision tree to the subsample of data we've chosen to a certain depth.
Classification for a random forest is then done by taking a majority vote of the classifications yielded by each tree in the forest after it classifies an example.
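A minimal sketch of steps a)-c) and the majority vote, built on sklearn's DecisionTreeClassifier as suggested in the note above (the class and attribute names here are illustrative, not the assignment's RandomForest API):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleForest:
    def __init__(self, n_estimators=9, max_depth=3, max_features=0.7):
        self.max_features = max_features
        self.decision_trees = [DecisionTreeClassifier(max_depth=max_depth) for _ in range(n_estimators)]
        self.feature_idx = []   # the feature subset chosen for each tree

    def fit(self, X, y):
        n, d = X.shape
        k = int(self.max_features * d)
        for tree in self.decision_trees:
            rows = np.random.choice(n, size=n, replace=True)    # a) bootstrap rows, same size as X
            cols = np.random.choice(d, size=k, replace=False)   # b) pre-determined random feature subset
            tree.fit(X[rows][:, cols], y[rows])                 # c) fit a depth-limited tree on the subsample
            self.feature_idx.append(cols)

    def predict(self, X):
        votes = np.array([tree.predict(X[:, cols])
                          for tree, cols in zip(self.decision_trees, self.feature_idx)])
        # majority vote across trees for each example (labels assumed to be non-negative integers)
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])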
In RandomForest Class,
1.    X is assumed to be a matrix with num_training rows and num_features columns where num_training is the number of total records and num_features is the number of features of each record.
2.    y is assumed to be a vector of labels of length num_training.
NOTE: Look out for TODOs for the parts that need to be implemented.
3.2 Hyperparameter Tuning with a Random Forest  **[P]**
In machine learning, hyperparameters are parameters that are set before the learning process begins. The max_depth, n_estimators, or max_features variables from 3.1 are examples of different hyperparameters for a random forest model. In this section, you will tune your random forest model on a heart disease dataset to achieve high accuracy in predicting whether a patient shows indicators of heart disease from patient attributes.
Let's first review the dataset in a bit more detail.
Dataset Objective
Imagine that we are doctors working on a cure for heart disease by using machine learning to categorize patients. We know that narrowing arteries are an early indicator of disease. We are tasked with the responsibility of coming up with a method for determining the likelihood of a patient having narrowing arteries. We will then use this information to decide which patients to run further tests on for treatment.
After much deliberation amongst the team, you come to the conclusion that we can use past patient data to predict the future occurrence of disease.
We will use our random forest algorithm from Q3.1 to predict whether a patient may have indicators of heart disease.
You can find more information on the dataset here (https://archive.ics.uci.edu/ml/datasets/heart+disease). 
Loading the dataset
The dataset that has been collected has the following features (only 14 used out of a potential 76):
Inputs:
1.    (age)
2.    (sex)
3.    (cp) chest pain type
4.    (trestbps) resting blood pressure (in mm Hg on admission to the hospital)
5.    (chol) serum cholestoral in mg/dl
6.    (fbs) fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7.    (restecg) resting electrocardiographic results:
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8.    (thalach) maximum heart rate achieved
9.    (exang) exercise induced angina (1 = yes; 0 = no)
10.    (oldpeak) ST depression induced by exercise relative to rest
11.    (slope) the slope of the peak exercise ST segment
Value 1: upsloping
Value 2: flat
Value 3: downsloping
12.    (ca) number of major vessels (0-3) colored by flourosopy
13.    (thal) 3 = normal; 6 = fixed defect; 7 = reversable defect
Output:
1. (num) target value:
0 means <50% chance of narrowing arteries
1+ means greater than 50% chance of narrowing arteries
Your random forest model will try to predict this variable.
In [21]: # Logic for loading in datasets. DO NOT MODIFY anything in this block. 

# This is a Helper cell. DO NOT MODIFY CODE IN THIS CELL 
from sklearn import preprocessing 
import pandas as pd 

preprocessor = preprocessing.LabelEncoder() 

data_train = pd.read_csv("data/heart_disease_cleaveland_train.csv") 
data_test = pd.read_csv("data/heart_disease_cleaveland_test.csv") 

X_train = data_train.drop(columns = 'num') 
y_train = data_train['num'] 
y_train = y_train.to_numpy() 
y_train[y_train > 1] = 1 
X_test = data_test.drop(columns = 'num') 
X_test = np.array(X_test) 
y_test = data_test['num'] 
y_test = y_test.to_numpy() 
y_test[y_test > 1] = 1 
#y_test = np.array() 
X_train, y_train, X_test, y_test = np.array(X_train), np.array(y_train), np.array(X_test), np.array(y_test) 
In the following codeblock, train your random forest model with different values for max_depth, n_estimators, or max_features and evaluate each model on the held-out test set. Try to choose a combination of hyperparameters that maximizes your prediction accuracy on the test set (aim for 75%+). Once you are satisfied with your chosen parameters, change the default values for max_depth, n_estimators, and max_features in the init function of your RandomForest class in random_forest.py to your chosen values, and then submit this file to Gradescope. You must achieve at least a 75% accuracy against the test set in Gradescope to receive full credit for this section.
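One simple way to run that search is a small grid loop, sketched below. It assumes the RandomForest class and the data loaded above; the OOB_score call mirrors the evaluation used in the next cell.

from itertools import product
from random_forest import RandomForest

best_acc, best_combo = 0.0, None
for n_estimators, max_depth, max_features in product([5, 9, 12], [3, 7, 12], [0.7, 0.85, 1.0]):
    rf = RandomForest(n_estimators, max_depth, max_features)
    rf.fit(X_train, y_train)
    acc = rf.OOB_score(X_test, y_test)   # accuracy on the held-out test set
    if acc > best_acc:
        best_acc, best_combo = acc, (n_estimators, max_depth, max_features)
print("best accuracy %.4f with (n_estimators, max_depth, max_features) = %s" % (best_acc, best_combo))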
In [35]: """ TODO: 
n_estimators defines how many decision trees are fitted for the random forest. 
max_depth defines a stop condition when the tree reaches a certain depth. 
max_features controls the percentage of features that are used to fit each decision tree. 

Tune these three parameters to achieve a better accuracy. While you can use the provided test set to 
evaluate your implementation, you will need to obtain 75% on the test set to receive full credit for this section. 
""" 
from random_forest import RandomForest 
import sklearn.ensemble 

n_estimators = 9    # Hint: Consider values between 5-12. 
max_depth = 3       # Hint: Consider values between 3-12. 
max_features = 0.7  # Hint: Consider values between 0.7-1.0. 

random_forest = RandomForest(n_estimators, max_depth, max_features) 

random_forest.fit(X_train, y_train) 

accuracy = random_forest.OOB_score(X_test, y_test) 

print("accuracy: %.4f" % accuracy) 

accuracy: 0.7632 
3.3 Plotting Feature Importance **[W]**
While building tree-based models, it's common to quantify how well splitting on a particular feature in a decision tree helps with predicting the target label in a dataset. Machine learning practitioners typically use "Gini importance", or the (normalized) total reduction in entropy brought by that feature, to evaluate how important that feature is for predicting the target variable.
Gini importance is typically calculated as the reduction in entropy from reaching a split in a decision tree weighted by the probability of reaching that split in the decision tree. Sklearn internally computes the probability for reaching a split by finding the total number of samples that reaches it during the training phase divided by the total number of samples in the dataset. This weighted value is our feature importance.
Let's think about what this metric means with an example. A high probability of reaching a split on "Age" in a decision tree trained on our patient dataset (many samples will reach this split for a decision) and a large reduction in entropy from splitting on "Age" will result in a high feature importance value for "Age". This could mean "Age" is a very important feature for predicting a patient's probability of disease. On the other hand, a low probability of reaching a split on "Cholesterol (chol)" in a decision tree (few samples will reach this split for a decision) and a low reduction in entropy from splitting on "Cholesterol (chol)" will result in a low feature importance value. This could mean "Cholesterol (chol)" is not a very informative feature for predicting a patient's probability of disease in our decision tree. Thus, the higher the feature importance value, the more important the feature is to predicting the target label.
Fortunately for us, fitting a sklearn.DecisionTreeClassifier to a dataset automatically computes the Gini importance for every feature in the decision tree and stores these values in the feature_importances_ attribute. Review the DecisionTreeClassifier docs (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more details on how to access this variable.
In the function below, display a bar plot that shows the feature importance values for at least one decision tree in your tuned random forest from Q3.2, and briefly comment on whether any features have noticeably higher or lower importance weights than others. [Note that there isn't a "correct" answer here. We simply want you to investigate how different features in your random forest contribute to predicting the target variable.]

Trestbps, CP and age seem to be important, probably because older people are more likely to get sick. Moreover, chest pain type and resting blood pressure reflect the health level of patients, which is closely related to whether they are ill or not.
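A sketch of one way to produce such a bar plot, reading feature_importances_ from a single fitted sklearn tree inside the forest (the decision_trees attribute name is an assumption about the RandomForest class from 3.1; if each tree was fit on a feature subsample, the indices refer to that tree's selected features):

import numpy as np
import matplotlib.pyplot as plt

# take one fitted sklearn DecisionTreeClassifier from the tuned forest (attribute name assumed)
tree = random_forest.decision_trees[0]
importances = tree.feature_importances_      # Gini importances computed by sklearn

plt.figure(figsize=(8, 4))
plt.bar(range(len(importances)), importances)
plt.xlabel('feature index (within the features used by this tree)')
plt.ylabel('Gini importance')
plt.title('Feature importances for one tree in the random forest')
plt.show()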
 
4: SVM **[W]** **[P]**
4.1 Fitting an SVM classifier by hand **[W]**
Consider a dataset with the following points in 2-dimensional space:
π‘₯1 π‘₯2 𝑦
 
    0    0 -1
    0    2 -1
    2    0 -1
    2    2    1
    4    0    1
    4    4    1
Here, π‘₯1 and π‘₯2 are features and 𝑦 is the label.
The max margin classifier has the formulation,
min ||πœƒ||2
𝑠. 𝑑. 𝑦𝑖(π±π’πœƒ + 𝑏) ≥ 1    ∀ 𝑖
Hint: 𝐱𝐒 are the suppport vectors. Margin is equal to 1 and full margin is equal to 2 . You might find it
    ||πœƒ||    ||πœƒ||
useful to plot the points in a 2D plane.
(1)    Are the points linearly separable? Does adding the point 𝐱 = (4, 2), 𝑦 = 0 change the separability? (2 pts) The points are linearly separable. Adding the point 𝐱 = (4, 2), 𝑦 = 0 makes them no longer linearly separable.
(2)    According to the max-margin formulation, find the separating hyperplane. (4 pts) The separating hyperplane is π‘₯1+π‘₯2−3 = 0.
(3)    Find a vector parallel to the optimal vector πœƒ. (4 pts) 
A vector parallel to the optimal vector πœƒ: (1, 1) (any scalar multiple of (1, 1), the normal of the hyperplane π‘₯1 + π‘₯2 − 3 = 0, works).
(4)    Calculate the value of the margin achieved by this πœƒ? (4 pts) The value of the margin achieved by this πœƒ: $\frac{1}{||\theta||} = \frac{1}{\sqrt{2}}$
(5)    Solve for πœƒ, given that the margin is equal to 1/||πœƒ||. (4 pts) πœƒ = (1, 1)
(6)    If we remove one of the points from the original data the SVM solution might change. Find all such points which change the solution. (2 pts) 
All such points: (0,2), (2,0), (2,2), (4,0)
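As a quick consistency check of the answers above, the hyperplane $\theta = (1, 1)$, $b = -3$ satisfies the max-margin constraints with equality exactly at those four points:

$$
\begin{aligned}
(0,2),\; y=-1: &\quad -\,(0+2-3) = 1 \\
(2,0),\; y=-1: &\quad -\,(2+0-3) = 1 \\
(2,2),\; y=+1: &\quad +\,(2+2-3) = 1 \\
(4,0),\; y=+1: &\quad +\,(4+0-3) = 1 \\
\text{margin} &= \frac{1}{\lVert\theta\rVert} = \frac{1}{\sqrt{2}}
\end{aligned}
$$

The remaining points, $(0,0)$ and $(4,4)$, satisfy the constraint strictly (with values 3 and 5), so removing either of them leaves the solution unchanged.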
4.2 Feature Mapping (10 Pts) **[P]**
Let's look at a dataset where the datapoints can't be classified with good accuracy using a linear classifier. Run the cell below to generate the dataset.
We will also see what happens when we try to fit a linear classifier to the dataset. 
 
In [37]: # DO NOT CHANGE 
def visualize_decision_boundary(X, y, feature_new=None, h=0.02): 
    ''' 
    You don't have to modify this function 
 
    Function to visualize the decision boundary 
 
    feature_new is a function to get X with additional features 
    ''' 
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 
    xx_1, xx_2 = np.meshgrid(np.arange(x1_min, x1_max, h), 
                             np.arange(x2_min, x2_max, h)) 
    if X.shape[1] == 2: 
        Z = svm_cls.predict(np.c_[xx_1.ravel(), xx_2.ravel()]) 
    else: 
        X_conc = np.c_[xx_1.ravel(), xx_2.ravel()] 
        X_new = feature_new(X_conc) 
        Z = svm_cls.predict(X_new) 

    Z = Z.reshape(xx_1.shape) 

    f, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5)) 
    plt.contourf(xx_1, xx_2, Z, cmap=plt.cm.coolwarm, alpha=0.8) 
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm) 
    plt.xlabel('X_1') 
    plt.ylabel('X_2') 
    plt.xlim(xx_1.min(), xx_1.max()) 
    plt.ylim(xx_2.min(), xx_2.max()) 
    plt.xticks(()) 
    plt.yticks(()) 

    plt.show() 
In [38]: # DO NOT CHANGE 
# Try to fit a linear classifier to the dataset 
from sklearn import svm 
from sklearn.metrics import accuracy_score 

svm_cls = svm.LinearSVC() 
svm_cls.fit(X_train, y_train) 
y_test_predicted = svm_cls.predict(X_test) 

print("Accuracy on test dataset: {}".format(accuracy_score(y_test, y_test_predicted))) 

visualize_decision_boundary(X_train, y_train) 
Accuracy on test dataset: 0.425 
 
We can see that we need a non-linear boundary to be able to successfully classify the data in this dataset. By mapping the current features x to a higher-dimensional space with more features, a linear SVM can be applied to the features in the higher-dimensional space to learn a non-linear decision boundary. In the function below, add additional features which can help classify the above dataset. After creating the additional features, use the code in the further cells to see how well the features perform on the test set.
Note: You should get an accuracy above 95%
Hint: Think of the shape of the decision boundary that would best separate the above points. What additional features could help map the linear boundary to the non-linear one? Look at this (https://xavierbourretsicotte.github.io/Kernel_feature_map.html) for a detailed analysis of doing the same for points separable with a circular boundary.
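A sketch of the kind of feature map that works here, assuming the classes are separated by a roughly circular or elliptical boundary; this is an illustrative implementation, not necessarily what create_nl_feature in feature.py does:

import numpy as np

def create_nl_feature_sketch(X):
    # X has shape (N, 2). Appending the degree-2 terms lets a linear SVM in the
    # new 5-dimensional space realize a quadratic (e.g. circular) boundary in the original space.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])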
In [39]: # DO NOT CHANGE 
from feature import create_nl_feature 

X_new = create_nl_feature(X) 

X_train, X_test, y_train, y_test = train_test_split(X_new, y, 
                                                     test_size=0.20, 
                                                     random_state=random_state) 
In [40]: # DO NOT CHANGE 
# Fit to the new features and visualize the decision boundary 
# You should get more than 90% accuracy on the test set 

svm_cls = svm.LinearSVC() 
svm_cls.fit(X_train, y_train) 
y_test_predicted = svm_cls.predict(X_test) 

print("Accuracy on test dataset: {}".format(accuracy_score(y_test, y_test_predicted))) 

visualize_decision_boundary(X_train, y_train, create_nl_feature) 
Accuracy on test dataset: 0.975 
