DataAnalytics-3333 Assignment 4 -Solved

Question 1
The following figure shows a neural network with two inputs, one hidden layer with two hidden neurons, and one output. (For simplicity, we omit the intercept terms here.) We initialize the parameters as follows: w11 = 0.1, w12 = 0.4, w21 = −0.1, w22 = −0.1, v11 = 0.06, v12 = −0.4. Given one observation x1 = 1 and x2 = 0, and the observed output t1 = 0, update the network parameter w11 using the learning rate λ = 0.01.
Solution
Forward pass through the hidden layer, using the sigmoid activation f:

net1 = w11 x1 + w12 x2 = 0.1(1) + 0.4(0) = 0.1
y1 = f(net1) = e^0.1 / (1 + e^0.1) = 0.52
f'(net1) = y1(1 − y1) = 0.52(1 − 0.52) = 0.25

net2 = w21 x1 + w22 x2 = −0.1(1) + (−0.1)(0) = −0.1
y2 = f(net2) = e^−0.1 / (1 + e^−0.1) = 0.48
f'(net2) = y2(1 − y2) = 0.48(1 − 0.48) = 0.25

Forward pass through the output layer:

net1* = v11 y1 + v12 y2 = 0.06(0.52) + (−0.4)(0.48) = −0.16
z1 = f(net1*) = 0.46
f'(net1*) = z1(1 − z1) = 0.46(1 − 0.46) = 0.25

error = t1 − z1 = 0 − 0.46 = −0.46

Backpropagating the error to the two weights:

∂J/∂v11 = −(t1 − z1) f'(net1*) y1 = −(−0.46)(0.25)(0.52) = 0.0598
∂J/∂w11 = −(t1 − z1) f'(net1*) v11 f'(net1) x1 = −(−0.46)(0.25)(0.06)(0.25)(1) = 0.001725

Gradient-descent updates with λ = 0.01:

v11(new) = v11(old) − λ ∂J/∂v11 = 0.06 − (0.01)(0.0598) = 0.0594
w11(new) = w11(old) − λ ∂J/∂w11 = 0.1 − (0.01)(0.001725) = 0.09998275
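
The arithmetic can be checked quickly in R. This is a minimal sketch (not part of the original solution), assuming the sigmoid activation f(z) = 1/(1 + e^(−z)) and the squared-error loss J = (1/2)(t1 − z1)^2 implied by the gradient formulas above:

f = function(z) 1 / (1 + exp(-z))          # sigmoid activation
w11 = 0.1; w12 = 0.4; w21 = -0.1; w22 = -0.1
v11 = 0.06; v12 = -0.4
x1 = 1; x2 = 0; t1 = 0; lambda = 0.01
net1 = w11*x1 + w12*x2                     # 0.1
net2 = w21*x1 + w22*x2                     # -0.1
y1 = f(net1); y2 = f(net2)                 # about 0.52 and 0.48
net1star = v11*y1 + v12*y2                 # about -0.16
z1 = f(net1star)                           # about 0.46
dJ_dw11 = -(t1 - z1) * z1*(1 - z1) * v11 * y1*(1 - y1) * x1
w11 - lambda * dJ_dw11                     # updated w11, about 0.09998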
Question 2 
The principal component method can be used to summarize the data in a lower dimension. Suppose each observation Xi in the data set has two features, Xi1 and Xi2. We wish to use the principal component method to represent the data in a one-dimensional space. We have the following data set:
X =
  −3    6
  −6    6
  −8    3.5
  −7    6
  −7    5
  −9    6
Calculate the first principal component for the first observation. 
Solution 
X = c(-3, -6, -8, -7, -7, -9, 6, 6, 3.5, 6, 5, 6) 
X = matrix(X, ncol = 2) 
Xbar = t(colMeans(X)) 
Xbar 
## [,1] [,2] 
## [1,] -6.666667 5.416667 
Xstar = apply(X, 2, scale, scale=FALSE, center=TRUE) 
Xstar 
## [,1] [,2] 
## [1,] 3.6666667 0.5833333 
## [2,] 0.6666667 0.5833333 
## [3,] -1.3333333 -1.9166667 
## [4,] -0.3333333 0.5833333 
## [5,] -0.3333333 -0.4166667 
## [6,] -2.3333333 0.5833333 
covmat = cov(Xstar) 
eigen_v = eigen(covmat) 
w = eigen_v$vectors
w
## [,1] [,2] 
## [1,] -0.9773142 0.2117948 
## [2,] -0.2117948 -0.9773142 
The eigenvector in the first column corresponds to the largest eigenvalue, 4.4255881, versus the eigenvalue 0.8827452 for the second column. We choose the eigenvector with the largest eigenvalue since it captures the largest variance.
y = w[,1] %*% Xstar[1,]
y
## [,1] 
## [1,] -3.707032 
As we can see from the code above, the first principal component score for the first observation is -3.707032.
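
As a quick cross-check (not part of the original solution), the built-in prcomp() function should give the same score, up to the arbitrary sign of the eigenvector:

pc = prcomp(X, center = TRUE, scale. = FALSE)
pc$x[1, 1]  # first principal component score of observation 1; equals -3.707032 up to sign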
Question 3 
ID3(S, A) is an important algorithm in the construction of decision trees. The set S denotes the collection of observations. The set A denotes the collection of predictors. In this question, let A = {X1, X2}. Let S be the following data set:
S =
Y   X1   X2
1    1    1
1    0    1
0    0    1
0    0    0
1    1    0
We would like to build a classification tree for the response variable Y.
• What is the misclassification error rate if we do a majority vote for Y without splitting on X1 or X2?
• What is the misclassification error rate if we split the data set based on X1 = 1 versus X1 = 0? What is the misclassification error rate if we split the data set based on X2 = 1 versus X2 = 0?
• Should we split the tree based on the predictor X1 or X2, or not split the tree?
• Decision trees are very sensitive to the data set: if there are small changes in the data set, the resulting tree can be very different. Ensemble methods can overcome this problem and improve the performance of the decision tree. Use two or three sentences to describe what an ensemble method is and name three ensemble methods that can be used to improve decision trees.
Solution
• Without splitting, the majority vote predicts Y = 1 (3 of the 5 observations), so the misclassification error is

Error = C(P_S(Y = 1)) = C(3/5) = 2/5,

where C(p) = min(p, 1 − p) is the misclassification error of a majority vote.
• Splitting on X1 = 1 versus X1 = 0:

Error = P(X1 = 1) C(P(Y = 1 | X1 = 1)) + P(X1 = 0) C(P(Y = 1 | X1 = 0))
      = (2/5) C(2/2) + (3/5) C(1/3)
      = (2/5)(0) + (3/5)(1/3)
      = 1/5

Splitting on X2 = 1 versus X2 = 0:

Error = P(X2 = 1) C(P(Y = 1 | X2 = 1)) + P(X2 = 0) C(P(Y = 1 | X2 = 0))
      = (3/5) C(2/3) + (2/5) C(1/2)
      = (3/5)(1/3) + (2/5)(1/2)
      = 2/5
• We should split the tree based on the predictor X1, since that split gives the smallest misclassification error (1/5, versus 2/5 for splitting on X2 or not splitting at all).
• The ensemble method combines different base classifiers together using a majority vote. It can utilize the strengths of all the methods and mitigate their limitations. Each base classifier must be different. Three examples of ensemble methods that can be used to improve decision trees are: bagging via the bootstrap, boosting, and random forests.
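
These error rates can also be checked with a small R sketch (not part of the original solution), where misclass() is a hypothetical helper giving the majority-vote misclassification error within a group of labels:

S = data.frame(Y  = c(1, 1, 0, 0, 1),
               X1 = c(1, 0, 0, 0, 1),
               X2 = c(1, 1, 1, 0, 0))
# majority-vote misclassification error within a set of labels
misclass = function(y) min(mean(y == 1), mean(y == 0))
misclass(S$Y)                                                     # no split: 2/5
weighted.mean(tapply(S$Y, S$X1, misclass), table(S$X1)/nrow(S))   # split on X1: 1/5
weighted.mean(tapply(S$Y, S$X2, misclass), table(S$X2)/nrow(S))   # split on X2: 2/5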
Question 4 
One of the hierarchical clustering algorithms is the agglomerative (bottom-up) procedure. The procedure starts with n singleton clusters and forms a hierarchy by merging the most similar clusters until all the data points are merged into one single cluster. Let the distance between two data points be the Euclidean distance d(x, y) = √((x1 − y1)^2 + ... + (xd − yd)^2). Let the distance between two clusters A and B be min_{x∈A, y∈B} d(x, y), the minimum distance between the points from the two clusters. There are 5 observations a, b, c, d and e. Their Euclidean distances are given in the following matrix:

     a    b    c    d    e
a    0    4    3    6   11
b    4    0    5    7   10
c    3    5    0    9    2
d    6    7    9    0   13
e   11   10    2   13    0
For example, based on the matrix above, the distance between a and b is 4. Please derive the four steps in 
the agglomerative clustering procedure to construct the hierarchical clustering for the dataset. For each step, 
you need to specify which two clusters are merged and why you choose these two to merge. 
Solution
Step 1. Starting from the distance matrix above, the smallest off-diagonal entry is 2, which is d(c, e), so we merge c and e into the cluster (ce). The single-linkage distances from (ce) to the remaining points are

d((ce), a) = min(d(c, a), d(e, a)) = min(3, 11) = 3
d((ce), b) = min(d(c, b), d(e, b)) = min(5, 10) = 5
d((ce), d) = min(d(c, d), d(e, d)) = min(9, 13) = 9

giving the updated distance matrix

       (ce)   a    b    d
(ce)     0
a        3    0
b        5    4    0
d        9    6    7    0

Step 2. The smallest entry is now 3, the distance between (ce) and a, so we merge them into (ace). The updated distances are

d((ace), b) = min(d((ce), b), d(a, b)) = min(5, 4) = 4
d((ace), d) = min(d((ce), d), d(a, d)) = min(9, 6) = 6

giving

       (ace)   b    d
(ace)    0
b        4    0
d        6    7    0

Step 3. The smallest entry is now 4, the distance between (ace) and b, so we merge them into (aceb). The updated distance is

d((aceb), d) = min(d((ace), d), d(b, d)) = min(6, 7) = 6

giving

        (aceb)   d
(aceb)     0
d          6     0

Step 4. Only two clusters remain, so (aceb) and d are merged at distance 6 into the single cluster (acebd), completing the hierarchy.
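The four merge steps can also be reproduced with R's built-in hclust() using single linkage; this is a minimal sketch (not part of the original solution) built from the distance matrix in the question:

D = matrix(c( 0,  4,  3,  6, 11,
              4,  0,  5,  7, 10,
              3,  5,  0,  9,  2,
              6,  7,  9,  0, 13,
             11, 10,  2, 13,  0),
           nrow = 5, byrow = TRUE,
           dimnames = list(letters[1:5], letters[1:5]))
hc = hclust(as.dist(D), method = "single")
hc$height   # merge heights 2, 3, 4, 6, matching the four steps above
plot(hc)    # dendrogram: (ce), then (ace), then (aceb), then all five points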
Question 5 
Analyze the German data set from the site: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). Apply the support vector machine analysis and the random forest analysis to the dataset. Please randomly select 800 observations as the training set and use your two models to predict the default status of the remaining 200 loans. Repeat this cross-validation one thousand times and calculate the average misclassification errors of the two models.
Solution 
library(e1071)         # svm()
library(randomForest)  # randomForest()

# germandata is assumed to have been read in from the UCI site above,
# with the response stored in the column Default.
set.seed(1)
n = nrow(germandata)
nt = 800
rep = 1000
error_SVM = numeric(rep)
error_RF = numeric(rep)
neval = n - nt
germandata$Default = factor(germandata$Default)
for (i in 1:rep) {
  training = sample(1:n, nt)
  trainingset = germandata[training, ]
  testingset = germandata[-training, ]
  # SVM analysis
  x = subset(trainingset, select = c('duration', 'amount', 'installment', 'age'))
  y = trainingset$Default
  xPrime = subset(testingset, select = c('duration', 'amount', 'installment', 'age'))
  yPrime = testingset$Default
  svm_model1 = svm(x, y)
  pred_SVM = predict(svm_model1, xPrime)
  tableSVM = table(yPrime, pred_SVM)
  error_SVM[i] = (neval - sum(diag(tableSVM))) / neval
  # Random forest analysis
  rf_classifier = randomForest(Default ~ ., data = trainingset,
                               ntree = 100, mtry = 2, importance = TRUE)
  prediction_RF = predict(rf_classifier, testingset)
  table_RF = table(yPrime, prediction_RF)
  error_RF[i] = (neval - sum(diag(table_RF))) / neval
}
mean(error_SVM)
mean(error_RF)
Because we are repeating the cross-validation 1,000 times, I decided to leave the final numbers out of the output, as it takes a while to run. However, running it through RStudio, we got the average misclassification error for SVM to be 0.282915 and for random forest to be 0.24224.
Question 6 
The idea of the support vector machine (SVM) is to maximize the distance of the separating plane to the closest observations, which are referred to as the support vectors. Let g(x) = w0 + w1 x1 + w2 x2 = 0 be the separating line. For a given sample x = (x1, x2), the distance of x to the straight line g(x) = 0 is

|w0 + w1 x1 + w2 x2| / √(w1^2 + w2^2).
• Let the separating line be x1 + 2x2 − 3 = 0, and the given observation is x = (1.5, 1.5). Calculate the 
distance of the observation to the separating line. 
• In the linear SVM, the dot product xi^T xj is an important operation which facilitates the calculation of the Euclidean distance. Let the nonlinear mapping of the sample from the original space to the projected space be ϕ. In the nonlinear SVM, the dot product between the images ϕ(xi) and ϕ(xj) is calculated by the kernel function K(xi, xj) = ϕ(xi)^T ϕ(xj). Suppose in the original space xi = (xi1, xi2) and xj = (xj1, xj2). The nonlinear mappings are ϕ(xi) = (xi1^2, xi2^2, √2 xi1 xi2) and ϕ(xj) = (xj1^2, xj2^2, √2 xj1 xj2). Calculate the kernel function K(xi, xj). If it is a polynomial kernel function, determine the degree of the polynomial kernel function.
Solution
• With w0 = −3, w1 = 1, w2 = 2 and x = (1.5, 1.5), the distance is

|w0 + w1 x1 + w2 x2| / √(w1^2 + w2^2) = |−3 + 1.5 + 2(1.5)| / √(1^2 + 2^2) = 1.5 / √5 = 3√5 / 10 ≈ 0.67

•

K(xi, xj) = ϕ(xi)^T ϕ(xj)
          = (xi1^2, xi2^2, √2 xi1 xi2)^T (xj1^2, xj2^2, √2 xj1 xj2)
          = xi1^2 xj1^2 + xi2^2 xj2^2 + 2 xi1 xi2 xj1 xj2
          = (xi1 xj1 + xi2 xj2)^2
          = (xi^T xj)^2

This is a polynomial kernel function of degree 2.
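
Both results can be verified numerically in R; this is a minimal sketch (not part of the original solution), using an arbitrary pair of test vectors for the kernel identity:

w0 = -3; w1 = 1; w2 = 2; x = c(1.5, 1.5)
abs(w0 + w1*x[1] + w2*x[2]) / sqrt(w1^2 + w2^2)   # 3*sqrt(5)/10, about 0.67

phi = function(u) c(u[1]^2, u[2]^2, sqrt(2)*u[1]*u[2])
xi = c(2, -1); xj = c(0.5, 3)                     # arbitrary test vectors
sum(phi(xi) * phi(xj))                            # phi(xi)' phi(xj)
(sum(xi * xj))^2                                  # (xi' xj)^2 -- same value, degree 2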
Question 7 
You don't need to submit this question on Crowdmark. This question is only for your practice. In the following table we have the playlists of 10 Spotify users. There are 5 artists: A, B, C, D and E. If the user chooses the artist, the corresponding entry will be 1; otherwise, it will be zero.
obs A B C D E 
1 1 1 0 1 1 
2 1 0 1 1 0 
3 0 1 1 1 0 
4 0 1 1 0 0 
5 0 1 1 0 1 
6 1 0 0 0 1 
7 1 1 1 1 1 
8 0 1 1 1 0 
9 0 0 1 1 1 
10 1 0 1 1 1 
• Suppose A is the antecedent and B is the consequent. Calculate the confidence of B and the lift of A on B. Based on the lift value, do you recommend B to the user after the user has played artist A? Why?
Solution 
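A minimal sketch of the computation, assuming the standard definitions confidence(A → B) = P(B | A) and lift(A → B) = P(B | A) / P(B), with the data taken from the playlist table above:

playlists = data.frame(
  A = c(1, 1, 0, 0, 0, 1, 1, 0, 0, 1),
  B = c(1, 0, 1, 1, 1, 0, 1, 1, 0, 0),
  C = c(0, 1, 1, 1, 1, 0, 1, 1, 1, 1),
  D = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 1),
  E = c(1, 0, 0, 0, 1, 1, 1, 0, 1, 1))
support_B  = mean(playlists$B)                      # 6/10 = 0.6
confidence = mean(playlists$B[playlists$A == 1])    # 2/5 = 0.4
lift       = confidence / support_B                 # about 0.67
# Since the lift is below 1, playing artist A makes B less likely than its
# baseline, so we would not recommend B after the user has played A.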