Question No.    1    2    3    4    5    6    7    8
Score          15   10   10   10   10   20   20    5
This exam paper contains 8 questions; the total score is 100 points. (Please hand in your answer sheet in digital form.)
Problem I. Least Squares (15 points)
a) Consider $Y = AX + V$ with $V \sim \mathcal{N}(v \mid 0, Q)$. What is the least-squares solution for $X$? (A numerical sketch follows this problem.)
b) If there is a constraint $bX = c$, what is the optimal solution for $X$?
c) If there is an additional constraint $X^T X = d$, in addition to the constraint in b), what is the optimal solution for $X$?
d) If both $A$ and $X$ are unknown, how can $A$ and $X$ be solved for alternately using the two constraints $X^T X = d$ and $\mathrm{Trace}(A^T A) = e$?
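A minimal numerical sketch of the unconstrained case in part a) (not part of the required derivation), assuming the standard generalized least-squares estimator $\hat{X} = (A^T Q^{-1} A)^{-1} A^T Q^{-1} y$; all dimensions and values below are illustrative.

```python
import numpy as np

# Generalized least squares for part (a): with V ~ N(0, Q), the
# weighted LS estimate is X_hat = (A^T Q^{-1} A)^{-1} A^T Q^{-1} y.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))          # known design matrix (example sizes)
Q = 0.1 * np.eye(20)                      # noise covariance
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + rng.multivariate_normal(np.zeros(20), Q)

Q_inv = np.linalg.inv(Q)
x_hat = np.linalg.solve(A.T @ Q_inv @ A, A.T @ Q_inv @ y)
print(x_hat)                              # should be close to x_true
```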
Problem II. Linear Gaussian System (10 points)
Consider $Y = AX + V$, where $X$ and $V$ are Gaussian, $X \sim \mathcal{N}(x \mid m_0, \Sigma_0)$ and $V \sim \mathcal{N}(v \mid 0, \beta^{-1} I)$. What are the conditional distribution $p(Y \mid X)$, the joint distribution $p(Y, X)$, the marginal distribution $p(Y)$, the posterior distribution $p(X \mid Y = y, \beta, m_0, \Sigma_0)$, and the posterior predictive distribution $p(\hat{Y} \mid Y = y, \beta, m_0, \Sigma_0)$, respectively?
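A brief numerical sketch of the posterior and marginal in this problem, using the standard linear-Gaussian identities; the dimensions and parameter values below are illustrative assumptions.

```python
import numpy as np

# Gaussian posterior p(X | Y = y) for Y = AX + V, with
# X ~ N(m0, S0) and V ~ N(0, beta^{-1} I).
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
m0, S0, beta = np.zeros(3), np.eye(3), 5.0
y = A @ rng.standard_normal(3) + rng.standard_normal(10) / np.sqrt(beta)

S_post = np.linalg.inv(np.linalg.inv(S0) + beta * A.T @ A)   # posterior covariance
m_post = S_post @ (np.linalg.inv(S0) @ m0 + beta * A.T @ y)  # posterior mean

# Marginal p(Y): mean A m0, covariance beta^{-1} I + A S0 A^T.
y_mean, y_cov = A @ m0, np.eye(10) / beta + A @ S0 @ A.T
```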
Problem III. Linear Regression (10 points)
Consider $y = w^T \phi(x) + v$, where $v$ is Gaussian, i.e., $v \sim \mathcal{N}(v \mid 0, \beta^{-1})$, and $w$ has a Gaussian prior, i.e., $w \sim \mathcal{N}(w \mid m_0, \alpha^{-1} I)$. Assuming that $\phi(x)$ is known, please derive the posterior distribution and posterior predictive distribution, $p(w \mid D, \beta, m_0, \alpha)$ and $p(\hat{y} \mid D, \beta, m_0, \alpha)$, respectively, where $D = \{\phi_n, y_n\}$ is the training data set and $\phi_n = \phi(x_n)$.
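A short sketch of the posterior and predictive being asked for, following the standard Bayesian linear-regression identities for this model; the data, dimensions, and parameter values are illustrative assumptions.

```python
import numpy as np

# Posterior for y = w^T phi(x) + v, v ~ N(0, beta^{-1}),
# prior w ~ N(m0, alpha^{-1} I):
#   S_N^{-1} = alpha I + beta Phi^T Phi
#   m_N      = S_N (alpha m0 + beta Phi^T y)
rng = np.random.default_rng(2)
Phi = rng.standard_normal((50, 4))        # rows are phi_n = phi(x_n)
y = Phi @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.standard_normal(50) * 0.3
alpha, beta, m0 = 1.0, 1.0 / 0.3**2, np.zeros(4)

S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = S_N @ (alpha * m0 + beta * Phi.T @ y)

# Predictive at a new feature vector phi_star:
phi_star = rng.standard_normal(4)
pred_mean = m_N @ phi_star
pred_var = 1.0 / beta + phi_star @ S_N @ phi_star
```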
Problem IV. Logistic Regression (10 points)
Consider a two-class classification problem with the logistic sigmoid function, $y = \sigma(w^T \phi(x))$, for a given data set $D = \{\phi_n, t_n\}$, where $t_n \in \{0, 1\}$, $\phi_n = \phi(x_n)$, $n = 1, \ldots, N$, and the likelihood function is given by
$$p(t \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
where $w$ has a Gaussian prior, i.e., $w \sim \mathcal{N}(w \mid m_0, \alpha^{-1} I)$. Please derive the posterior distribution and posterior predictive distribution, $p(w \mid D, m_0, \alpha)$ and $p(t \mid D, m_0, \alpha)$, respectively. (Hint: use the Laplace approximation.)
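A compact sketch of the Laplace-approximation recipe the hint refers to: Newton iterations to the MAP weights, the inverse Hessian as the Gaussian posterior covariance, and the standard probit approximation for the predictive. The toy data and fixed iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data set with prior w ~ N(m0, alpha^{-1} I).
rng = np.random.default_rng(3)
Phi = rng.standard_normal((100, 3))
t = (Phi @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
alpha, m0 = 1.0, np.zeros(3)

w = m0.copy()
for _ in range(20):                       # Newton (IRLS) steps toward the MAP point
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t) + alpha * (w - m0)
    H = Phi.T @ (Phi * (y * (1 - y))[:, None]) + alpha * np.eye(3)
    w -= np.linalg.solve(H, grad)

# Laplace posterior covariance: inverse Hessian at the MAP point.
y = sigmoid(Phi @ w)
H = Phi.T @ (Phi * (y * (1 - y))[:, None]) + alpha * np.eye(3)
S_N = np.linalg.inv(H)

# Predictive p(t = 1 | phi_star) via the probit approximation.
phi_star = rng.standard_normal(3)
mu_a = w @ phi_star
var_a = phi_star @ S_N @ phi_star
p_pred = sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))
```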
Problem V. Neural Network (10 points)
Consider a two-layer neural network described by the following equations:
$$a_1 = w^{(1)} x, \quad a_2 = w^{(2)} z, \quad z = h(a_1), \quad y = \sigma(a_2)$$
where $x$ and $y$ are the input and output, respectively, of the neural network, $h(\cdot)$ is a nonlinear function, and $\sigma(\cdot)$ is the sigmoid function.
(1) Please derive the following gradients: $\frac{\partial y}{\partial w^{(1)}}$, $\frac{\partial y}{\partial w^{(2)}}$, $\frac{\partial y}{\partial a_1}$, $\frac{\partial y}{\partial a_2}$, and $\frac{\partial y}{\partial x}$. (See the sketch after this problem.)
(2) Please derive the updating rules for $w^{(1)}$ and $w^{(2)}$ given the classification error between $y$ and $t$, where $t$ is the ground truth of the output $y$.
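A short sketch of the forward pass and the gradients in part (1), assuming $h = \tanh$, a scalar output, and illustrative dimensions; the closing comment indicates the part (2) updates under a cross-entropy error.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
x = rng.standard_normal(5)
W1 = rng.standard_normal((4, 5))          # w^(1)
w2 = rng.standard_normal(4)               # w^(2)

a1 = W1 @ x                               # a1 = w^(1) x
z = np.tanh(a1)                           # z = h(a1), assuming h = tanh
a2 = w2 @ z                               # a2 = w^(2) z
y = sigmoid(a2)                           # y = sigma(a2)

dy_da2 = y * (1 - y)                      # sigma'(a2) = y(1 - y)
dy_dw2 = dy_da2 * z                       # dy/dw^(2)
dy_da1 = dy_da2 * w2 * (1 - z**2)         # chain rule through h' = 1 - tanh^2
dy_dW1 = np.outer(dy_da1, x)              # dy/dw^(1)
dy_dx = W1.T @ dy_da1                     # dy/dx

# For part (2), with cross-entropy error E = -(t ln y + (1 - t) ln(1 - y)),
# dE/da2 simplifies to (y - t), giving the gradient-descent updates
#   w2 <- w2 - eta * (y - t) * z
#   W1 <- W1 - eta * np.outer((y - t) * w2 * (1 - z**2), x)
```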
Problem VI. Bayesian Neural Network (20 points)
a) Consider a neural network for regression, $t = y(w, x) + v$, where $v$ is Gaussian, i.e., $v \sim \mathcal{N}(v \mid 0, \beta^{-1})$, and $w$ has a Gaussian prior, i.e., $w \sim \mathcal{N}(w \mid m_0, \alpha^{-1} I)$. Assuming that $y(w, x)$ is the neural network output, please derive the posterior distribution, $p(w \mid D, \beta, m_0, \alpha)$, and the posterior predictive distribution, $p(t \mid D, \beta, m_0, \alpha)$, where $D = \{x, t\}$. (A sketch of the resulting predictive follows this problem.)
b) Consider a neural network for two-class classification, $y = \sigma(f(w, x))$, and a data set $\{x_n, t_n\}$, where $t_n \in \{0, 1\}$, $w$ has a Gaussian prior, i.e., $w \sim \mathcal{N}(w \mid 0, \alpha^{-1} I)$, and $f(w, x)$ is the neural network model. Please derive the posterior and posterior predictive distributions, $p(w \mid D, \alpha)$ and $p(t \mid D, \alpha)$, respectively, where $D = \{x, t\}$.
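A minimal sketch of the Laplace predictive for part a), assuming the MAP weights, the output gradient $g = \nabla_w y(w, x)$ at $w_{\mathrm{MAP}}$, and the Hessian $A$ of the negative log posterior are already available (e.g., from backprop); the arrays below are placeholders, not a trained network.

```python
import numpy as np

# Linearizing y(w, x) around w_MAP gives the Gaussian predictive
#   p(t | x, D) ~= N(t | y(w_MAP, x), beta^{-1} + g^T A^{-1} g).
rng = np.random.default_rng(5)
beta = 10.0
g = rng.standard_normal(6)                # placeholder: grad of network output w.r.t. w
A = np.eye(6) * 3.0                       # placeholder: Hessian of -log p(w | D) at w_MAP
y_map = 0.7                               # placeholder: network output y(w_MAP, x)

pred_mean = y_map
pred_var = 1.0 / beta + g @ np.linalg.solve(A, g)
```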
Problem VII. Critical Analyses (20 Points)
a) Please explain why the dual problem formulation is used to solve the SVM machine learning problem.
b) Please explain, in terms of cost functions, constraints, and predictions, i) the differences between SVM classification and logistic regression; and ii) the differences between ε-SVM regression and least-squares regression.
c) Please explain why neural network (NN) based machine learning algorithms use logistic activation functions.
d) Please explain i) the differences between the logistic activation function and other activation functions (e.g., ReLU, tanh); and ii) when each of these activation functions should be used.
e) Please explain why Jacobian and Hessian matrices are useful for machine learning algorithms.
f) Please explain why exponential family distributions are so common in engineering practice. Please give some examples of distributions that are NOT in the exponential family.
g) Please explain why the KL divergence is useful for machine learning, and provide two examples of its use in machine learning.
h) Please explain why data augmentation techniques act as a form of regularization for NNs.
i) Please explain why Gaussian distributions are preferred over other distributions in many machine learning models.
j) Please explain why the Laplace approximation is applicable in many cases.
k) What are the fundamental principles for model selection (degree of complexity) in machine learning?
l) How should a new data sample (feature) be chosen for training regression and classification models, respectively? How should it be chosen for testing? Please provide some examples.
m) Please explain why the MAP model is usually preferred over the ML model.
Problem VIII. Discussion (5 Points)
What are the generative and discriminative approaches to machine learning, respectively? Can you explain the advantages and disadvantages of these two approaches and provide a detailed example to illustrate your points?