Machine Learning Exercise 6 Solution

Machine Learning Course
1 Convexity


Recall that we say that a function f is convex if the domain of f is a convex set and
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), for all x,y in the domain of f, 0 ≤ θ ≤ 1.
We say that f is strictly convex if
f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y), for all x ≠ y in the domain of f, 0 < θ < 1.
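These definitions can be checked numerically on concrete functions before attempting the proofs. The following sketch is purely illustrative: the test functions, grid of points, and tolerance are arbitrary choices, not part of the exercise.

```python
# Minimal numerical sanity check of the convexity inequality (not a proof).
# The test functions and grids below are illustrative assumptions.
import numpy as np

def violates_convexity(f, xs, ys, thetas, tol=1e-12):
    """Return True if f(theta*x + (1-theta)*y) > theta*f(x) + (1-theta)*f(y)
    for any tested triple (x, y, theta)."""
    for x in xs:
        for y in ys:
            for theta in thetas:
                lhs = f(theta * x + (1 - theta) * y)
                rhs = theta * f(x) + (1 - theta) * f(y)
                if lhs > rhs + tol:
                    return True
    return False

pts = np.linspace(-3.0, 3.0, 25)
thetas = np.linspace(0.0, 1.0, 11)
print(violates_convexity(lambda x: x**2, pts, pts, thetas))       # False: x^2 is convex
print(violates_convexity(lambda x: np.sin(x), pts, pts, thetas))  # True: sin is not convex on [-3, 3]
```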
Prove the following assertions.
1. The affine function f(x) = ax + b is convex, where a,b and x are scalars.
2. If multiple functions fn(x) are convex over a fixed domain, then their sum g(x) = Σn fn(x) is convex over the same domain.
3. Take f,g : R → R to be convex functions and g to be increasing. Then the function g ◦ f defined as (g ◦ f)(x) = g(f(x)) is also convex.
Note: A function g is increasing if a ≥ b ⇒ g(a) ≥ g(b). An example of a convex and increasing function is exp(x), x ∈ R.
4. If f : R → R is convex, then g : RD → R, where g(x) := f(w⊤x + b), is also convex. Here, w is a constant vector in RD, b is a constant in R and x ∈ RD.
5. Let f : RD → R be strictly convex. Let x⋆ ∈ RD be a global minimizer of f. Show that this global minimizer is unique. Hint: Do a proof by contradiction.
2 Extension of Logistic Regression to Multi-Class Classification
Suppose we have a classification dataset with N data example pairs {xn,yn}, n ∈ [1,N], and yn is a categorical variable over K categories, yn ∈ {1,2,...,K}. We wish to fit a linear model in a similar spirit to logistic regression, and we will use the softmax function to link the linear inputs to the categorical output, instead of the logistic function.
We will have K sets of parameters wk, define ηnk = wk⊤xn, and compute the probability distribution of the output as follows:

p(yn = k | xn, w) = exp(ηnk) / Σj exp(ηnj),   where the sum in the denominator runs over j = 1,...,K.
Note that we indeed have a probability distribution, as Σk p(yn = k | xn, w) = 1. To make the model identifiable, we will fix wK to 0, which means we have K−1 sets of parameters to learn. As in logistic regression, we will assume that the yn are conditionally independent given their inputs, i.e.,

p(y | X, w) = Πn p(yn | xn, w),   with the product taken over n = 1,...,N.
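The softmax link above is easy to compute in code. The following is a minimal sketch on a small synthetic dataset; the array shapes and random data are illustrative assumptions only.

```python
# Sketch of the softmax link described above, with w_K fixed to 0 for identifiability.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 3, 4                        # number of examples, features, classes (illustrative)
X = rng.normal(size=(N, D))              # rows are x_n
W = np.zeros((K, D))                     # one parameter vector w_k per class
W[:K - 1] = rng.normal(size=(K - 1, D))  # w_K stays 0

eta = X @ W.T                            # eta_{nk} = w_k^T x_n, shape (N, K)
eta -= eta.max(axis=1, keepdims=True)    # shift for numerical stability (does not change the softmax)
P = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)  # p(y_n = k | x_n, w)

print(P.shape)        # (N, K)
print(P.sum(axis=1))  # each row sums to 1, as required of a probability distribution
```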
1. Derive the log-likelihood for this model.
Hint: It might be helpful to use the indicator function 1yn=k, that is equal to one if yn = k and 0 otherwise.
2. Derive the gradient with respect to each wk.
3. Show that the negative of the log-likelihood is jointly convex in w1,...,wK. Hint: you can use Hölder's inequality:

Σi |ai bi| ≤ (Σi |ai|^p)^(1/p) (Σi |bi|^q)^(1/q),

where p, q > 1 and 1/p + 1/q = 1.
3 Mixture of Linear Regression
When a regression dataset contains two or more distinct clusters, a mixture of linear regression models can fit it better than a single linear regression model.
Consider a regression dataset with N pairs {yn, xn}. Similar to the Gaussian mixture model (GMM), let rn ∈ {1,2,...,K} index the mixture component. The distribution of the output yn under the kth linear model is defined as follows:

p(yn | xn, rn = k, w) = N(yn | x̃n⊤wk, σ²).

Here, wk is the regression parameter vector for the kth model, and w is a vector containing all the wk. Also, denote by x̃ the input augmented with a constant 1 for the bias term, x̃ := [1, x⊤]⊤.
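The component density can be evaluated directly in code. The sketch below assumes Gaussian noise with a shared variance σ² = 1 and uses made-up parameter values; both are illustrative assumptions rather than part of the exercise.

```python
# Sketch of the k-th component density p(y_n | x_n, r_n = k, w):
# a Gaussian with mean w_k^T x_tilde_n and an illustrative shared variance sigma^2.
import numpy as np

def component_density(y_n, x_n, w_k, sigma2=1.0):
    x_tilde = np.concatenate(([1.0], x_n))  # prepend a constant 1 for the bias term
    mean = w_k @ x_tilde                    # w_k^T x_tilde_n
    return np.exp(-(y_n - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x_n = np.array([0.5, -1.2])
w_k = np.array([0.3, 2.0, -0.7])            # bias weight followed by feature weights (illustrative)
print(component_density(1.0, x_n, w_k))
```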
2. Write the expression for the joint distribution p(y|X,w,r) where r is the set of all r1,r2,...,rN.
3. Assume that rn follows a multinomial distribution p(rn = k|π) = πk, with π = [π1,π2,...,πK]. Derive the marginal distribution p(yn|xn,w,π) obtained after marginalizing rn out.
4. Write the expression for the negative log-likelihood L(w,π) := −log p(y|X,w,π), whose minimizer is the maximum likelihood estimator, in terms of the data y and X and the parameters w and π (a numerical sketch appears after this list).
5. (a) Is L jointly convex with respect to w and π?
(b) Is the model identifiable? That is, can the maximum likelihood estimator recover the true parameters w and π, assuming we have infinitely many samples? Prove your answers.
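For questions 3 and 4, the following sketch evaluates the marginal likelihood numerically, under the same illustrative assumption as above of Gaussian noise with a shared variance σ² = 1; the synthetic data and parameter values are arbitrary.

```python
# Numerical sketch of -log p(y | X, w, pi) for the mixture of linear regressions,
# obtained by marginalizing each r_n out (sum over k of pi_k times the k-th Gaussian).
import numpy as np

def gaussian(y, mean, sigma2=1.0):
    return np.exp(-(y - mean) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mixture_nll(y, X, W, pi, sigma2=1.0):
    """-log p(y | X, w, pi): for each n, sum over the K components weighted by pi."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias column
    means = X_tilde @ W.T                                # shape (N, K): mean of y_n under model k
    marginals = (pi * gaussian(y[:, None], means, sigma2)).sum(axis=1)
    return -np.log(marginals).sum()

rng = np.random.default_rng(2)
N, D, K = 30, 2, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D + 1))   # each row is [bias, feature weights] for one component
pi = np.full(K, 1.0 / K)          # uniform mixture weights, purely illustrative
y = rng.normal(size=N)
print(mixture_nll(y, X, W, pi))
```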