CS7643: Problem Set 0

1         Multiple Choice Questions
1.    (true/false) We are machine learners with a slight gambling problem (very different from gamblers with a machine learning problem!). Our friend, Bob, is proposing the following payout on the roll of a die:

                                                                             payout                                          (1)

where x ∈ {1,2,3,4,5,6} is the outcome of the roll, (+) means payout to us and (−) means payout to Bob. Is this a good bet, i.e., are we expected to make money?

                True              False
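A minimal sketch of how such a bet can be sanity-checked numerically: average the payout over the six equally likely outcomes and look at the sign. The `payout` function below is only a placeholder standing in for the rule in equation (1); substitute the actual rule before reading anything into the result.

```python
# Sketch: expected value of a payout over a fair six-sided die.
# `payout` is a placeholder; replace it with the rule from equation (1).
from fractions import Fraction

def payout(x):
    # Placeholder payout for outcome x; substitute equation (1) here.
    return Fraction(x)

expected = sum(Fraction(1, 6) * payout(x) for x in range(1, 7))
print(expected)  # the bet is favorable to us iff this is positive
```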

2.    X is a continuous random variable with the probability density function:

                                                                                                        (2)

Which of the following statements is true about the corresponding cumulative distribution function (CDF) C(x)?

[Hint: Recall that the CDF is defined as C(x) = Pr(X ≤ x).]

 

  All of the above

  None of the above
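As a hedged sketch of how a candidate CDF can be checked against a density, the snippet below integrates a stand-in pdf numerically using the definition C(x) = Pr(X ≤ x). The exponential density used here is only an example, not the density from equation (2), and SciPy is assumed to be available.

```python
# Sketch: checking a candidate CDF against its pdf numerically.
# The Exp(1) density is a stand-in; swap in the pdf from equation (2).
import math
from scipy.integrate import quad

def pdf(t):
    # Stand-in density: Exp(1) on t >= 0.
    return math.exp(-t) if t >= 0 else 0.0

def cdf_numeric(x):
    # C(x) = Pr(X <= x) = integral of the pdf from -infinity to x.
    val, _ = quad(pdf, -math.inf, x)
    return val

for x in (0.5, 1.0, 2.0):
    print(x, cdf_numeric(x), 1 - math.exp(-x))  # the two columns agree for the stand-in pdf
```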

3.    A random variable x drawn from the standard normal distribution has the probability density

                                                         p(x) = (1/√(2π)) exp(−x²/2)                                                                         (3)

Evaluate the following integral:

                                                                                                                        (4)

[Hint: We are not sadistic (okay, we’re a little sadistic, but not for this question). This is not a calculus question.]

                  a + b + c               c              a + c               b + c
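A hedged numerical sketch, not a substitute for the hint: any polynomial integrand against the standard normal density reduces to a combination of its low-order moments, which can be checked with SciPy (assumed available) as below. This does not reproduce the integrand of equation (4).

```python
# Sketch: low-order moments of the standard normal, computed numerically.
import math
from scipy.integrate import quad

def phi(x):
    # Standard normal density from equation (3).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for k in range(4):
    m, _ = quad(lambda x, k=k: x**k * phi(x), -math.inf, math.inf)
    print(f"E[x^{k}] ≈ {m:.6f}")  # expect 1, 0, 1, 0
```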

4.    Consider the following function of x = (x1,x2,x3,x4,x5,x6):

                                                         (5)

where σ is the sigmoid function

                                                                            σ(z) = 1 / (1 + exp(−z))                                                                            (6)

Compute the gradient ∇xf(·) and evaluate it at x̂ = (5,−1,6,12,7,−5).

 
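For cross-checking an analytic gradient, the sketch below evaluates ∇x f at x̂ with PyTorch autograd. The body of `f` is a placeholder standing in for equation (5); replace it with the actual expression before comparing numbers.

```python
# Sketch: evaluating a gradient at a point with PyTorch autograd.
# `f` is a placeholder; substitute the expression from equation (5).
import torch

def sigmoid(z):
    # Sigmoid from equation (6).
    return 1.0 / (1.0 + torch.exp(-z))

def f(x):
    # Placeholder composite function of x = (x1, ..., x6); replace with equation (5).
    return sigmoid(x.sum())

x_hat = torch.tensor([5.0, -1.0, 6.0, 12.0, 7.0, -5.0], requires_grad=True)
f(x_hat).backward()
print(x_hat.grad)  # numerical gradient at x_hat, for cross-checking the analytic result
```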

5.    Which of the following functions are convex?

 

 x for x ∈ Rn   for w ∈ Rd

  All of the above
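A rough way to screen the candidates, sketched below under the assumption that NumPy is available: sample random pairs of points and test the midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2. A violation refutes convexity; passing the test is only evidence, not a proof.

```python
# Sketch: a quick (non-rigorous) midpoint-convexity check for a candidate function.
import numpy as np

def looks_convex(f, dim, trials=10000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a, b = rng.normal(size=dim), rng.normal(size=dim)
        mid = 0.5 * (a + b)
        if f(mid) > 0.5 * (f(a) + f(b)) + 1e-9:
            return False  # found a violation of the convexity inequality
    return True

print(looks_convex(lambda x: np.linalg.norm(x), dim=5))   # the Euclidean norm is convex
print(looks_convex(lambda x: np.sin(x).sum(), dim=5))     # sum of sines is not convex
```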

6.    Suppose you want to predict an unknown value Y ∈ R, but you are only given a sequence of noisy observations x1, …, xn of Y with i.i.d. noise. If we assume the noise is i.i.d. Gaussian, the maximum likelihood estimate ŷ for Y can be given by:

 = argmin 

 = argmin 

 

  Both A & C

  Both B & C
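A small numerical sketch of the underlying idea, assuming SciPy is available: under i.i.d. Gaussian noise the negative log-likelihood is, up to constants, a sum of squared residuals, so minimizing it numerically should land on the sample mean.

```python
# Sketch: Gaussian MLE for a constant signal reduces to least squares / the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
Y_true = 3.0
x = Y_true + rng.normal(scale=0.5, size=100)   # noisy observations of Y

nll = lambda y: np.sum((x - y) ** 2)           # Gaussian negative log-likelihood, up to constants
y_hat = minimize_scalar(nll).x
print(y_hat, x.mean())                         # the two estimates agree
```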

2         Proofs
7.    Prove that

                                                                               logₑ x ≤ x − 1,             ∀x > 0                                        (7)

with equality if and only if x = 1.

[Hint: Consider differentiation of log(x) − (x − 1) and think about concavity/convexity and second derivatives.]
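A quick numerical sanity check of the claim (not a proof), along the lines of the hint, using the auxiliary function g(x) = log(x) − (x − 1):

```python
# Sketch: check numerically that g(x) = log(x) - (x - 1) peaks at 0, attained near x = 1.
import numpy as np

x = np.linspace(1e-6, 10, 100000)
g = np.log(x) - (x - 1)
print(g.max())           # should be approximately 0
print(x[np.argmax(g)])   # maximizer should be approximately 1, where g(1) = 0
```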


8.    Consider two discrete probability distributions p and q over k outcomes:

                                                                                        Σᵢ₌₁ᵏ pᵢ  =  Σᵢ₌₁ᵏ qᵢ  =  1                                 (8a)

                                                                           pᵢ > 0,  qᵢ > 0,            ∀i ∈ {1,...,k}                                    (8b)

The Kullback-Leibler (KL) divergence (also known as the relative entropy) between these distributions is given by:

                                                                           KL(p,q) = Σᵢ₌₁ᵏ pᵢ log(pᵢ/qᵢ)                                                    (9)

It is common to refer to KL(p,q) as a measure of distance (even though it is not a proper metric). Many algorithms in machine learning are based on minimizing the KL divergence between two probability distributions. In this question, we will show why this might be a sensible thing to do.

[Hint: This question doesn’t require you to know anything more than the definition of KL(p,q) and the identity in Q7]

(a)     Using the results from Q7, show that KL(p,q) is always non-negative.


(b)    When is KL(p,q) = 0?


(c)     Provide a counterexample to show that the KL divergence is not a symmetric function of its arguments: KL(p,q) ≠ KL(q,p).
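A small numerical illustration (not the requested counterexample write-up): for two concrete 3-outcome distributions, both KL(p,q) and KL(q,p) come out non-negative but unequal. NumPy is assumed available.

```python
# Sketch: KL divergence is non-negative but not symmetric, on a concrete example.
import numpy as np

def kl(p, q):
    # KL(p, q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.25, 0.25]
q = [0.7, 0.2, 0.1]
print(kl(p, q), kl(q, p))   # both non-negative, and not equal to each other
```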


9. In this question, you will prove that the cross-entropy loss for a softmax classifier is convex in the model parameters, so gradient descent is guaranteed to find the optimal parameters. Formally, consider a single training example (x,y). Simplifying the notation slightly from the implementation writeup, let

                                                                                             z = Wx + b,                                                    (10)

                                                                              ŷ = softmax(z),     ŷⱼ = exp(zⱼ) / Σₖ exp(zₖ),                                                    (11)

                                                                              L(W) = −log ŷ_y.                                                    (12)

Prove that L(·) is convex in W.

[Hint: One way of solving this problem is “brute force” with first principles and Hessians. There are more elegant solutions.]
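As numerical evidence only (not the requested proof), the sketch below uses PyTorch to form the Hessian of the loss with respect to a flattened W at a random point and checks that its smallest eigenvalue is non-negative; the specific dimensions and data are arbitrary choices.

```python
# Sketch: numerical evidence that softmax cross-entropy is convex in W
# (Hessian at a random point should be positive semi-definite).
import torch

torch.manual_seed(0)
num_classes, dim = 4, 3
x = torch.randn(dim)
b = torch.randn(num_classes)
y = 2  # arbitrary true class label

def loss(W_flat):
    W = W_flat.view(num_classes, dim)
    z = W @ x + b                              # equation (10)
    return -torch.log_softmax(z, dim=0)[y]     # cross-entropy loss for class y

W0 = torch.randn(num_classes * dim)
H = torch.autograd.functional.hessian(loss, W0)
eigs = torch.linalg.eigvalsh(H)
print(eigs.min())   # >= 0 (up to numerical error) at this point
```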
