EN.601.682 Deep Learning Homework 1

1.    In this exercise you are going to derive the well-known sigmoid expression for a Bernoulli-distributed (binary) problem. The probability of the "positive" event occurring is p. The probability of the "negative" event occurring is q = 1 − p.

(a)     What are the odds o of the "positive" event occurring? Please express the result using p only.

In statistics, the logit of the probability is the logarithm of the corresponding odds, i.e. logit(p) = log(o).

(b)     Given logit(p) = x, please derive the inverse function logit⁻¹(x). Please express the result using x only.

The inverse function of the logit in (b) is actually the sigmoid function S(x). You may already have noticed that the probability p = logit⁻¹(x) = S(x). This means that the range of the sigmoid function is the same as the range of a probability, i.e. (0, 1). The domain of the sigmoid function is (−∞, ∞). Therefore, the sigmoid function maps all real numbers to the interval (0, 1).
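As a quick sanity check of (a) and (b), the short Python sketch below (not part of the assignment) assumes the standard closed form S(x) = 1/(1 + e^(−x)) that the derivation should arrive at, and verifies numerically that it inverts the logit and always lands strictly inside (0, 1).

```python
import math

def logit(p):
    """Log-odds of probability p: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Standard logistic sigmoid S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# The sigmoid should invert the logit and stay strictly inside (0, 1).
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    p = sigmoid(x)
    assert 0.0 < p < 1.0
    assert abs(logit(p) - x) < 1e-9
print("sigmoid inverts the logit on the tested points")
```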

(c)      Now we look into the saturation of the sigmoid function. Calculate the value of the sigmoid function S(x) for x = ±100, ±10, and 0. Round the results to two decimal places.

(d)     Calculate the derivative of the sigmoid function, S′(x), and the values of S′(x) for x = ±100, ±10, and 0. Round the results to two decimal places.

You may have noticed that S(±100) is very close to S(±10), and that the derivatives at x = ±100 and x = ±10 are very close to zero. This is the saturation of the sigmoid function when |x| is large. Saturation causes great difficulty in training deep neural networks; this will reappear in later lectures.
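The values requested in (c) and (d) can be double-checked numerically. The sketch below is only a check, not the requested derivation: it approximates S′(x) with a central finite difference rather than a closed-form expression.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def num_deriv(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Tabulate S(x) and an approximate S'(x) at the points from (c) and (d).
for x in [-100, -10, 0, 10, 100]:
    print(f"x = {x:4d}   S(x) = {sigmoid(x):.2f}   S'(x) ~ {num_deriv(sigmoid, x):.2f}")
```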

2.    Recall that in class we learned the form of a linear classifier, f(x; W) = Wx + b. We will soon learn that iteratively updating the weights in the negative gradient direction allows us to slowly move towards an optimal solution; we will call this technique backpropagation. Obviously, computing gradients is an important component of this technique. We will investigate the first derivative of a commonly used loss function: the softmax loss. Here, we consider a multinomial (multiple-class) problem. Let's first define the notation:

input features: x ∈ R^D
target labels (one-hot encoded): y ∈ {0, 1}^K
multinomial linear classifier: f = Wx + b, with W ∈ R^{K×D} and f, b ∈ R^K
e.g., for the k-th class: f_k = w_k^T x + b_k, corresponding to y_k,
where w_k^T is the k-th row of W and k ∈ {1, ..., K}


(a)     Please express the softmax loss of logistic regression, L(x, W, b, y), using the above notation.

(b)     Please calculate its gradient with respect to the weights, ∂L/∂w_k.
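If it helps to verify the analytic result, here is a minimal NumPy sketch under the usual definition of the softmax (cross-entropy) loss; the dimensions K, D and all values are made-up placeholders, and the finite-difference loop is only a numerical check against whatever gradient you derive in (b).

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 3                          # hypothetical number of classes / feature dimension
W = rng.normal(size=(K, D))
b = rng.normal(size=K)
x = rng.normal(size=D)
y = np.eye(K)[1]                     # one-hot target; class index 1 is arbitrary

def softmax_loss(W, b, x, y):
    f = W @ x + b                    # scores f = Wx + b
    f = f - f.max()                  # shift for numerical stability
    p = np.exp(f) / np.exp(f).sum()  # softmax probabilities
    return -np.sum(y * np.log(p))    # cross-entropy against the one-hot target

# Finite-difference gradient with respect to W, to check the analytic answer from (b).
eps = 1e-6
grad_W = np.zeros_like(W)
for k in range(K):
    for d in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, d] += eps
        Wm[k, d] -= eps
        grad_W[k, d] = (softmax_loss(Wp, b, x, y) - softmax_loss(Wm, b, x, y)) / (2 * eps)
print(grad_W)                        # compare with your analytic expression
```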

3.    In class, we briefly touched upon the Kullback-Leibler (KL) divergence as another loss function to quantify the agreement between two distributions p and q. In machine learning scenarios, one of these two distributions is determined by our training data, while the other is generated as the output of our model. The goal of training our model is to match these two distributions as well as possible. The KL divergence is asymmetric, so how we assign these distributions to p and q matters. Here, you will investigate this difference by calculating the gradient. The KL divergence is defined as

KL(p||q) = Σ_d p(d) log(p(d) / q(d)).

(a)     Show that KL divergence is asymmetric using the following example. We define a discrete random variable X. Now consider the case that we have two sampling distributions P(x) and Q(x), which we present as two vectors that express the frequency of event x:

P(x) = [1, 6, 12, 5, 2, 8, 12, 4]

Q(x) = [1, 3, 6, 8, 15, 10, 5, 2]

Please compute 1) the probability distributions p(x) and q(x) (hint: normalize the frequencies); and 2) both directions of the KL divergence, KL(p||q) and KL(q||p).
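A minimal Python sketch for checking the arithmetic is given below; it normalizes the frequency vectors from the problem and evaluates both KL directions, assuming the natural logarithm.

```python
import math

# Frequency vectors from the problem statement.
P = [1, 6, 12, 5, 2, 8, 12, 4]
Q = [1, 3, 6, 8, 15, 10, 5, 2]

# 1) Normalize to probability distributions.
p = [v / sum(P) for v in P]
q = [v / sum(Q) for v in Q]

# 2) Both directions of the KL divergence.
def kl(a, b):
    """KL(a || b) = sum_d a(d) * log(a(d) / b(d)), natural log."""
    return sum(ad * math.log(ad / bd) for ad, bd in zip(a, b) if ad > 0)

print("KL(p||q) =", kl(p, q))
print("KL(q||p) =", kl(q, p))
```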

(b)     Next, we try to optimize the weights W of a model in an attempt to minimize the KL divergence. As a consequence, q = q_W now depends on the weights. Please express KL(q_W||p) and KL(p||q_W) as optimization objective functions. Can you tell which direction is easier to compute? To find out, look back at the original expressions of KL(q_W||p) and KL(p||q_W) and see which terms can be grouped into a constant; this constant then drops out when calculating the gradient. Then, please also calculate the gradients of KL(q_W||p) and KL(p||q_W) with respect to q_W(d), the d-th element of q_W.
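If you want to verify your analytic gradients numerically, the following sketch (with made-up distributions p and q_W, perturbing only the d-th element and deliberately not re-normalizing) approximates the formal partial derivatives of both objectives with respect to q_W(d) by central finite differences.

```python
import math

def kl(a, b):
    """KL(a || b) with the natural log; terms with a(d) = 0 contribute 0."""
    return sum(ad * math.log(ad / bd) for ad, bd in zip(a, b) if ad > 0)

p  = [0.1, 0.2, 0.3, 0.4]        # hypothetical fixed data distribution
qW = [0.25, 0.25, 0.25, 0.25]    # hypothetical model distribution q_W

d, eps = 2, 1e-6                 # perturb only the d-th element

def perturbed(v, d, delta):
    w = list(v)
    w[d] += delta
    return w

# Formal partial derivatives of the two objectives with respect to q_W(d).
g_qp = (kl(perturbed(qW, d, eps), p) - kl(perturbed(qW, d, -eps), p)) / (2 * eps)
g_pq = (kl(p, perturbed(qW, d, eps)) - kl(p, perturbed(qW, d, -eps))) / (2 * eps)
print("d/dq_W(d) KL(q_W||p) ~", g_qp)
print("d/dq_W(d) KL(p||q_W) ~", g_pq)
```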

4.    In this problem, you will perform hands-on calculations of the SVM loss and the softmax loss we learned in class.

We define a linear classifier:

f(x,W) = Wx + b

and are given a data sample:

x  .

Assume that the weights of our model are given by

W   .

Please calculate 1) SVM loss (hinge loss) and 2) softmax loss (cross-entropy loss) of this sample. Use the natural log.
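Since the concrete values of x, W, and b are not reproduced above, the sketch below uses clearly hypothetical placeholders; substitute the numbers given in the handout. It computes the multiclass hinge loss (margin 1, summed over the incorrect classes) and the softmax cross-entropy loss with the natural log.

```python
import numpy as np

# Hypothetical placeholder values -- replace with the ones from the problem.
x = np.array([1.0, 2.0])
W = np.array([[ 0.5, -1.0],
              [ 2.0,  0.5],
              [-1.5,  1.0]])
b = np.array([0.0, 0.5, -0.5])
y = 0                                          # index of the correct class (placeholder)

f = W @ x + b                                  # scores f(x, W) = Wx + b

# 1) Multiclass SVM (hinge) loss with margin 1, summed over the incorrect classes.
margins = np.maximum(0.0, f - f[y] + 1.0)
margins[y] = 0.0                               # the correct class contributes nothing
svm_loss = margins.sum()

# 2) Softmax (cross-entropy) loss with the natural log.
p = np.exp(f - f.max()) / np.exp(f - f.max()).sum()
softmax_loss = -np.log(p[y])

print("SVM loss:", svm_loss, "  softmax loss:", softmax_loss)
```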

