Linear Algebra
1. A matrix $A$ over $\mathbb{R}$ is called positive semidefinite (PSD) if for every vector $v$, $v^T A v \geq 0$. Show that a symmetric matrix $A$ is PSD if and only if it can be written as $A = XX^T$.
Hint: a real symmetric matrix $A$ can be decomposed as $A = Q^T D Q$, where $Q$ is an orthogonal matrix and $D$ is a diagonal matrix with the eigenvalues of $A$ as its diagonal elements.
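If you want to experiment before writing the proof, here is a minimal numpy sketch (an illustration, not a proof) that builds an explicit $X$ with $A = XX^T$ from the decomposition in the hint:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T  # symmetric PSD by construction

# eigh returns eigenvalues w and orthonormal eigenvectors V with A = V diag(w) V^T
w, V = np.linalg.eigh(A)
X = V @ np.diag(np.sqrt(np.clip(w, 0, None)))  # clip guards tiny negative round-off
print(np.allclose(A, X @ X.T))  # True: A = X X^T

v = rng.standard_normal(5)
print(v @ A @ v >= 0)  # the quadratic form is non-negative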
2. Show that the set of all symmetric positive semidefinite matrices is convex: namely, that for any two symmetric positive semidefinite matrices $A, B$ and a scalar $0 \leq \theta \leq 1$, the matrix $\theta A + (1 - \theta) B$ is also symmetric positive semidefinite.
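A quick numerical illustration of the claim (again, not a substitute for the proof):

import numpy as np

rng = np.random.default_rng(1)
M1, M2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
A, B = M1 @ M1.T, M2 @ M2.T  # two symmetric PSD matrices

for theta in np.linspace(0, 1, 11):
    C = theta * A + (1 - theta) * B
    assert np.linalg.eigvalsh(C).min() >= -1e-10  # PSD up to round-off
print("all convex combinations are PSD")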
Calculus and Probability
1. Matrix calculus is the extension of notions from calculus to matrices and vectors. We define the derivative of a scalar $y$ with respect to a vector $x \in \mathbb{R}^n$ as the column vector $\frac{\partial y}{\partial x}$ that obeys
$$\left(\frac{\partial y}{\partial x}\right)_i = \frac{\partial y}{\partial x_i}$$
for $i = 1, \dots, n$. Let $A$ be an $n \times n$ matrix. Prove that
$$\frac{\partial}{\partial x}\left(x^T A x\right) = \left(A + A^T\right) x$$
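To sanity-check this identity numerically, one can compare the analytic gradient against central finite differences; this sketch assumes the statement as reconstructed above:

import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda z: z @ A @ z
eps = 1e-6
# central finite-difference approximation of the gradient, one coordinate at a time
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(numeric, (A + A.T) @ x, atol=1e-5))  # True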
2. Let $p = (p_1, \dots, p_n)$ be a discrete distribution, with $\sum_{i=1}^{n} p_i = 1$ and $p_i \geq 0$ for $i = 1, \dots, n$.
The entropy, which measures the uncertainty of a distribution, is defined by
$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$
where we define $0 \log 0 = 0$. Use Lagrange multipliers to show that the uniform distribution has the largest entropy (tip: ignore the inequality constraints $p_i \geq 0$ and show that the solution satisfies them regardless).
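As a sanity check (not part of the required Lagrange-multiplier argument), one can sample random distributions and verify that their entropy never exceeds $\log n = H(\text{uniform})$:

import numpy as np

rng = np.random.default_rng(3)
n = 5
H = lambda p: -np.sum(np.where(p > 0, p * np.log(p), 0.0))  # 0 log 0 := 0

# random points on the probability simplex, via a Dirichlet distribution
best = max(H(rng.dirichlet(np.ones(n))) for _ in range(10_000))
print(best, "<=", np.log(n))  # log n = H(uniform) is never exceeded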
3. Let $X_0, \dots, X_{n-1}$ be $n$ positive ($\forall i,\ X_i \in [0, \infty)$) i.i.d. random variables with a continuous probability density function $f_X$. We will prove the following theorem.
We will start with a side lemma that will help us with the proof.
(a) Prove the following lemma, where $F_X(a)$ is the cumulative distribution function (CDF) of $X_0$ at the point $a$ and $f_{X_0}(a)$ is the probability density function (PDF) of $X_0$ at $a$.
(b) Recall that the CDF is an integral over the probability density function ($f$ is continuous), and use this fact to complete the proof of the theorem above.
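Results of this type typically involve the distribution of $\max\{X_0, \dots, X_{n-1}\}$, whose CDF for i.i.d. variables is $F_X(a)^n$; assuming that is the setting here, a quick Monte Carlo check with exponential $X_i$:

import numpy as np

rng = np.random.default_rng(4)
n, trials, a = 5, 200_000, 1.0
samples = rng.exponential(scale=1.0, size=(trials, n))
empirical = (samples.max(axis=1) <= a).mean()  # P[max of the X_i <= a], estimated
F = 1 - np.exp(-a)              # CDF of a rate-1 exponential at a
print(empirical, "vs", F ** n)  # the two should agree closely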
Decision Rules and Concentration Bounds
1. Let $X$ and $Y$ be random variables, where $Y$ can take values in $\{1, \dots, L\}$. Let $\ell_{0-1}$ be the 0-1 loss function defined in class. Show that
$$h = \arg\min_{f: \mathcal{X} \to \mathbb{R}} E[\ell_{0-1}(f(X), Y)]$$
is given by
$$h(x) = \arg\max_{i \in \{1, \dots, L\}} P[Y = i \mid X = x]$$
2. Let $X = (X_1, \dots, X_n)^T$ be a vector of random variables. $X$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbb{R}^n$ and an $n \times n$ positive definite covariance matrix $\Sigma$ if its probability density function is given by
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
where $E[X_i] = \mu_i$ and $\mathrm{cov}(X_i, X_j) = \Sigma_{ij}$ for all $i, j = 1, \dots, n$. We write this as $X \sim N(\mu, \Sigma)$.
In this question, we generalize the decision rule we have seen in the recitation to more than one dimension. Assume that the data consists of $\langle x, y \rangle$ pairs, where $x \in \mathbb{R}^d$ and $y \in \{0, 1\}$. Denote by $f_0(x)$ and $f_1(x)$ the probability density functions of $x$ given each of the two label values. It is known that $f_0, f_1$ are multivariate Gaussian:
$$f_0(x) = f(x; \mu_0, \Sigma) \qquad f_1(x) = f(x; \mu_1, \Sigma)$$
Note that the covariance matrix $\Sigma$ is the same for both distributions, but the mean vectors $\mu_0, \mu_1$ are different. Finally, it is known that the probability to sample a positive sample (i.e., $y = 1$) is $p$.
(a) We are given a point $x$ and we need to label it with either $y = 0$ or $y = 1$. Suppose our decision rule is to decide $y = 1$ if and only if $P[y = 1 \mid X] \geq P[y = 0 \mid X]$. Find a simpler condition on $X$ that is equivalent to this rule.
(b) The decision boundary for this problem is defined as the set of points for which $P[y = 1 \mid X] = P[y = 0 \mid X]$. What is the shape of the decision boundary for a general $d \geq 1$ (think of $d = 1$ and $d = 2$ for intuition)?
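The following sketch implements the unsimplified rule directly via Bayes' rule and the Gaussian densities; the exercise asks you to reduce this comparison to a simpler condition. The values of $\mu_0, \mu_1, \Sigma, p$ below are arbitrary illustrative choices:

import numpy as np
from scipy.stats import multivariate_normal

mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p = 0.4  # prior probability of y = 1

def predict(x):
    # compare p * f1(x) with (1 - p) * f0(x); the normalizing evidence term cancels
    f0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
    f1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
    return int(p * f1 >= (1 - p) * f0)

print(predict([0.2, 0.1]), predict([0.9, 0.8]))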
3. Let $X_1, \dots, X_n$ be i.i.d. random variables that are uniformly distributed over the interval $[-3, 5]$.
Define $S = X_1 + \dots + X_n$. Use Hoeffding's inequality to find $N \in \mathbb{N}$ such that for all $n \geq N$,
$$P\left[S \geq \frac{n}{2} + 0.2n\right] < 0.1$$
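For intuition, one can estimate a tail probability of $S$ by simulation and compare it with the one-sided Hoeffding bound $P[S - E[S] \geq t] \leq \exp(-2t^2 / (n(b - a)^2))$; the choices of $n$ and $t$ here are illustrative only, and the exercise asks for an exact $N$:

import numpy as np

rng = np.random.default_rng(5)
n, trials = 100, 50_000
a, b = -3.0, 5.0
S = rng.uniform(a, b, size=(trials, n)).sum(axis=1)
ES = n * (a + b) / 2  # E[S] = n, since E[X_i] = 1
t = 0.2 * n
empirical = (S - ES >= t).mean()
bound = np.exp(-2 * t**2 / (n * (b - a) ** 2))
print(empirical, "<=", bound)  # empirical tail sits below the Hoeffding bound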
4. Suppose we need to build a load-balancing device to assign a set of $n$ jobs to a set of $m$ servers. Suppose the $j$-th job takes $L_j$ time, $0 \leq L_j \leq 1$ (say, in seconds). The goal is to assign the $n$ jobs to the $m$ servers so that the load is as balanced as possible (i.e., so that the busiest server finishes as quickly as possible). Suppose each server works sequentially through the jobs assigned to it and finishes in time equal to the sum of the job lengths assigned to it. Let $L = \sum_{j=1}^{n} L_j$ be the total sum of job lengths (assume $L \geq m$). With perfect load balancing, each server would take $L/m$ time. There are some good algorithms for this scenario, but we are interested in analyzing the case of random assignment of jobs to servers.
Suppose we assign a random server to each job, with replacement. Denote by $R_{i,j}$ the load on server $i$ from job $j$; that is, $R_{i,j} = L_j$ if server $i$ was assigned job $j$, and $0$ otherwise. Also, let $R_i = \sum_{j=1}^{n} R_{i,j}$ be the total load on server $i$.
(a) What is $E[R_i]$?
(b) We want to bound the probability that the load on the $i$-th server is more than $\delta = 10\%$ larger than the expected load. Use the multiplicative form of the Chernoff bound to bound
$$P[R_i \geq (1 + \delta) \cdot E[R_i]]$$
Note that this form doesn't require the summed random variables to be identically distributed.
(c) Now, we want to bound the probability that any of the servers is overloaded by more than $\delta = 10\%$ of the expected load. Give a bound for
$$P\left[R_1 \geq (1 + \delta) E[R_1] \text{ or } \dots \text{ or } R_m \geq (1 + \delta) E[R_m]\right]$$
using the results from (a) and (b) and using the union bound (reminder: for events $A_1, \dots, A_k$, the union bound is $P[A_1 \cup \dots \cup A_k] \leq \sum_{i=1}^{k} P[A_i]$).
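A simulation of the random-assignment process can give a feel for the probability being bounded here; the values of $n$, $m$, and the job lengths below are arbitrary illustrative assumptions:

import numpy as np

rng = np.random.default_rng(6)
n, m, delta, trials = 10_000, 10, 0.1, 200
L_j = rng.uniform(0, 1, size=n)  # job lengths in [0, 1]
expected = L_j.sum() / m         # E[R_i] = L / m for every server

overload = 0
for _ in range(trials):
    servers = rng.integers(0, m, size=n)  # a uniformly random server per job
    loads = np.bincount(servers, weights=L_j, minlength=m)
    overload += (loads >= (1 + delta) * expected).any()
print("empirical P[any server overloaded] ~", overload / trials)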
Programming Assignment
1. Nearest Neighbor. In this question, we will study the performance of the Nearest Neighbor (NN) algorithm on the MNIST dataset. The MNIST dataset consists of images of handwritten digits, along with their labels. Each image has $28 \times 28$ pixels, where each pixel is grayscale and takes an integer value from 0 to 255. Each label is a digit between 0 and 9. The dataset contains 70,000 images. Although each image is square, we treat it as a vector of size 784.
The MNIST dataset can be loaded with sklearn as follows:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
data = mnist['data']
labels = mnist['target']
Loading the dataset might take a while when first run, but will be immediate later. See http://scikit-learn.org/stable/datasets/mldata.html for more details. Define the training and test set of images as follows:
import numpy.random
idx = numpy.random.RandomState(0).choice(70000, 11000)
train = data[idx[:10000], :].astype(int)
train_labels = labels[idx[:10000]]
test = data[idx[10000:], :].astype(int)
test_labels = labels[idx[10000:]]
It is recommended to use numpy and scipy where possible for speed, especially in distance computations.
(a) Write a function that accepts as input: (i) a set of images; (ii) a vector of labels, corresponding to the images; (iii) a query image; and (iv) a number $k$. The function should implement the $k$-NN algorithm, returning a prediction for the query image given the labeled set of images. It should use the $k$ nearest neighbors under the Euclidean ($L_2$) metric. In case of a tie among the labels of the $k$ neighbors, it may choose an arbitrary option. (A minimal sketch appears after this list.)
(b) Run the algorithm using the first n = 1000 training images, on each of the test images, using k = 10. What is the accuracy of the prediction (measured by 0-1 loss; i.e. the percentage of correct classifications)? What would you expect from a completely random predictor?
(c) Plot the prediction accuracy as a function of k, for k = 1,...,100 and n = 1000. Discuss the results. What is the best k?
(d) Using k = 1, run the algorithm on an increasing number of training images. Plot the prediction accuracy as a function of n = 100,200,...,5000. Discuss the results.
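A minimal sketch of the predictor from part (a), assuming the train and train_labels arrays defined above; ties are broken arbitrarily by taking the first majority label, as the exercise permits. The name knn_predict is illustrative:

import numpy as np

def knn_predict(images, labels, query, k):
    # Euclidean (L2) distances from the query to every stored image
    dists = np.linalg.norm(images - query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest images
    votes, counts = np.unique(labels[nearest], return_counts=True)
    return votes[np.argmax(counts)]  # majority label among the k neighbors

# e.g., predict the first test image from the first 1000 training images:
# print(knn_predict(train[:1000], train_labels[:1000], test[0], k=10))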