CSC2515 HOMEWORK 1 Solution

(a) [5 pts] Consider two independent univariate random variables X and Y sampled uniformly from the unit interval [0,1]. Determine the expectation and variance of the random variable Z = |X − Y|^2, i.e., the squared distance between X and Y.
Note: You can either compute the integrals yourself or use the properties of certain probability distributions. In the latter case, explicitly mention what properties you have used.
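As an optional sanity check (not part of the required derivation), the expectation and variance in part (a) can be estimated by simulation. The sketch below is a minimal illustration using NumPy; the sample size is an arbitrary choice.

import numpy as np

# Estimate E[Z] and Var[Z] for Z = |X - Y|^2 with X, Y ~ Uniform[0, 1] independent.
rng = np.random.default_rng(0)
n_samples = 1_000_000
x = rng.uniform(0.0, 1.0, size=n_samples)
y = rng.uniform(0.0, 1.0, size=n_samples)
z = (x - y) ** 2
print("estimated E[Z]   =", z.mean())   # should be close to the closed-form answer
print("estimated Var[Z] =", z.var())    # should be close to the closed-form answer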
(b) [5 pts] Now suppose we draw two d-dimensional points X and Y from a d-dimensional unit cube with a uniform distribution, i.e., X, Y ∈ [0,1]^d. Observe that each coordinate is sampled independently and uniformly from [0,1], that is, we can view this as drawing random variables X1,...,Xd and Y1,...,Yd independently and uniformly from [0,1]. The squared Euclidean distance can be written as R = Z1 + ··· + Zd, where Zi = |Xi − Yi|^2.
Using the properties of expectation and variance, determine E[R] and Var[R] in terms of E[Z] and Var[Z] (the answers from part (a)).
(c) [5 pts] Based on your answer to part (b), compare the mean and standard deviation of R to the maximum possible squared Euclidean distance between two points within the d-dimensional unit cube (this would be the squared distance between opposite corners of the cube). Why does this support the claim that in high dimensions, “most points are far away, and approximately have the same distance”?
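To see the phenomenon in parts (b)–(c) empirically, the following sketch (an illustration only; the dimensions and sample size are arbitrary) compares the empirical mean and standard deviation of R with the maximum possible squared distance d.

import numpy as np

# Concentration of squared distances between uniform points in [0, 1]^d.
# The maximum possible squared Euclidean distance between two points is d.
rng = np.random.default_rng(0)
n_pairs = 100_000
for d in [1, 10, 100, 1000]:
    x = rng.uniform(size=(n_pairs, d))
    y = rng.uniform(size=(n_pairs, d))
    r = ((x - y) ** 2).sum(axis=1)          # squared Euclidean distances
    print(f"d={d:5d}  mean(R)={r.mean():8.2f}  std(R)={r.std():6.2f}  max possible={d}")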
2. Limiting Properties of the Nearest Neighbour Algorithm – 10 pts. In this question, we will study the limiting properties of the k-nearest neighbour algorithm.
Suppose that we are given n data points X1,...,Xn, sampled uniformly and independently from the interval [0,2] and one test point, y = 1.
Let Zi = |Xi − y| denote the distance of y to the data point Xi and Z(i) = |X(i) − y| denote the distance of y to its i-th nearest neighbour. Then, Z(1) < ··· < Z(n).
(a) [2 pts] Show that the random variable Zi is uniformly distributed on the unit interval [0,1].
(b) [4 pts] Show that E[Z(1)] = 1/(n+1), i.e., the expected distance to the 1st nearest neighbour is 1/(n+1).
Hint: The random variable Z(1) is the minimum of Z1,...,Zn, that is, Z(1) = min{Z1,...,Zn}. So
P(Z(1) > t) = P(min{Z1,...,Zn} > t).
If min{Z1,...,Zn} > t, what can you say about the values of Z1,...,Zn in relation to t? Do not forget to benefit from the independence of the Zi's.
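As an optional sanity check (not a substitute for the proof), the following sketch estimates E[Z(1)] by simulation and compares it with 1/(n+1); n and the number of trials are arbitrary choices.

import numpy as np

# Estimate the expected distance from y = 1 to the nearest of n points drawn
# uniformly from [0, 2], and compare with 1/(n+1).
rng = np.random.default_rng(0)
n, n_trials = 20, 200_000
x = rng.uniform(0.0, 2.0, size=(n_trials, n))
z1 = np.abs(x - 1.0).min(axis=1)            # distance to the nearest neighbour
print("estimated E[Z(1)] =", z1.mean())
print("1/(n+1)           =", 1.0 / (n + 1))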
(c) [2 pts] Determine the expected value of the random variable Z(k), that is, the expected distance to the k-th nearest neighbour.
Hint: You can use the fact that the density function of Z(k) is
f(t) = (n! / ((k − 1)! (n − k)!)) t^(k−1) (1 − t)^(n−k), for t ∈ [0,1].
(d) [2 pts] Based on your answer to part (c), what can you say about the expected distance to the k-th nearest neighbour as n → ∞ (with k kept fixed)?
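An optional numerical check of the hint (illustrative only; n, k, and the grid size are arbitrary): simulate Z(k) directly and compare its empirical mean with the mean computed from the stated density by numerical integration.

import numpy as np
from math import factorial

# Compare the empirical mean of Z(k) with the mean of the density given in the hint.
rng = np.random.default_rng(0)
n, k, n_trials = 20, 3, 200_000
x = rng.uniform(0.0, 2.0, size=(n_trials, n))
zk = np.sort(np.abs(x - 1.0), axis=1)[:, k - 1]   # distance to the k-th nearest neighbour
t = np.linspace(0.0, 1.0, 100_001)
f = factorial(n) / (factorial(k - 1) * factorial(n - k)) * t**(k - 1) * (1 - t)**(n - k)
print("empirical E[Z(k)]      =", zk.mean())
print("mean from hint density =", np.sum(t * f) * (t[1] - t[0]))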
3. Information Theory – 15 pts. The goal of this question is to help you become more familiar with the basic equalities and inequalities of information theory. They appear in many contexts in machine learning and elsewhere, so having some experience with them is helpful. We review some concepts from information theory, and ask you a few questions.
Recall the definition of the entropy of a discrete random variable X with probability mass function p:
H(X) = − ∑_x p(x) log p(x).
Here the summation is over all possible values of x ∈ X, which (for simplicity) we assume is finite. For example, X might be {1,2,...,N}.
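For intuition, here is a small sketch of this definition in code (the example distributions are made up); using log base 2 gives entropy in bits.

import numpy as np

# Entropy of a discrete distribution given as an array of probabilities.
# Terms with p(x) = 0 are dropped, using the convention 0 * log 0 = 0.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))        # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))        # 0.0 bits: a deterministic outcome
print(entropy([0.25] * 4))        # 2.0 bits: uniform over 4 values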
(a) [3pt] Prove that the entropy H(X) is non-negative.
(b) [3pt] If X and Y are independent random variables, show that H(X,Y) = H(X) + H(Y).
(c) [3pt] Prove the chain rule for entropy: H(X,Y) = H(X) + H(Y |X).
An important concept in information theory is the relative entropy or the KL-divergence of two distributions p and q. It is defined as
KL(p||q) = ∑_x p(x) log (p(x) / q(x)).
If two distributions are close to each other, their KL divergence is small. If they are exactly the same, their KL divergence is zero. KL divergence is not a true distance metric, as it isn’t symmetric and doesn’t satisfy the triangle inequality, but we often use it as a measure of dissimilarity between two probability distributions.
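A small illustrative sketch (with made-up distributions) of this computation, also showing the asymmetry mentioned above:

import numpy as np

# KL divergence between two discrete distributions with the same support.
# Assumes q(x) > 0 wherever p(x) > 0, so every term is well defined.
def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]
print(kl(p, q), kl(q, p))   # the two values differ: KL is not symmetric
print(kl(p, p))             # 0.0: identical distributions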
(d) [3pt] Prove that KL(p||q) is non-negative.
(e) [3pt] The Information Gain or Mutual Information between X and Y is I(Y;X) = H(Y) − H(Y|X). Show that
I(Y ;X) = KL(p(x,y)||p(x)p(y)),
where p(x) is the marginal distribution of X and p(y) is the marginal distribution of Y .
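The identity in part (e) can be checked numerically on a small joint distribution. The sketch below uses a made-up 2×2 joint table and the chain rule from part (c) to compute both sides; it is an illustration, not a proof.

import numpy as np

# Check I(Y;X) = KL(p(x,y) || p(x)p(y)) on a toy 2x2 joint distribution.
def entropy2(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.3, 0.1],     # rows index x, columns index y (made-up values)
                [0.2, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# Left-hand side: H(Y) - H(Y|X), with H(Y|X) = H(X,Y) - H(X) by the chain rule.
lhs = entropy2(py) - (entropy2(pxy.ravel()) - entropy2(px))
# Right-hand side: KL between the joint and the product of the marginals.
rhs = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(lhs, rhs)   # the two numbers should agree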
4. Approximation Error in Decision Trees – 10 pts. We will study how we can use a decision tree to approximate a function and how the quality of the approximation improves as the depth increases.
Consider the function f∗ : [0,1]^2 → {−1,+1} as visualized below. This function takes the value of +1 on the upper left triangle and −1 on the lower right triangle.

[Figure: the unit square [0,1]^2 split by its main diagonal, with f∗ = +1 on the upper-left triangle and −1 on the lower-right triangle.]
We would like to approximate this function f∗ using a decision tree with maximum depth d. We denote the best approximation of depth d by fd.
(a) [2 pts] Explain why fd with a finite d cannot represent f∗ exactly.
(b) [2 pts] Show what f4 is. You should draw the tree and include all the relevant information such as the attributes at each node, the splitting threshold, and the value of the leaves. You also need to show the regions that it induces.
Let us define the error the decision tree fd makes in approximating f∗ as
ed = ∫_{[0,1]^2} I{fd(x) ≠ f∗(x)} dx.
This is simply the area of the region in [0,1]^2 where the function fd differs from f∗ (see the sketch after part (e) for a Monte Carlo estimate of this quantity).
(c) [2 pts] What are the values of e2 and e4?
(d) [2 pts] Provide the formula for ed for any even d, that is, d = 2k with k ∈ N. You need to justify your formula, but your justification does not need to be detailed.
(e) [2 pts] What does this tell us about the quality of the approximation as a function of the depth d?
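The sketch below is one hypothetical way to estimate ed numerically: fit scikit-learn's DecisionTreeClassifier with max_depth=d to a dense sample of f∗ and measure the disagreement area by Monte Carlo. The greedily grown tree is only a stand-in for the best depth-d approximation, and the sample sizes are arbitrary, so treat the numbers as an illustration rather than the exact values asked for above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# f*(x) = +1 on the upper-left triangle (x2 > x1), -1 on the lower-right triangle.
def f_star(x):
    return np.where(x[:, 1] > x[:, 0], 1, -1)

rng = np.random.default_rng(0)
x_train = rng.uniform(size=(200_000, 2))
x_test = rng.uniform(size=(200_000, 2))
for d in [2, 4, 6, 8]:
    tree = DecisionTreeClassifier(max_depth=d).fit(x_train, f_star(x_train))
    err = np.mean(tree.predict(x_test) != f_star(x_test))   # estimate of e_d
    print(f"depth {d}: estimated error area = {err:.4f}")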
5. Decision Trees and K-Nearest Neighbour – 50 pts. In this question, you will use scikit-learn's decision tree and KNN classifiers to classify tweets as either agreeing that climate change exists or denying that it exists. The aim of this question is for you to read the scikit-learn API and get comfortable with training/validation splits.
We will use a dataset consisting of tweets scraped from Twitter, which sentiment analysis has classified as either “climate change asserting” (meaning agreeing that climate change is real) or “climate change denying” (meaning disagreeing with climate change). We will refer to these sets of tweets as exists and DNE, with respect to their sentiment towards climate change. The exists dataset contains 3088 tweets, and the DNE dataset contains 1099 tweets from https://data.world/xprizeai-env/sentiment-of-climate-change. The data were cleaned by removing special characters and all link artefacts (“[link]”). The cleaned data are available as exists_climate.csv and DNE_climate.csv on the course webpage. You are expected to use these cleaned data sources for this assignment.
You will build a decision tree and a KNN classifier to classify “climate change asserting” or “climate change denying” tweets. Instead of coding these methods yourself, you will do what we normally do in practice: use an existing implementation. You should use the DecisionTreeClassifier and KNeighborsClassifier included in scikit-learn. Note that figuring out how to use these implementations and their corresponding attributes and methods is part of the assignment. All code should be submitted in hw1_code.py.
(a) [10 pts] Write a function load_data which loads the data, preprocesses it using a vectorizer, and splits it into training, validation, and test sets.
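A minimal sketch of what load_data might look like. The choice of CountVectorizer, the 70/15/15 split, the return signature, and the assumption that each CSV has a header row with the tweet text in its first column are all assumptions for illustration, not requirements taken from the handout.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def load_data(exists_path="exists_climate.csv", dne_path="DNE_climate.csv"):
    # Label "exists" tweets as 1 and "DNE" (denying) tweets as 0.
    exists = pd.read_csv(exists_path)
    dne = pd.read_csv(dne_path)
    # Assumes the tweet text is in the first column of each CSV.
    texts = pd.concat([exists.iloc[:, 0], dne.iloc[:, 0]]).astype(str).tolist()
    labels = np.array([1] * len(exists) + [0] * len(dne))

    # Turn each tweet into a bag-of-words feature vector.
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(texts)

    # Assumed 70/15/15 train/validation/test split.
    x_train, x_rest, y_train, y_rest = train_test_split(
        features, labels, train_size=0.7, random_state=0)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=0)
    return x_train, y_train, x_val, y_val, x_test, y_test, vectorizer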
(b) [10 pts] (Decision Tree) Write a function select_tree_model that trains the decision tree classifier using at least 5 different sensible values of max_depth, as well as two different split criteria (Information Gain and Gini coefficient), evaluates the performance of each one on the validation set, and prints the resulting accuracies of each model.
You should use DecisionTreeClassifier, but you should write the validation code yourself. Include the output of this function in your solution.
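One possible shape for select_tree_model is sketched below. The depth grid, the random_state, and the choice to return the best model are illustrative assumptions, and it relies on a split like the one produced by the load_data sketch above.

from sklearn.tree import DecisionTreeClassifier

def select_tree_model(x_train, y_train, x_val, y_val,
                      depths=(2, 4, 8, 16, 32), criteria=("entropy", "gini")):
    # Train one tree per (criterion, max_depth) pair and report validation accuracy.
    best_model, best_acc = None, -1.0
    for criterion in criteria:
        for depth in depths:
            model = DecisionTreeClassifier(criterion=criterion, max_depth=depth,
                                           random_state=0)
            model.fit(x_train, y_train)
            acc = model.score(x_val, y_val)   # validation accuracy
            print(f"criterion={criterion:8s} max_depth={depth:3d} "
                  f"validation accuracy={acc:.4f}")
            if acc > best_acc:
                best_model, best_acc = model, acc
    return best_model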

(d) [10 pts] (Decision Tree) Write a function compute_information_gain which computes the information gain of a split on the training data. That is, compute I(Y,xi), where Y is the random variable signifying whether the tweet is climate change asserting or denying, and xi is the keyword chosen for the split. Your split should be based on whether the keyword xi exists (True) or does not exist (False). You should ignore the number of times that the keyword appears in the sentence.
Report the outputs of this function for the topmost split from the previous part, and for several other keywords.
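A sketch of compute_information_gain under the assumption that the training features are the sparse count vectors from the load_data sketch and that the keyword is looked up through the fitted vectorizer's vocabulary_; the keyword "climate" in the usage note below is a hypothetical example, not necessarily the topmost split.

import numpy as np

def compute_information_gain(x_train, y_train, vectorizer, keyword):
    # Entropy of a binary label array.
    def entropy(y):
        p = np.bincount(y, minlength=2) / len(y)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Split on whether the keyword appears at least once in the tweet,
    # ignoring how many times it appears.
    idx = vectorizer.vocabulary_[keyword]
    present = x_train[:, idx].toarray().ravel() > 0
    y = np.asarray(y_train)

    h_y = entropy(y)
    h_y_given_split = (present.mean() * entropy(y[present]) +
                       (1 - present.mean()) * entropy(y[~present]))
    return h_y - h_y_given_split   # I(Y; split) = H(Y) - H(Y | split)

For example, compute_information_gain(x_train, y_train, vectorizer, "climate") would report the gain for splitting on the (hypothetical) keyword "climate".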
APPENDIX A: CONVEXITY AND JENSEN’S INEQUALITY
Convexity is an important concept in mathematics with many uses in machine learning. We briefly define convex sets and functions and state some of their properties here. These properties are useful for solving some of the questions in this homework. If you are interested in learning more about convexity, refer to Boyd and Vandenberghe, Convex Optimization, 2004.
A set C is convex if the line segment between any two points in C lies within C, i.e., if for any x1,x2 ∈ C and for any 0 ≤ λ ≤ 1, we have
λx1 + (1 − λ)x2 ∈ C.
For example, a cube or a (solid) sphere in R^d is a convex set, but a cross (a shape like the letter X) is not.
A function f : R^d → R is convex if its domain is a convex set and if for all x1,x2 in its domain, and for any 0 ≤ λ ≤ 1, we have
f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).
This inequality means that the line segment between (x1,f(x1)) and (x2,f(x2)) lies above the graph of f. A convex function is bowl-shaped, like ∪. We say that f is concave if −f is convex. A concave function is shaped like ∩.
Some examples of convex and concave functions are (you do not need to use most of them in your homework, but knowing them is useful):
• Powers: x^p is convex on the set of positive real numbers when p ≥ 1 or p ≤ 0. It is concave for 0 ≤ p ≤ 1.
• Exponential: e^(ax) is convex on R, for any a ∈ R.
• Logarithm: log(x) is concave on the set of positive real numbers.
• Norms: Every norm on R^d is convex.
• Max function: f(x) = max{x1,x2,...,xd} is convex on R^d.
• Log-sum-exp: The function f(x) = log(e^(x1) + ··· + e^(xd)) is convex on R^d.
Jensen's inequality states that if φ is a convex function and X is a random variable, then
φ(E[X]) ≤ E[φ(X)].
In words, if we apply a convex function to the expectation of a random variable, the result is less than or equal to the expected value of that convex function applied to the random variable. If the function is concave, the direction of the inequality is reversed.
Jensen’s inequality has a physical interpretation: Consider a set X = {x1,...,xN} of points on R. Corresponding to each point, we have a probability p(xi). If we interpret the probability as mass and place an object with mass p(xi) at location (xi, φ(xi)), then the centre of gravity of these objects, which is in R^2, is located at the point (E[X], E[φ(X)]). If φ is convex (∪-shaped), the centre of gravity lies above the curve x ↦ φ(x); for a concave (∩-shaped) function it lies below.
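As a quick numerical illustration (not needed for the homework), the snippet below checks Jensen's inequality for the convex function φ(x) = x^2 on a random sample; the distribution of X is an arbitrary choice.

import numpy as np

# Empirical check of Jensen's inequality for the convex function phi(x) = x^2:
# phi(E[X]) should be <= E[phi(X)].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=1_000_000)
print("phi(E[X]) =", x.mean() ** 2)
print("E[phi(X)] =", (x ** 2).mean())   # the second number should be the larger one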
