Show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem corresponds to which code.
1. (2 points) In Python, generate a (2-dimensional multivariate Gaussian) data matrix D using the following code:
import numpy as np
mu = np.array([0, 0])
Sigma = np.array([[1, 0], [0, 1]])
X1, X2 = np.random.multivariate_normal(mu, Sigma, 1000).T
D = np.array([X1, X2]).T
Create a scatter plot of the data, with the x-axis corresponding to the first attribute (column) in D, and the y-axis corresponding to the second attribute (column) in D.
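A minimal sketch for Question 1, reusing the data-generation code above; the matplotlib labels and marker size are my own choices.

# Problem 1: generate D and scatter-plot its two attributes
import numpy as np
import matplotlib.pyplot as plt

mu = np.array([0, 0])
Sigma = np.array([[1, 0], [0, 1]])
X1, X2 = np.random.multivariate_normal(mu, Sigma, 1000).T
D = np.array([X1, X2]).T

plt.scatter(D[:, 0], D[:, 1], s=5)
plt.xlabel("Attribute 1 (column 1 of D)")
plt.ylabel("Attribute 2 (column 2 of D)")
plt.show()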
2. (7 points) Use the scaling matrix S and rotation matrix R below to transform the data D from Question 1 by multiplying each data instance (row) xi by RS. Let DRS be the matrix of the transformed data; that is, each 2-dimensional row vector xi in D should be transformed into the 2-dimensional vector RSxi in DRS. (A code sketch follows part (c).)
(a) (4 points) Plot the transformed data DRS in the same figure as the original data D, using different colors to differentiate between the original and transformed data.
(b) (2 points) Write down the covariance matrix of the transformed data DRS.
(c) (1 point) What is the total variance of the transformed data DRS?
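A minimal sketch for Question 2. The actual S and R are the matrices given in the assignment and are not reproduced here, so the S and R below are hypothetical placeholders; D is regenerated as in Question 1 so the snippet runs on its own.

# Problem 2: transform D by RS and compare with the original data.
# NOTE: S and R are hypothetical placeholders; substitute the scaling and
# rotation matrices given in the assignment.
import numpy as np
import matplotlib.pyplot as plt

D = np.random.multivariate_normal([0, 0], np.eye(2), 1000)  # as in Problem 1

theta = np.pi / 4                              # placeholder rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([5.0, 2.0])                        # placeholder scaling factors

# Each row xi of D maps to RS xi; in matrix form, D_RS = D (RS)^T.
D_RS = D @ (R @ S).T

# (a) original and transformed data in the same figure
plt.scatter(D[:, 0], D[:, 1], s=5, c="tab:blue", label="original D")
plt.scatter(D_RS[:, 0], D_RS[:, 1], s=5, c="tab:orange", label="transformed D_RS")
plt.legend()
plt.show()

# (b) sample covariance matrix of D_RS (rows are observations)
cov_RS = np.cov(D_RS, rowvar=False)
print(cov_RS)

# (c) total variance = trace of the covariance matrix
print(np.trace(cov_RS))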
3. (8 points) Use sklearn’s PCA function to transform the data matrix DRS from Question 2 to a 2-dimensional space where the coordinate axes are the principal components. (A code sketch follows part (c).)
(a) (4 points) Plot the PCA-transformed data, with the x-axis corresponding to the first principal component and the y-axis corresponding to the second principal component.
(b) (2 points) What is the estimated covariance matrix of the PCA-transformed data?
(c) (2 points) What is the fraction of the total variance captured in the direction of the first principal component? What is the fraction of the total variance captured in the direction of the second principal component?
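A minimal sketch for Question 3, assuming the D_RS array from the Question 2 sketch is still in scope.

# Problem 3: PCA-transform D_RS with sklearn's PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Z = pca.fit_transform(D_RS)        # D_RS from Problem 2

# (a) scatter plot in the principal-component axes
plt.scatter(Z[:, 0], Z[:, 1], s=5)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()

# (b) estimated covariance matrix of the PCA-transformed data
# (approximately diagonal, with the PC variances on the diagonal)
print(np.cov(Z, rowvar=False))

# (c) fraction of the total variance along each principal component
print(pca.explained_variance_ratio_)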
4. (18 points) Load the Boston data set into Python using sklearn’s datasets package. Use sklearn’s PCA function to reduce the dimensionality of the data to 2 dimensions. (Code sketches for parts (a)-(e) appear after part (e).)
(a) (5 points) First, standard-normalize the data. Then, create a scatter plot of the 2-dimensional, PCA-transformed normalized Boston data, with the x-axis corresponding to the first principal component and the y-axis corresponding to the second principal component.
(b) (3 points) Create a plot of the fraction of the total variance explained by the first r components for r = 1,2,...,13.
(c) (2 points)
i. (1 point) If we want to capture at least 90% of the variance of the normalized Boston data, how many principal components (i.e., what dimensionality) should we use?
ii. (1 point) If we use two principal components of the normalized Boston data, how much (what fraction or percentage) of the total variance do we capture?
(d) (4 points) Use scikit-learn’s implementation of k-means to find 2 clusters in the two-dimensional, PCA-transformed normalized Boston data set (the input to k-means should be the data that was plotted in part 4(a)). Plot the 2-dimensional data with colors corresponding to predicted cluster membership for each point. On the same plot, also plot the two means found by the k-means algorithm in a different color than the colors used for the data.
(e) (4 points) Use scikit-learn’s implementation of DBSCAN to find clusters in the two-dimensional, PCA-transformed normalized Boston data set (the input to DBSCAN should be the data that was plotted in part 4(a)). Plot the 2-dimensional data with colors corresponding to predicted cluster membership for each point. Noise points should be colored differently than any of the clusters. How many clusters were found by DBSCAN?
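A minimal sketch for parts 4(a)-(c). It assumes a scikit-learn version that still ships load_boston (the loader was removed in scikit-learn 1.2); with a newer version the Boston data would have to be obtained another way.

# Problem 4(a)-(c): standardize the Boston data, apply PCA, and examine
# the fraction of variance explained.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston      # removed in scikit-learn 1.2
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_boston().data                        # 506 x 13 feature matrix
X_norm = StandardScaler().fit_transform(X)    # standard-normalize

# (a) scatter plot of the 2-dimensional PCA projection
pca2 = PCA(n_components=2)
Z = pca2.fit_transform(X_norm)
plt.scatter(Z[:, 0], Z[:, 1], s=5)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()

# (b) fraction of the total variance explained by the first r components
cum_var = np.cumsum(PCA().fit(X_norm).explained_variance_ratio_)
plt.plot(range(1, 14), cum_var, marker="o")
plt.xlabel("number of components r")
plt.ylabel("fraction of total variance explained")
plt.show()

# (c)i. smallest r capturing at least 90% of the total variance
print(int(np.argmax(cum_var >= 0.90)) + 1)
# (c)ii. fraction of the total variance captured by two components
print(cum_var[1])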
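A minimal sketch for parts 4(d)-(e), clustering the 2-dimensional PCA-transformed data Z from the previous sketch; the DBSCAN eps and min_samples values are placeholders to be tuned, so the number of clusters reported will depend on them.

# Problem 4(d): k-means with k = 2 on the PCA-transformed data Z
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN

km = KMeans(n_clusters=2, random_state=0).fit(Z)
plt.scatter(Z[:, 0], Z[:, 1], c=km.labels_, s=5)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="cluster means")
plt.legend()
plt.show()

# Problem 4(e): DBSCAN; eps and min_samples are placeholder values
db = DBSCAN(eps=0.5, min_samples=5).fit(Z)
labels = db.labels_                            # label -1 marks noise points
plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=5)
plt.show()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found by DBSCAN:", n_clusters)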