Conceptual and Theoretical Questions
1. This problem involves hyperplanes in two dimensions.
(a) Sketch the hyperplane 1 + 3X1 − X2 = 0. Indicate the set of points for which 1 + 3X1 − X2 > 0, as well as the set of points for which 1 + 3X1 − X2 < 0.
(b) On the same plot, sketch the hyperplane −2 + X1 + 2X2 = 0. Indicate the set of points for which −2 + X1 + 2X2 > 0, as well as the set of points for which −2 + X1 + 2X2 < 0.
Points: 5
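If you would like to check your hand-drawn sketch, a minimal Python sketch along the following lines (assuming numpy and matplotlib are available) shades one side of each hyperplane on a grid:
import numpy as np
import matplotlib.pyplot as plt
# Grid of (X1, X2) points covering the region of interest
x1, x2 = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))
plt.contourf(x1, x2, 1 + 3 * x1 - x2 > 0, alpha=0.2)   # side where 1 + 3 X1 - X2 > 0
plt.contour(x1, x2, 1 + 3 * x1 - x2, levels=[0])        # the hyperplane 1 + 3 X1 - X2 = 0
plt.contour(x1, x2, -2 + x1 + 2 * x2, levels=[0])       # the hyperplane -2 + X1 + 2 X2 = 0
plt.xlabel("X1"); plt.ylabel("X2")
plt.show()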
2. We have seen that in p = 2 dimensions, a linear decision boundary takes the form β0 + β1X1 + β2X2 = 0. We now investigate a non-linear decision boundary.
(a) Sketch the curve
(1 + X1)² + (2 − X2)² = 4.
(b) On your sketch, indicate the set of points for which (1 + X1)² + (2 − X2)² > 4, as well as the set of points for which (1 + X1)² + (2 − X2)² ≤ 4.
(c) Suppose that a classifier assigns an observation to the blue class if
(1 + X1)² + (2 − X2)² > 4,
and to the red class otherwise. To what class is the observation (0, 0) classified? (−1, 1)? (2, 2)? (3, 8)?
(d) Argue that while the decision boundary in (c) is not linear in terms of X1 and X2, it is linear in terms of X1, X2, X1², and X2².
Points: 5
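As a quick, purely optional sanity check on part (c), a short snippet like the one below evaluates the rule (1 + X1)² + (2 − X2)² > 4 at the four query points:
# Evaluate the classifier's rule at each query point from (c)
points = [(0, 0), (-1, 1), (2, 2), (3, 8)]
for x1, x2 in points:
    blue = (1 + x1) ** 2 + (2 - x2) ** 2 > 4
    print((x1, x2), "blue" if blue else "red")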
3. Here we explore the maximal margin classifier on a toy data set.
(a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label. Sketch the observations.
(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form (9.1)).
(c) Describe the classification rule for the maximal margin classifier.
It should be something along the lines of “Classify to Red if β0 + β1X1 + β2X2 > 0, and classify to Blue otherwise.” Provide the values for β0, β1, and β2.
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
(e) Indicate the support vectors for the maximal margin classifier.
(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
Points: 5
4. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?
(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.
Points: 5
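To verify your hand-drawn dendrograms, a scipy sketch along the following lines can be used. Note that only the entries 0.3 (observations 1 and 2) and 0.8 (observations 2 and 4) are taken from the problem statement; the remaining off-diagonal values below are hypothetical placeholders and should be replaced with the actual dissimilarities.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
# Symmetric dissimilarity matrix; only d[0,1] = 0.3 and d[1,3] = 0.8 come from the text,
# the other off-diagonal entries are placeholders.
d = np.array([[0.00, 0.30, 0.40, 0.70],
              [0.30, 0.00, 0.50, 0.80],
              [0.40, 0.50, 0.00, 0.45],
              [0.70, 0.80, 0.45, 0.00]])
for method in ("complete", "single"):
    Z = linkage(squareform(d), method=method)   # linkage expects the condensed distance form
    dendrogram(Z, labels=[1, 2, 3, 4])
    plt.title(method + " linkage")
    plt.show()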
5. In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. Report the cluster labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the cluster labels obtained. Points: 5
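The iteration in (b)-(e) can be cross-checked with a short script such as the sketch below; the coordinates in X are placeholders, so substitute the six observations from the table.
import numpy as np
# Placeholder data: replace with the six (X1, X2) observations from the table
X = np.array([[1.0, 4.0], [1.0, 3.0], [0.0, 4.0],
              [5.0, 1.0], [6.0, 2.0], [4.0, 0.0]])
rng = np.random.default_rng(0)
labels = rng.permutation([0, 0, 0, 1, 1, 1])        # (b) one possible random assignment
while True:
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])   # (c) cluster centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)     # squared Euclidean distances
    new_labels = dists.argmin(axis=1)                                        # (d) assign to nearest centroid
    if np.array_equal(new_labels, labels):                                   # (e) stop when labels stop changing
        break
    labels = new_labels
print(labels)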
6. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
(b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
Points: 5
7. In words, describe the results that you would expect if you performed K-means clustering of the eight shoppers in Figure 10.14 (shown below), on the basis of their sock and computer purchases, with K = 2. Give three answers, one for each of the variable scalings displayed. Explain.
Points: 5
Applied Questions
1. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a
quadratic decision boundary between them. For instance, you can do this as follows:
> X1 = np.random.uniform(size=500) - 0.5
> X2 = np.random.uniform(size=500) - 0.5
> y = 1 * (X1**2 - X2**2 > 0)
(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1², X1 × X2, log(X2), and so forth).
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(i) Comment on your results.
Hints:
- Check https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html
- Check https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Points: 15
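A minimal scikit-learn sketch of the fits in (c)-(h) might look like the following; the particular non-linear features and kernel parameters are only illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X1 = np.random.uniform(size=500) - 0.5
X2 = np.random.uniform(size=500) - 0.5
y = 1 * (X1 ** 2 - X2 ** 2 > 0)
X_lin = np.column_stack([X1, X2])                                  # linear features for (c), (g), (h)
X_nonlin = np.column_stack([X1, X2, X1 ** 2, X2 ** 2, X1 * X2])     # non-linear features for (e)
logit_lin = LogisticRegression().fit(X_lin, y)             # (c) logistic regression, linear features
logit_nonlin = LogisticRegression().fit(X_nonlin, y)       # (e) logistic regression, non-linear features
svc_lin = SVC(kernel="linear").fit(X_lin, y)                # (g) support vector classifier
svm_rbf = SVC(kernel="rbf", gamma=1).fit(X_lin, y)          # (h) SVM with a radial kernel
preds = svm_rbf.predict(X_lin)                               # predicted labels to color the plot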
2. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
(b) Fit a support vector classifier to the data with the linear kernel, in order to predict whether a car gets high or low gas mileage. Report the cross-validation error. Comment on your results.
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree. Comment on your results.
(d) Make some plots to back up your assertions in (b) and (c).
Hints:
- Check https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- Check https://scikit-learn.org/0.18/auto_examples/svm/plot_iris.html
Points: 15
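One possible skeleton for (a)-(c) is sketched below, assuming the Auto data are available as a CSV file with an mpg column; the file name is an assumption.
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
auto = pd.read_csv("Auto.csv")                                   # file name is an assumption
y = (auto["mpg"] > auto["mpg"].median()).astype(int)              # (a) 1 = high mileage, 0 = low
X = auto.drop(columns=["mpg", "name"], errors="ignore").select_dtypes("number")  # numeric predictors only
print(cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5).mean())                      # (b) linear kernel
for gamma in (0.01, 0.1, 1):
    print(gamma, cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean())      # (c) radial kernel
for degree in (2, 3, 4):
    print(degree, cross_val_score(SVC(kernel="poly", degree=degree), X, y, cv=5).mean())  # (c) polynomial kernel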
3. Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
Hints:
- Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
Points: 15
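A scipy sketch of (a)-(c), assuming USArrests is available as a CSV file indexed by state (the file name is an assumption), could look like:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
us = pd.read_csv("USArrests.csv", index_col=0)                         # file name is an assumption
Z = linkage(us, method="complete", metric="euclidean")                 # (a) complete linkage, Euclidean distance
three = fcluster(Z, t=3, criterion="maxclust")                         # (b) cut into three clusters
Z_scaled = linkage(zscore(us), method="complete", metric="euclidean")  # (c) after scaling to sd one
dendrogram(Z, labels=us.index.to_list())
plt.show()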
4. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables. Use uniformly or normally distributed samples.
(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors. Hint: you can assign different means to different classes to create separate clusters.
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
(d) Perform K-means clustering with K = 2. Describe your results.
(e) Now perform K-means clustering with K = 4, and describe your results.
(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix whose first column is the first principal component score vector, and whose second column is the second principal component score vector. Comment on the results.
(g) Using the z-score function, scale each variable to have standard deviation one, and then perform K-means clustering with K = 3 on the scaled data. How do these results compare to those obtained in (b)? Explain.
Hints:
- Check https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- Check https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html
Points: 25
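A compact sketch of the whole pipeline in (a)-(g) follows; the class mean shifts and the random seed are arbitrary choices.
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 50))                       # (a) 60 observations, 50 variables
X[:20] += 3
X[40:] -= 3                                         # shift class means apart (arbitrary amounts)
true_labels = np.repeat([0, 1, 2], 20)
scores = PCA(n_components=2).fit_transform(X)       # (b) first two principal component score vectors
km3 = KMeans(n_clusters=3, n_init=10).fit(X)        # (c)
km2 = KMeans(n_clusters=2, n_init=10).fit(X)        # (d)
km4 = KMeans(n_clusters=4, n_init=10).fit(X)        # (e)
km_pc = KMeans(n_clusters=3, n_init=10).fit(scores)         # (f) cluster the 60 x 2 score matrix
km_scaled = KMeans(n_clusters=3, n_init=10).fit(zscore(X))  # (g) after z-score scaling
print(pd.crosstab(true_labels, km3.labels_))        # compare K-means clusters to the true classes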