This assignment is based on the simulation study described in Section 2.3.1 (the so-called Scenario 2) of “The Elements of Statistical Learning” (ESL).
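Scenario 2: the two-dimensional data X ∈ ℝ² in each class are generated from a mixture of 10 different bivariate Gaussian distributions with uncorrelated components and different means, i.e.,
$$X \mid Y = k, Z = l \sim \mathcal{N}(m_{kl},\, s^2 I_2),$$
where k = 0,1, l = 1 : 10, P(Y = k) = 1/2, and P(Z = l) = 1/10. In other words, given Y = k, X follows a mixture distribution with density function
$$\frac{1}{10} \sum_{l=1}^{10} \frac{1}{2\pi s^2} \exp\!\left(-\frac{\lVert x - m_{kl} \rVert^2}{2 s^2}\right).$$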
You can choose your own values for s and the twenty 2-dim vectors m_{kl}, or you can generate them from some distribution.
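The data-generating process above could be sketched in base R roughly as follows. The seed, the value of s, and the distributions used to draw the centers are assumptions made for illustration (they mirror the choices in ESL Section 2.3.1, with an arbitrary assignment of the two center distributions to classes 0 and 1); the function name `generate_sample` and the column names `x1`, `x2`, `y` are likewise hypothetical.

```r
# Sketch of the Scenario 2 data-generating process (base R only).
# All names and numeric choices below are illustrative assumptions.
set.seed(1)                      # arbitrary seed for reproducibility
p <- 2                           # dimension of X
n_centers <- 10                  # mixture components per class
s <- sqrt(1 / 5)                 # assumed common standard deviation

# Assumed choice: class-1 centers ~ N((1,0), I), class-0 centers ~ N((0,1), I)
m1 <- matrix(rnorm(n_centers * p), n_centers, p) +
  cbind(rep(1, n_centers), rep(0, n_centers))
m0 <- matrix(rnorm(n_centers * p), n_centers, p) +
  cbind(rep(0, n_centers), rep(1, n_centers))

# Draw n points: Y ~ Bernoulli(1/2), Z uniform on 1:10,
# X | Y = k, Z = l ~ N(m_kl, s^2 I)
generate_sample <- function(n, m0, m1, s) {
  y <- rbinom(n, size = 1, prob = 0.5)
  z <- sample(nrow(m0), n, replace = TRUE)
  centers <- m0[z, , drop = FALSE]                    # start from class-0 centers
  centers[y == 1, ] <- m1[z[y == 1], , drop = FALSE]  # overwrite rows with y = 1
  x <- centers + matrix(rnorm(n * ncol(m0), sd = s), n, ncol(m0))
  data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
}

train <- generate_sample(200, m0, m1, s)
test  <- generate_sample(10000, m0, m1, s)
```

The centers are drawn once and then held fixed, so the same `m0`, `m1`, and `s` can be reused for every training and test sample and for the Bayes rule.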
Repeat the following simulation 20 times. In each simulation, following the data generating process,
1. generate a training sample of size 200 and a test sample of size 10,000, and
2. calculate the training and test errors (the averaged 0/1 error¹) for the following four procedures:
• linear regression with cut-off value 0.5,
• quadratic regression with cut-off value 0.5,
• kNN classification with k chosen by 10-fold cross-validation, and
• the Bayes rule (assume you know the values of the m_{kl}’s and s).
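Two of the procedures listed above, the regression classifiers and the Bayes rule, might be sketched as below. This is only one possible reading: the function names are hypothetical, the data frames are assumed to have columns `x1`, `x2`, and `y` (coded 0/1), and "quadratic regression" is taken to mean a least-squares fit with added squared and interaction terms. Because the priors are equal and s is shared, the Bayes rule reduces to comparing the two sums of Gaussian kernels.

```r
# Misclassification rate: the averaged 0/1 error
err_rate <- function(pred, y) mean(pred != y)

# Least-squares fit of the 0/1 label, thresholded at 0.5.
# quadratic = TRUE adds squared and interaction terms (an assumed reading
# of "quadratic regression").
fit_regression <- function(train, test, quadratic = FALSE) {
  form <- if (quadratic) {
    y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2)
  } else {
    y ~ x1 + x2
  }
  fit <- lm(form, data = train)
  classify <- function(d) as.integer(predict(fit, newdata = d) > 0.5)
  c(train = err_rate(classify(train), train$y),
    test  = err_rate(classify(test),  test$y))
}

# Bayes rule: predict class 1 when
# sum_l exp(-||x - m_1l||^2 / (2 s^2)) exceeds the same sum for class 0.
# x is a numeric matrix with one row per point; m0, m1 hold the centers by row.
bayes_predict <- function(x, m0, m1, s) {
  mix_kernel <- function(m) {
    # squared distances from every row of x to every center
    d2 <- outer(rowSums(x^2), rowSums(m^2), "+") - 2 * x %*% t(m)
    rowSums(exp(-d2 / (2 * s^2)))
  }
  as.integer(mix_kernel(m1) > mix_kernel(m0))
}

# Toy check with one center per class (hypothetical values, not the assignment's):
# the first point lies near m0 and the second near m1.
m0 <- matrix(c(0, 0), 1, 2)
m1 <- matrix(c(3, 3), 1, 2)
x  <- rbind(c(0.1, -0.2), c(2.9, 3.2))
bayes_predict(x, m0, m1, s = 1)
```

With data frames `train` and `test`, the Bayes predictions would be obtained from `as.matrix(test[, c("x1", "x2")])` and compared with `test$y` via `err_rate`.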
Summarize your results on training errors and test errors graphically, e.g., using a boxplot or a strip chart. Also report the mean and standard error of the k values selected across the 20 simulations.
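One way to produce this summary is sketched below. The function names and the assumed layout of `errors` (one row per simulation run, one column per method and sample combination) are hypothetical, and base graphics are used here to keep the sketch dependency-free; a ggplot2 version would follow the same structure.

```r
# Mean and standard error (sd / sqrt(n)) of the k values selected across runs
mean_se <- function(k_selected) {
  c(mean = mean(k_selected),
    se   = sd(k_selected) / sqrt(length(k_selected)))
}

# `errors` is assumed to be a data frame with one row per simulation run and
# one column per method/sample combination, e.g. "linear_train", "linear_test".
plot_errors <- function(errors) {
  boxplot(errors, las = 2, ylab = "0/1 error rate",
          main = "Training and test errors over simulation runs")
}
```

Here the standard error is taken to be the sample standard deviation divided by the square root of the number of runs.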
R packages you are allowed to use are class (for kNN) and ggplot2 (for graphs).
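As a rough illustration of how `knn()` from the class package could be combined with 10-fold cross-validation to choose k, consider the sketch below. The function name `cv_choose_k`, the candidate grid `k_grid`, and the tie-breaking rule are assumptions; the assignment does not prescribe any of them.

```r
library(class)  # provides knn()

# Pick k by 10-fold cross-validation on the training set.
# x: numeric matrix of features; y: vector of 0/1 labels.
# k_grid is an assumed candidate set; the assignment does not fix one.
cv_choose_k <- function(x, y, k_grid = seq(1, 49, by = 2), n_folds = 10) {
  # random assignment of each observation to one of n_folds folds
  fold <- sample(rep(seq_len(n_folds), length.out = nrow(x)))
  cv_err <- sapply(k_grid, function(k) {
    mean(sapply(seq_len(n_folds), function(f) {
      held <- fold == f
      pred <- knn(x[!held, , drop = FALSE], x[held, , drop = FALSE],
                  cl = factor(y[!held]), k = k)
      # knn() returns a factor; convert back to 0/1 before comparing
      mean(as.integer(as.character(pred)) != y[held])
    }))
  })
  # Ties are broken toward the largest k (the smoothest fit); this is a choice.
  max(k_grid[cv_err == min(cv_err)])
}
```

The selected k would then be passed to `knn()` once more, with the full training set as `train`, to obtain the training and test predictions for that simulation run.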
¹ For each sample, the incurred error is 1 if there is a mistake, and 0 otherwise.