I Datasets
Construct your own toy datasets so that the results are easy to visualize. Take the input instances x_n ∈ R and the labels y_n ∈ {1,...,K} (classification) or y_n ∈ R (regression). You can also take x ∈ R^2 and use surface or contour plots.
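As a starting point, toy datasets of this kind might be generated as in the sketch below. The sample size, cluster means, noise levels and the choice f(x) = sin(x) are all illustrative assumptions, not values prescribed by the lab.

```matlab
% Toy datasets for classification and regression (all constants are assumed).
rng(0);                                      % fix the seed so runs are repeatable
N = 100;                                     % points per class / per dataset

% Classification: two classes, each a 1D Gaussian cluster.
Xc = [randn(N,1) - 2; 0.5*randn(N,1) + 1];   % inputs x_n in R
Yc = [ones(N,1); 2*ones(N,1)];               % labels y_n in {1,2}

% Regression: noisy samples of a known function f(x) = sin(x).
Xr = 2*pi*rand(N,1);                         % inputs uniform in [0, 2*pi]
Yr = sin(Xr) + 0.1*randn(N,1);               % targets y_n in R

% 2D inputs: evaluate a function on a grid and show it as a contour plot.
[x1, x2] = meshgrid(linspace(-3, 3, 50));
F = exp(-(x1.^2 + x2.^2)/2);
contour(x1, x2, F); xlabel('x_1'); ylabel('x_2');
```

Knowing the true generating function or distribution makes it possible to compare each estimate against the truth later on.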
II Using nonparametric methods
k-nearest-neighbor (KNN) classifier
In file lab03 knn2.m we apply the KNN classifier to datasets in 2D and plot its results. Explore its behavior in various settings (see suggestions at the end of the file). Note: this is the k-nearest-neighbor classifier, not the k-nearest-neighbor density estimate.
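For reference, the core of a KNN classifier can be sketched in a few lines. This is a minimal illustration, not the contents of lab03 knn2.m; the data, the value k = 5 and the query points are assumptions.

```matlab
% Minimal k-nearest-neighbor classifier in 2D (a sketch with assumed data).
rng(0);
N = 50;
X = [randn(N,2) - 1.5; randn(N,2) + 1.5];    % training inputs, one row per point
Y = [ones(N,1); 2*ones(N,1)];                % training labels in {1,2}
k = 5;                                       % number of neighbors (assumed value)

Xtest = [-1.5 -1.5; 1.5 1.5; 0 0.2];         % query points, one row per point
M = size(Xtest, 1);
Ypred = zeros(M, 1);
for m = 1:M
  diffs = bsxfun(@minus, X, Xtest(m,:));     % offsets from the query to every training point
  d2 = sum(diffs.^2, 2);                     % squared Euclidean distances
  [~, idx] = sort(d2);                       % nearest first
  Ypred(m) = mode(Y(idx(1:k)));              % majority vote among the k nearest labels
end
disp(Ypred');
```

With two classes an odd k avoids tied votes; when ties do occur, mode returns the smallest of the tied labels.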
Kernel density estimate (KDE)
In file lab03 kde1.m we have implemented the following methods for datasets in 1D (i.e., feature vectors x_n ∈ R):
• Histogram with origin x0 ∈ R and bin width h > 0, for density estimation. We plot the resulting density estimate p(x) using a bar chart.
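• Gaussian kernel density estimate with bandwidth h > 0:
$$p(x) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left(\frac{x - x_n}{h}\right) \qquad (1)$$
Gaussian kernel:
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$$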
We plot:
– Density estimation: the resulting density estimate p(x) as a continuous curve in R.
– Classification: the resulting posterior distribution estimate p(k|x) for each class k = 1,...,K as a continuous curve in R, using a different color for each class; and the data points colored according to the predicted label argmax_{k=1,...,K} p(k|x).
– Regression: the resulting regression function g(x) as a continuous curve in R.
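The three quantities plotted above can all be computed from one matrix of kernel values. The sketch below is one common way to do this and is not necessarily how lab03 kde1.m does it. It assumes that class priors are estimated by class proportions, so p(k|x) is the share of kernel mass at x coming from class k, and that g(x) is the kernel-weighted average of the targets (the Nadaraya-Watson form). The data, the bandwidth and the grid are illustrative.

```matlab
% Gaussian KDE in 1D: density, class posteriors and regression (assumed data).
rng(0);
N = 100;
X = [randn(N,1) - 2; randn(N,1) + 2];        % inputs x_n in R
Y = [ones(N,1); 2*ones(N,1)];                % class labels in {1,2}
T = sin(X) + 0.1*randn(2*N,1);               % regression targets
h = 0.5;                                     % bandwidth
x = linspace(-6, 6, 400);                    % evaluation grid (row vector)

Kgauss = @(u) exp(-u.^2/2) / sqrt(2*pi);     % Gaussian kernel
W = Kgauss(bsxfun(@minus, x, X) / h);        % W(n,i) = K((x_i - x_n)/h)

% Density estimate: average of the kernels centered at the data points.
p = sum(W, 1) / (numel(X)*h);

% Class posteriors: fraction of the kernel mass at x contributed by class k.
Kc = max(Y);
post = zeros(Kc, numel(x));
for k = 1:Kc
  post(k,:) = sum(W(Y == k, :), 1) ./ sum(W, 1);
end

% Regression: kernel-weighted average of the targets.
g = (T' * W) ./ sum(W, 1);

subplot(3,1,1); plot(x, p);    ylabel('p(x)');
subplot(3,1,2); plot(x, post); ylabel('p(k|x)');
subplot(3,1,3); plot(X, T, '.', x, g); ylabel('g(x)');
```

Far from all data points sum(W, 1) becomes tiny, so the posterior and regression ratios are numerically fragile there, which is worth keeping in mind when exploring regions without data.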
Using this code, explore histograms and KDEs in various settings (see suggestions at the end of file lab03 kde1.m). Consider the following questions:
• How does the histogram change if you change x0? How does it change if you change h?
• How does the KDE result change if you change h? How do the estimated density p(x) and the regression function g(x) behave for h → 0 and for h → ∞?
• How well does the estimated density p(x) or regression function g(x) approximate the true one?
• How does the regression function g(x) behave near discontinuities in the true function f(x), or in regions x ∈ R that have no data points?
• For classification and for density estimation, how do Gaussian KDEs compare with Gaussian classifiers?
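To experiment with the first question, a histogram density estimate with explicit origin x0 and bin width h can be computed by hand, as in the sketch below. The data and the two origins are assumptions, and the binning uses accumarray rather than hist so that the bin edges x0 + j*h are under direct control.

```matlab
% Histogram density estimate with origin x0 and bin width h (assumed data).
rng(0);
X = randn(200, 1);                           % 1D sample
N = numel(X);
h = 0.5;                                     % bin width
for x0 = [0 0.25]                            % two origins to compare
  j = floor((X - x0) / h);                   % integer bin index of each point
  jmin = min(j); jmax = max(j);
  counts = accumarray(j - jmin + 1, 1, [jmax - jmin + 1, 1]);
  centers = x0 + ((jmin:jmax)' + 0.5) * h;   % bin centers
  p = counts / (N*h);                        % density: the bars have total area 1
  figure; bar(centers, p, 1);                % relative width 1 gives touching bars
  title(sprintf('x_0 = %.2f, h = %.2f', x0, h));
end
```

Changing only x0 shifts the bin edges while the data stay the same, so any difference between the two figures is due to the choice of origin alone.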
Further things to do:
• Extend the code to work with 2D datasets. Use the plots in lab02.m as a guideline.
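• Extend the code to work with 1D datasets but use kernels other than the Gaussian; specifically, use the following two kernels in eq. (1):
Uniform: $K(u) = \begin{cases} \frac{1}{2}, & |u| \le 1 \\ 0, & \text{otherwise} \end{cases}$
Epanechnikov: $K(u) = \begin{cases} \frac{3}{4}\,(1 - u^2), & |u| \le 1 \\ 0, & \text{otherwise} \end{cases}$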
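One possible way to swap kernels is to write each one as a function handle of u = (x - x_n)/h and pass it to a generic estimator. The sketch below assumes the standard 1D definitions of the uniform and Epanechnikov kernels, both supported on |u| ≤ 1 and normalized to integrate to 1; the helper name kde, the data and the bandwidth are hypothetical.

```matlab
% Alternative kernels as function handles (standard 1D definitions assumed).
Kgauss = @(u) exp(-u.^2/2) / sqrt(2*pi);            % Gaussian
Kunif  = @(u) 0.5 * (abs(u) <= 1);                  % uniform on [-1,1]
Kepan  = @(u) 0.75 * (1 - u.^2) .* (abs(u) <= 1);   % Epanechnikov

rng(0);
X = randn(100, 1);                                  % 1D sample (assumed data)
h = 0.4;                                            % bandwidth (assumed value)
x = linspace(-5, 5, 500);                           % evaluation grid

% Generic KDE: kde(K) evaluates the density estimate on the grid x for kernel K.
kde = @(K) sum(K(bsxfun(@minus, x, X) / h), 1) / (numel(X)*h);

plot(x, kde(Kgauss), x, kde(Kunif), x, kde(Kepan));
legend('Gaussian', 'uniform', 'Epanechnikov');
```

Because both alternative kernels have finite support, the resulting estimate is exactly zero wherever no data point lies within distance h, unlike the Gaussian case.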
The following Matlab functions will be useful (among others): hist, bar, randn, rand, find, linspace, scatter, mode.