Introduction To Machine Learning
1. (20 points) Let X = {x1,...,xn} be a set of n samples drawn i.i.d. from a univariate distribution with density function p(x|θ), where θ is an unknown parameter. In general, θ will belong to a specified subset of R, the set of real numbers. For the following choices of p(x|θ), derive the maximum likelihood estimate of θ based on the samples X:[1]
(a) (5 points)
(b) (5 points)
(c) (5 points) $p(x|\theta) = \theta x^{\theta-1}$, $0 \le x \le 1$, $0 < \theta < \infty$.
(d) (5 points) $p(x|\theta) = \frac{1}{\theta}$, $0 \le x \le \theta$, $\theta > 0$.
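As a reminder of the general setup (a sketch of the standard recipe, not a solution to any part above), the likelihood of i.i.d. samples factorizes, and the maximum likelihood estimate maximizes its logarithm. The symbols $L$, $\ell$, and $\hat{\theta}$ are notation introduced here for illustration.

```latex
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta), \qquad
\hat{\theta} = \arg\max_{\theta} \, \ell(\theta).
```

When $\ell(\theta)$ is differentiable on the whole parameter set, a candidate for $\hat{\theta}$ is typically found by solving $\frac{d\ell}{d\theta} = 0$ and checking that the solution is a maximum. When the support of $p(x|\theta)$ depends on θ, the maximizer may lie on a boundary of the feasible set rather than at a stationary point, so the constraints should be checked first.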
2. (20 points) Let X = {x1,...,xn}, xi ∈ R^d, be a set of n samples drawn i.i.d. from a multivariate Gaussian distribution in R^d with mean µ ∈ R^d and covariance matrix Σ ∈ R^(d×d). Recall that the density function of a multivariate Gaussian distribution is given by:
$$p(x|\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right).$$
(a) (10 points) Derive the maximum likelihood estimates for the mean µ and covariance Σ based on the sample set X.1,2
(b) (5 points) Let $\hat{\mu}_n$ be the maximum likelihood estimate of the mean. Is $\hat{\mu}_n$ a biased estimate of the true mean µ? Clearly justify your answer by computing $E[\hat{\mu}_n]$.
(c) (5 points) Let $\hat{\Sigma}_n$ be the maximum likelihood estimate of the covariance matrix. Is $\hat{\Sigma}_n$ a biased estimate of the true covariance Σ? Clearly justify your answer by computing $E[\hat{\Sigma}_n]$.
3. (10 points) Table 1 specifies the misclassification costs for a 3-class problem including a ‘Reject’ option. Assume that a model has been trained using training data, and that the model can output posterior probabilities P(C1|x_test), P(C2|x_test), P(C3|x_test) for any given test point x_test.
(a) (5 points) Assume λ = 10. For a given x_test, let the posterior probabilities for the three classes be P(C1|x_test) = 0.5, P(C2|x_test) = 0.25, P(C3|x_test) = 0.25. Using Table 1, compute the risks of predicting x_test to be C1, C2, C3, and ‘Reject’, respectively. Including ‘Reject’ as a possible option, what would your predicted class for x_test be? Show the details of your computation and justify your answer.
              Predicted Class
        C1      C2      C3      ‘Reject’
C1      0       1       1       λ
C2      10      0       10      λ
C3      100     100     0       λ
Table 1: Misclassification costs for a 3-class problem including a ‘Reject’ option.
(b) (5 points) Assume λ = 5. For a given x_test, let the posterior probabilities for the three classes be P(C1|x_test) = 0.4, P(C2|x_test) = 0.5, P(C3|x_test) = 0.1. Using Table 1, compute the risks of predicting x_test to be C1, C2, C3, and ‘Reject’, respectively. Including ‘Reject’ as a possible option, what would your predicted class for x_test be? Show the details of your computation and justify your answer.
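To illustrate how a risk computation of this kind can be organized, here is a minimal sketch. It assumes that the rows of Table 1 index the true class and the columns index the predicted action, so the risk of an action is the posterior-weighted sum of its column. The value of λ and the posteriors below are hypothetical and deliberately differ from those in parts (a) and (b); the function name expected_risks is also an assumption.

```python
import numpy as np

def expected_risks(loss, posteriors):
    """Return the expected risk of each action.

    loss[i, j] is the cost of taking action j when the true class is i
    (assumed reading of Table 1); posteriors[i] is P(C_i | x).
    """
    return posteriors @ loss

lam = 2.0  # hypothetical reject cost, not the value used in the questions
loss = np.array([
    [0.0,   1.0,   1.0,  lam],   # true class C1
    [10.0,  0.0,   10.0, lam],   # true class C2
    [100.0, 100.0, 0.0,  lam],   # true class C3
])
posteriors = np.array([0.7, 0.2, 0.1])  # hypothetical posteriors

actions = ["C1", "C2", "C3", "Reject"]
risks = expected_risks(loss, posteriors)
best = actions[int(np.argmin(risks))]  # action with the smallest expected risk

for action, risk in zip(actions, risks):
    print(f"R({action} | x) = {risk:.2f}")
print("Minimum-risk action:", best)
```

Because the posteriors sum to one, the risk of ‘Reject’ under this reading is always λ, so the decision reduces to comparing λ with the smallest class risk.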
Programming assignment:
The next problem involves programming. For Question 3, we will be using the 2-class classification datasets Boston50 and Boston75, and the 10-class classification dataset Digits, all of which were used in Homework 1.
3. (50 points) We will develop two parametric classifiers by modeling each class’s conditional distribution p(x|Ci) as a multivariate Gaussian with (a) full covariance matrix Σi and (b) diagonal covariance matrix Σi. In particular, using the training data, we will compute the maximum likelihood estimate of the class prior probabilities p(Ci) and the class conditional probabilities p(x|Ci) based on the maximum likelihood estimates of the mean $\hat{\mu}_i$ and the (full/diagonal) covariance $\hat{\Sigma}_i$ for each class Ci. The classification will be done based on the following discriminant function:
$$g_i(x) = \log p(C_i) + \log p(x|C_i).$$
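One way the discriminant could be evaluated numerically is sketched below. It works directly with the log of the Gaussian density, using a log-determinant and a linear solve instead of an explicit matrix inverse. The function names gaussian_log_density and discriminant are hypothetical, and the sketch assumes the covariance matrix is invertible; an estimated covariance may be singular (for example, when a feature is constant within a class), in which case some form of regularization would be needed.

```python
import numpy as np

def gaussian_log_density(X, mean, cov):
    """Log density of N(mean, cov) at each row of X (shape (n, d)).

    Assumes cov is symmetric positive definite.
    """
    d = mean.shape[0]
    diff = X - mean                                  # (n, d) deviations from the mean
    _, logdet = np.linalg.slogdet(cov)               # log |cov|, computed stably
    solved = np.linalg.solve(cov, diff.T).T          # rows hold cov^{-1} (x - mean)
    maha = np.sum(diff * solved, axis=1)             # squared Mahalanobis distances
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def discriminant(X, prior, mean, cov):
    """g_i(x) = log p(C_i) + log p(x | C_i) for each row of X."""
    return np.log(prior) + gaussian_log_density(X, mean, cov)
```

A prediction would then take, for each test point, the class whose discriminant value is largest.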
We will develop code for a class MultiGaussClassify with two key functions:
MultiGaussClassify.fit(self,X,y,diag) and MultiGaussClassify.predict(self,X).
For fit(self,X,y,diag), the inputs (X,y) are respectively the feature matrix and class labels, and diag is a boolean (True or False) that indicates whether the estimated class covariance matrices should be full matrices (diag=False) or diagonal matrices (diag=True).
For predict(self,X), the input X is the feature matrix corresponding to the test set, and the output should be the predicted labels for each point in the test set.
For the class, the __init__(self,k,d) function can initialize the parameters for each class to be uniform prior, zero mean, and identity covariance, i.e., p(Ci) = 1/k, µi = 0 and Σi = I, i = 1,...,k. Here, the number of classes k and the dimensionality d of the features are passed as arguments to the constructor of MultiGaussClassify.
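The initialization described above could look roughly like the following sketch. Only the constructor is shown; the attribute names priors, means, and covs are assumptions made for illustration and are not prescribed by the assignment.

```python
import numpy as np

class MultiGaussClassify:
    def __init__(self, k, d):
        self.k = k                                   # number of classes
        self.d = d                                   # feature dimensionality
        self.priors = np.full(k, 1.0 / k)            # uniform prior p(C_i) = 1/k
        self.means = np.zeros((k, d))                # zero mean for every class
        self.covs = np.tile(np.eye(d), (k, 1, 1))    # one d x d identity per class
```

Storing the covariances as a single array of shape (k, d, d) keeps the full and diagonal cases in the same layout, since a diagonal covariance is just a full matrix with zeros off the diagonal.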
We will compare the performance of three models:
(i) MultiGaussClassify with full class covariance matrices,
(ii) MultiGaussClassify with diagonal covariance matrices, and
(iii) LogisticRegression[2]
applied to three datasets: Boston50, Boston75, and Digits. Using my_cross_val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of error rates across folds for the three models applied to the three classification datasets. You will have to submit (a) code and (b) summary of results:
(a) Code: You will have to submit code for MultiGaussClassify as well as a wrapper code hw2q3(). For the class, please use the following template:
    class MultiGaussClassify:
        def __init__(self, k, d):
            ...
        def fit(self, X, y, diag=False):
            ...
        def predict(self, X):
            ...
Your class MultiGaussClassify should not inherit from any base class in sklearn. Again, the three functions you must implement in the MultiGaussClassify class are __init__, fit, and predict.
The wrapper code hw2q3() (main file) has no input and is used to prepare the datasets and make calls to my_cross_val(method,X,y,k) to generate the error rate results for each dataset and each method. The code for my_cross_val(method,X,y,k) must be yours (e.g., code you developed in HW1 with modifications as needed) and you cannot use cross_val_score() in sklearn. For the method argument in my_cross_val, you can call the method corresponding to MultiGaussClassify with full covariance matrix as just ‘multigaussclassify’ and the method corresponding to MultiGaussClassify with diagonal covariance matrix as ‘multigaussdiagclassify.’
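For orientation, a k-fold loop of the kind described above might be organized as in the sketch below. It keeps the argument order (method, X, y, k), but it assumes that method is an object exposing fit and predict rather than a string name, and it shuffles with a fixed seed; both choices are assumptions of this sketch, not requirements of the assignment. MajorityClassifier is a hypothetical stand-in model used only to exercise the loop.

```python
import numpy as np

def my_cross_val(method, X, y, k):
    """Return an array with the test error rate of `method` on each of k folds."""
    n = X.shape[0]
    rng = np.random.default_rng(0)            # fixed seed (assumption) for repeatable folds
    folds = np.array_split(rng.permutation(n), k)
    errors = np.empty(k)
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        method.fit(X[train_idx], y[train_idx])
        y_pred = method.predict(X[test_idx])
        errors[i] = np.mean(y_pred != y[test_idx])
    return errors

class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]

    def predict(self, X):
        return np.full(X.shape[0], self.label_)

if __name__ == "__main__":
    X_demo = np.zeros((20, 2))
    y_demo = np.array([0] * 15 + [1] * 5)
    print(my_cross_val(MajorityClassifier(), X_demo, y_demo, 5))
```

Mapping the string names ‘multigaussclassify’ and ‘multigaussdiagclassify’ to configured model objects could be done inside the function or in the wrapper; the sketch leaves that choice open.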
The results should be printed to the terminal (without generating an additional file in the folder). Make sure the calls to my_cross_val(method,X,y,k) are made in the following order, and print a line to the terminal before each call to show which method and dataset are being used:
1. MultiGaussClassify with full covariance matrix on Boston50,
2. MultiGaussClassify with full covariance matrix on Boston75,
3. MultiGaussClassify with full covariance matrix on Digits,
4. MultiGaussClassify with diagonal covariance matrix on Boston50,
5. MultiGaussClassify with diagonal covariance matrix on Boston75,
6. MultiGaussClassify with diagonal covariance matrix on Digits,
7. LogisticRegression with Boston50,
8. LogisticRegression with Boston75, and
9. LogisticRegression with Digits.
For example, the first call to my_cross_val(method,X,y,k) should result in the following output:
Error rates for MultiGaussClassify with full covariance matrix on Boston50:
Fold 1: ###
Fold 2: ###
...
Fold 5: ###
Mean: ###
Standard Deviation: ###
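A small helper that produces output in this layout is sketched below. The function name format_report and the four-decimal formatting are assumptions; the standard deviation uses NumPy's default (population) convention, which is also an assumption since the assignment does not say which convention to use.

```python
import numpy as np

def format_report(title, errors):
    """Build the per-fold report as a single string, one line per entry."""
    lines = [f"Error rates for {title}:"]
    for i, err in enumerate(errors, start=1):
        lines.append(f"Fold {i}: {err:.4f}")
    lines.append(f"Mean: {np.mean(errors):.4f}")
    lines.append(f"Standard Deviation: {np.std(errors):.4f}")  # ddof=0 by default
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical error rates, used only to show the layout.
    print(format_report("MultiGaussClassify with full covariance matrix on Boston50",
                        [0.10, 0.20, 0.30, 0.20, 0.20]))
```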
(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Each column of the table represents a fold, and add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:
Error rates for MGC with full cov matrix on Boston50
Fold 1    Fold 2    Fold 3    Fold 4    Fold 5    Mean    SD
#         #         #         #         #         #       #
Additional instructions: Code can only be written in Python (not IPython notebook); no other programming languages will be accepted. One should be able to execute all programs directly from the command prompt (e.g., “python3 hw2q3.py”) without the need to run the Python interactive shell first. Test your code yourself before submission and suppress any warning messages that may be printed. Your code must run on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu). Please make sure you specify the version of Python you are using, as well as instructions on how to run your program, in the README file (which must be readable through a text editor such as Notepad). Information on the size of the datasets, including the number of data points and the dimensionality of the features, as well as the number of classes, can be readily extracted from the datasets in scikit-learn. Each function must take the inputs in the order specified in the problem and display the output via the terminal or as specified.