CS 273A: Machine Learning
This homework (and many subsequent ones) will involve data analysis and reporting on methods and results using Python code. You must submit a single PDF file containing everything to Gradescope, and associate each page of the PDF with the problem it answers. The PDF should include any text you wish to include to describe your results, the complete code snippets showing how you attempted each problem, any figures that were generated, and scans of any work on paper that you wish to include. It is important that you include enough detail that we know how you solved the problem, since otherwise we will be unable to grade it.
I recommend that you use Jupyter/iPython notebooks to write your report. They will help you ensure that all of the code for the solutions is included, and they provide an easy way to export your results to a PDF file [1]. I recommend liberal use of Markdown cells to create headers for each problem and sub-problem, to explain your implementation and answers, and to include any mathematical equations. For parts of the homework you do on paper, scan your work so that it is legible (there are a number of free Android/iOS scanning apps, if you do not have access to a scanner), and include it as an image in the iPython notebook [2]. If you have any questions or concerns about using iPython, ask us on Campuswire. If you decide not to use iPython notebooks and instead use Microsoft Word or LaTeX to create your PDF file, you must make sure all of the answers can be generated from the code snippets included in the document.
Problem 0: Get Connected
Please visit our class forum on Campuswire: https://campuswire.com/p/GAF58E3D6. Campuswire is the place to post your questions and discussions, rather than emailing me or the TAs, since chances are that other students have the same or similar questions and will be helped by seeing the discussion. Remember that your Campuswire participation will also be taken into account for the participation grade. You do not need to mention anything about this in the report; we will be able to check whether you have visited Campuswire.
Problem 1: Python & Data Exploration
In this problem, we will explore some basic statistics and visualizations of an example data set. First, download the zip file for Homework 1, which contains some course code (the mltools directory) and the “Fisher iris” data set, and load the latter into Python:
import numpy as np
import matplotlib.pyplot as plt
iris = np.genfromtxt("data/iris.txt",delimiter=None) # load the text file
Y = iris[:,-1] # target value is the last column
X = iris[:,0:-1] # features are the other columns
The Iris data consist of four real-valued features used to predict which of three types of iris flower was measured (a three-class classification problem).
1. Use X.shape to get the number of features and the number of data points. Report both numbers, mentioning which number is which.
2. For each feature, plot a histogram (plt.hist) of the data values.
3. Compute the mean & standard deviation of the data points for each feature (np.mean, np.std).
4. For each pair of features (1,2), (1,3), and (1,4), plot a scatterplot (see plt.plot or plt.scatter) of the feature values, colored according to their target value (class). (For example, plot all data points with y = 0 as blue, y = 1 as green, etc.)
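The quantities in the steps above can be sketched on a small synthetic array standing in for the Iris features. The random data and the names X_demo, Y_demo, and masks are assumptions made for illustration; they are not part of the course code or the real data set.

```python
import numpy as np

# Synthetic stand-in for the Iris arrays (assumption: 30 points, 4 features, 3 classes)
rng = np.random.RandomState(0)
X_demo = rng.randn(30, 4)
Y_demo = np.repeat([0.0, 1.0, 2.0], 10)

m, n = X_demo.shape              # rows are data points, columns are features
mu = np.mean(X_demo, axis=0)     # one mean per feature (column)
sigma = np.std(X_demo, axis=0)   # one standard deviation per feature

# Bin counts for the first feature; plt.hist draws the same kind of counts as bars
counts, edges = np.histogram(X_demo[:, 0], bins=10)

# Boolean masks select the rows belonging to each class, e.g. for coloring a scatterplot
masks = {c: (Y_demo == c) for c in np.unique(Y_demo)}
```

With masks like these, one call such as plt.scatter(X_demo[mask, 0], X_demo[mask, 1]) per class, each with its own color, gives a scatterplot colored by target value.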
Problem 2: kNN predictions
In this problem, you will continue to use the Iris data and explore a kNN classifier using the provided knnClassify Python class. While doing the problem, please explore the implementation to become familiar with how it works. First, we will shuffle and split the data into training and validation subsets:
iris = np.genfromtxt("data/iris.txt",delimiter=None) # load the data
Y = iris[:,-1]
X = iris[:,0:-1]
# Note: indexing with ":" indicates all values (in this case, all rows);
# indexing with a value ("0", "1", "-1", etc.) extracts only that value (here, columns);
# indexing rows/columns with a range ("1:-1") extracts any row/column in that range.
import mltools as ml
# We'll use some data manipulation routines in the provided class code
# Make sure the "mltools" directory is in a directory on your Python path, e.g.,
# export PYTHONPATH=${PYTHONPATH}:/path/to/parent/dir
# or add it to your path inside Python:
# import sys
# sys.path.append('/path/to/parent/dir/');
X,Y = ml.shuffleData(X,Y); # shuffle data randomly
# (This is a good idea in case your data are ordered in some pathological way,
# as the Iris data are)
Xtr,Xva,Ytr,Yva = ml.splitData(X,Y, 0.75); # split data into 75/25 train/validation
You may also find it useful to set the random number seed at the beginning (in general, for every assignment), e.g., np.random.seed(0), to ensure consistent behavior each time.
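To illustrate why seeding matters, here is a minimal sketch of a seeded shuffle followed by a 75/25 split, written in plain NumPy. The function shuffle_and_split and the toy arrays are hypothetical; this is not the mltools implementation of shuffleData or splitData, only an example of the same idea.

```python
import numpy as np

def shuffle_and_split(X, Y, train_fraction=0.75, seed=0):
    """Shuffle rows of X and Y together, then split into train/validation parts."""
    rng = np.random.RandomState(seed)       # fixed seed gives the same permutation every run
    perm = rng.permutation(X.shape[0])      # random ordering of the row indices
    X, Y = X[perm], Y[perm]                 # apply the same ordering to features and targets
    n_train = int(round(train_fraction * X.shape[0]))
    return X[:n_train], X[n_train:], Y[:n_train], Y[n_train:]

# Toy data (assumption): 20 points, 2 features, alternating labels
X_demo = np.arange(40, dtype=float).reshape(20, 2)
Y_demo = np.arange(20) % 2
Xtr_demo, Xva_demo, Ytr_demo, Yva_demo = shuffle_and_split(X_demo, Y_demo)
```

Calling the function twice with the same seed returns identical splits, which is the consistent behavior the seed is meant to provide.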
Learner Objects. Our learners (the parameterized functions that do the prediction) will be defined as Python objects, derived from either an abstract classifier or abstract regressor class. The abstract base classes have a few useful functions, such as computing error rates or other measures of quality. More importantly, the learners will all follow a generic behavioral pattern, allowing us to train the function on a data set (i.e., set the parameters of the model to perform well on those data), and make predictions on a data set.
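The train/predict pattern can be sketched with a tiny nearest-neighbor learner. The class TinyKNN below is a hypothetical illustration written for this example; it is not the provided knnClassify class and does not derive from the course's abstract base classes.

```python
import numpy as np

class TinyKNN:
    """Illustrative k-nearest-neighbor classifier following a train/predict pattern."""

    def train(self, X, Y, K=1):
        # kNN "training" just stores the data and the neighborhood size
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y)
        self.K = K
        self.classes = np.unique(self.Y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every query point to every training point
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d2, axis=1)[:, :self.K]   # indices of the K closest points
        votes = self.Y[nearest]                         # their class labels
        # Count votes per class; argmax picks the most common (first class on ties)
        counts = np.stack([(votes == c).sum(axis=1) for c in self.classes], axis=1)
        return self.classes[np.argmax(counts, axis=1)]

# Toy data (assumption): two well-separated clusters
Xtr_demo = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
Ytr_demo = np.array([0, 0, 1, 1])
knn_demo = TinyKNN().train(Xtr_demo, Ytr_demo, K=1)
Yhat_demo = knn_demo.predict(np.array([[0.05, 0.1], [0.9, 1.2]]))
```

Here the first query point lies next to the class-0 cluster and the second next to the class-1 cluster, so the predictions follow the nearest stored points.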
You can now build and train a kNN classifier on Xtr,Ytr and make predictions on some data Xva with it:
knn = ml.knn.knnClassify() # create the object and train it
knn.train(Xtr, Ytr, K) # where K is an integer, e.g. 1 for nearest neighbor prediction
YvaHat = knn.predict(Xva) # get estimates of y for each data point in Xva
# Alternatively, the constructor provides a shortcut to "train":
knn = ml.knn.knnClassify( Xtr, Ytr, K );
YvaHat = knn.predict( Xva );
If your data are 2D, you can visualize a data set and a classifier’s decision regions using e.g.,
ml.plotClassify2D( knn, Xtr, Ytr );
# make 2D classification plot with data (Xtr,Ytr)
This function plots the training data as points colored according to their labels, then calls knn's predict function on a densely spaced grid of points in the 2D space, and uses the results to produce the background color. Calling the function with knn=None will plot only the data.
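The grid-evaluation idea described above can be sketched as follows. The helper grid_predictions and the stand-in classifier toy_predict are hypothetical names introduced for this example; the sketch shows the general technique and is not a description of how plotClassify2D is actually implemented.

```python
import numpy as np

def grid_predictions(predict_fn, X, n_grid=50):
    """Evaluate predict_fn on an n_grid x n_grid grid spanning the 2D data X."""
    x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), n_grid)
    x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), n_grid)
    G1, G2 = np.meshgrid(x1, x2)                      # coordinates of every grid point
    grid = np.column_stack([G1.ravel(), G2.ravel()])  # one row per grid point
    labels = predict_fn(grid)                         # predicted class at each grid point
    return G1, G2, labels.reshape(G1.shape)           # reshape back to the grid layout

# Hypothetical stand-in for a trained classifier: class 1 when x1 + x2 > 1
def toy_predict(X):
    return (X[:, 0] + X[:, 1] > 1).astype(int)

X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
G1, G2, Z = grid_predictions(toy_predict, X_demo, n_grid=5)
```

The resulting arrays could be passed to a filled-contour call such as plt.contourf(G1, G2, Z) to color the background by predicted class, with the data points drawn on top.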
1. Modify the code listed above to use only the first two features of X (e.g., let X be only the first two columns of iris, instead of the first four), and visualize (plot) the classification boundary for varying values of K = [1, 5, 10, 50] using plotClassify2D.
2. Again using only the first two features, compute the error rate (fraction of misclassified points) on both the training and validation data as a function of K = [1, 2, 5, 10, 50, 100, 200]. You can do this most easily with a for-loop:
K=[1,2,5,10,50,100,200];
for i,k in enumerate(K):
    learner = ml.knn.knnClassify(... # TODO: complete code to train model
    Yhat = learner.predict(... # TODO: predict results on training data
    errTrain[i] = ... # TODO: count what fraction of predictions are wrong
    #TODO: repeat prediction / error evaluation for validation data
plt.semilogx(... #TODO: average and plot results on semi-log scale
Plot the resulting error rate functions using a semi-log plot (semilogx), with training error in red and validation error in green. Based on these plots, what value of K would you recommend?
3. Provide the same plots as in the previous part, but using all the features in the data set. Are the plots very different? Is your recommendation different?
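As a reminder of what "fraction of predictions that are wrong" means in code, here is a minimal sketch on made-up labels. The function error_rate and the demo arrays are assumptions for illustration only.

```python
import numpy as np

def error_rate(Y_true, Y_pred):
    """Fraction of predictions that differ from the true labels."""
    return np.mean(np.asarray(Y_true) != np.asarray(Y_pred))

# Made-up labels: predictions disagree with the truth at positions 1 and 4
Y_true_demo = np.array([0, 1, 2, 1, 0])
Y_pred_demo = np.array([0, 2, 2, 1, 1])
err_demo = error_rate(Y_true_demo, Y_pred_demo)   # 2 of 5 predictions are wrong
```

One reasonable design is to preallocate an array such as np.zeros(len(K)) for each of the training and validation errors, fill one entry per value of K inside the loop, and pass both arrays to plt.semilogx with the list K on the horizontal axis.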
Problem 3: Naïve Bayes Classifiers
In order to reduce my email load, I decide to implement a machine learning algorithm to decide whether or not I should read an email, or simply file it away instead. To train my model, I obtain the following data set of binary-valued features about each email, including whether I know the author or not, whether the email is long or short, and whether it has any of several key words, along with my final decision about whether to read it (y = +1 for “read”, y = −1 for “discard”).
x1            x2        x3              x4           x5             y
know author?  is long?  has ‘research’  has ‘grade’  has ‘lottery’  read?
0             0         1               1            0              -1
1             1         0               1            0              -1
0             1         1               1            1              -1
1             1         1               1            0              -1
0             1         0               0            0              -1
1             0         1               1            1              1
0             0         1               0            0              1
1             0         0               0            0              1
1             0         1               1            0              1
1             1         1               1            1              -1
In the case of any ties, we will prefer to predict class +1.
I decide to try a naïve Bayes classifier to make my decisions and compute my uncertainty.
1. Compute all the probabilities necessary for a naïve Bayes classifier, i.e., the class probability p(y) and all the individual feature probabilities p(xi|y), for each class y and feature xi.
2. Which class would be predicted for x = (0 0 0 0 0)? What about for x = (1 1 0 1 0)?
3. Compute the posterior probability that y = +1 given the observation x = (1 1 0 1 0).
4. Why should we probably not use a “joint” Bayes classifier (using the joint probability of the features x, as opposed to a naïve Bayes classifier) for these data?
5. Suppose that, before we make our predictions, we lose access to my address book, so that we cannot tell whether the email author is known. Should we re-train the model, and if so, how? (e.g.: how does the model, and its parameters, change in this new situation?) Hint: what will the naïve Bayes model over only features x2 . . . x5 look like, and what will its parameters be?
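The mechanics of a naïve Bayes classifier with binary features can be sketched on a small made-up data set. The functions fit_naive_bayes, joint, and posterior and the toy arrays below are hypothetical and deliberately use different data from the email table; probabilities are estimated as plain empirical frequencies, which is an assumption of this sketch.

```python
import numpy as np

def fit_naive_bayes(X, Y):
    """Estimate p(y) and p(x_i = 1 | y) by empirical frequencies for binary features."""
    params = {}
    for c in np.unique(Y):
        rows = X[Y == c]
        params[int(c)] = (len(rows) / len(Y),   # class probability p(y = c)
                          rows.mean(axis=0))    # p(x_i = 1 | y = c) for each feature i
    return params

def joint(params, x, c):
    """p(y = c) * prod_i p(x_i | y = c) under the naive Bayes independence assumption."""
    prior, p1 = params[c]
    return prior * np.prod(np.where(x == 1, p1, 1.0 - p1))

def posterior(params, x, c):
    """p(y = c | x), obtained by normalizing the joint over all classes."""
    total = sum(joint(params, x, k) for k in params)
    return joint(params, x, c) / total

# Hypothetical toy data (NOT the email table): 2 binary features, labels +1 / -1
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
Y_toy = np.array([+1, +1, +1, -1, -1, -1])
params = fit_naive_bayes(X_toy, Y_toy)
x_new = np.array([1, 0])
post_plus = posterior(params, x_new, +1)
prediction = +1 if post_plus >= 0.5 else -1    # ties go to class +1
```

For this toy data the two joint values are 1/9 for class +1 and 1/18 for class −1, so the posterior for class +1 is 2/3 and the prediction is +1.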