CS178 - Machine Learning & Data Mining - Homework 1: Python & Data Exploration

Problem 1: Python & Data Exploration (20 points)

In this problem, we will compute some basic statistics and create visualizations of an example data set. First, download the zip file for Homework 1, which contains some course code (the mltools directory) and a dataset of New York area real estate sales, “nyc_housing”. Load the data into Python:

import numpy as np
import matplotlib.pyplot as plt

nych = np.genfromtxt("data/nyc_housing.txt", delimiter=None)  # load the text file
Y = nych[:,-1]      # target value (NYC borough) is the last column
X = nych[:,0:-1]    # features are the other columns
These data are from the “NYC Open Data” initiative, and consist of three real-valued features and a class value Y representing in which of three boroughs the house or apartment was located (Manhattan, the Bronx, or Staten Island).

1.    Use X.shape to get the number of features and the number of data points. Report both numbers, mentioning which number is which. (5 points)

2.    For each feature, plot a histogram (plt.hist) of the data values. (5 points)

3.    Compute the mean & standard deviation of the data points for each feature (np.mean, np.std). (5 points)

4.    For each pair of features (1,2), (1,3), and (2,3), plot a scatterplot (see plt.plot or plt.scatter) of the feature values, colored according to their target value (class). (For example, plot all data points with y = 0 as blue, y = 1 as green, and y = 2 as red.) A sketch of one possible approach to parts 1-4 follows this list. (5 points)
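A minimal sketch of parts 1-4, assuming the loading code above; bin counts, plot titles, and the loop structure are illustrative choices, not requirements:

import numpy as np
import matplotlib.pyplot as plt

nych = np.genfromtxt("data/nyc_housing.txt", delimiter=None)
Y = nych[:, -1]      # target value (NYC borough)
X = nych[:, 0:-1]    # features

# Part 1: shape is (number of data points, number of features)
print(X.shape)

# Part 2: one histogram per feature
for i in range(X.shape[1]):
    plt.hist(X[:, i], bins=20)
    plt.title("Feature {}".format(i + 1))
    plt.show()

# Part 3: per-feature mean and standard deviation
print("means:", np.mean(X, axis=0))
print("std. devs:", np.std(X, axis=0))

# Part 4: pairwise scatterplots, colored by class (y=0 blue, y=1 green, y=2 red)
colors = ['b', 'g', 'r']
for (i, j) in [(0, 1), (0, 2), (1, 2)]:
    for c in [0, 1, 2]:
        mask = (Y == c)
        plt.scatter(X[mask, i], X[mask, j], c=colors[c], label="y = {}".format(c))
    plt.xlabel("feature {}".format(i + 1))
    plt.ylabel("feature {}".format(j + 1))
    plt.legend()
    plt.show()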

Problem 2: k-nearest-neighbor predictions (25 points)
In this problem, you will continue to use the NYC Housing data and create a k-nearest-neighbor (kNN) classifier using the provided knnClassify Python class. While completing this problem, please explore the implementation to become familiar with how it works.

First, we will shuffle and split the data into training and validation subsets:

nych = np.genfromtxt("data/nyc_housing.txt", delimiter=None)  # load the data
Y = nych[:,-1]
X = nych[:,0:-1]
# Note: indexing with ":" indicates all values (in this case, all rows);
# indexing with a value ("0", "1", "-1", etc.) extracts only that value (here, columns);
# indexing rows/columns with a range ("1:-1") extracts any row/column in that range.

import mltools as ml
# We'll use some data manipulation routines in the provided class code.
# Make sure the "mltools" directory is in a directory on your Python path, e.g.,
#   export PYTHONPATH=${PYTHONPATH}:/path/to/parent/dir
# or add it to your path inside Python:
#   import sys
#   sys.path.append('/path/to/parent/dir/')

np.random.seed(0)           # set the random number seed
X,Y = ml.shuffleData(X,Y);  # shuffle data randomly
# (This is a good idea in case your data are ordered in some systematic way.)

Xtr,Xva,Ytr,Yva = ml.splitData(X,Y, 0.75);  # split data into 75/25 train/validation

Make sure to set the random number seed to 0 before calling shuffleData as in the example above (and in general, for every assignment). This ensures consistent behavior each time the code is run.

Learner Objects

Our learners (the parameterized functions that do the prediction) will be defined as Python objects, derived from either an abstract classifier or abstract regressor class. The abstract base classes have a few useful functions, such as computing error rates or other measures of quality. More importantly, the learners will all follow a generic behavioral pattern, allowing us to train the function on one data set (i.e., set the parameters of the model to perform well on those data), and then make predictions on another data set.

You can now build and train a kNN classifier on Xtr, Ytr and make predictions on some data Xva with it:

knn = ml.knn.knnClassify()  # create the object and train it
knn.train(Xtr, Ytr, K)      # where K is an integer, e.g. 1 for nearest-neighbor prediction
YvaHat = knn.predict(Xva)   # get estimates of y for each data point in Xva

# Alternatively, the constructor provides a shortcut to "train":
knn = ml.knn.knnClassify( Xtr, Ytr, K );
YvaHat = knn.predict( Xva );
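Once trained, a simple way to estimate the error rate is to compare predictions with the true labels (this comparison is a standard numpy pattern, not part of the provided code; the base classes' own error-rate helpers mentioned above can be used instead):

YvaHat = knn.predict(Xva)
errVa = np.mean(YvaHat != Yva)   # fraction of validation points misclassified
print("Validation error rate:", errVa)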

If your data are 2D, you can visualize the data set and a classifier’s decision regions using the function

ml.plotClassify2D( knn, Xtr, Ytr );  # make 2D classification plot with data (Xtr,Ytr)

This function plots the training data as colored points according to their labels, then calls knn's predict function on a densely spaced grid of points in the 2D space, and uses this to produce the background color. Calling the function with knn=None will plot only the data.
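For example, to plot just the data without any decision regions:

ml.plotClassify2D( None, Xtr, Ytr );  # plot the data only, no classifier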

1.    Modify the code listed above to use only the first two features of X (e.g., let X be only the first two columns of nych, instead of the first three), and visualize (plot) the classification boundary for varying values of K = [1, 5, 10, 50] using plotClassify2D (a sketch appears after part 3 below). (10 points)

2.    Again using only the first two features, compute the error rate (number of misclassifications) on both the training and validation data as a function of K = [1, 2, 5, 10, 50, 100, 200]. You can do this most easily with a for-loop:

K = [1, 2, 5, 10, 50, 100, 200]
errTrain = [None]*len(K)          # (preallocate storage for training error)
for i, k in enumerate(K):
    learner = ml.knn.knnClassify(...  # TODO: complete code to train model
    Yhat = learner.predict(...        # TODO: predict results on training data
    errTrain[i] = ...                 # TODO: count what fraction of predictions are wrong
    # TODO: repeat prediction / error evaluation for validation data

plt.semilogx(...  # TODO: average and plot results on semi-log scale

Plot the resulting error rate functions using a semi-log plot (semilogx), with training error in red and validation error in green. Based on these plots, what value of K would you recommend? (One possible completion of the skeleton is sketched after part 3.) (10 points)

3. Create the same error rate plots as the previous part, but with all the features in the dataset. Are the plots very different? Is your recommendation for the best K different? (5 points)
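One possible completion of parts 1 and 2 above, assuming X and Y are the shuffled arrays from the earlier listing (the two-feature slice, the K values, and the red/green colors follow the problem statement; the remaining details are a sketch, not the required solution):

import numpy as np
import matplotlib.pyplot as plt
import mltools as ml

X2 = X[:, 0:2]                                  # only the first two features
Xtr, Xva, Ytr, Yva = ml.splitData(X2, Y, 0.75)

# Part 1: decision boundaries for several values of K
for k in [1, 5, 10, 50]:
    knn = ml.knn.knnClassify(Xtr, Ytr, k)
    ml.plotClassify2D(knn, Xtr, Ytr)
    plt.title("K = {}".format(k))
    plt.show()

# Part 2: training / validation error rate as a function of K
K = [1, 2, 5, 10, 50, 100, 200]
errTrain = [None]*len(K)
errVa = [None]*len(K)
for i, k in enumerate(K):
    learner = ml.knn.knnClassify(Xtr, Ytr, k)
    errTrain[i] = np.mean(learner.predict(Xtr) != Ytr)   # training error rate
    errVa[i] = np.mean(learner.predict(Xva) != Yva)      # validation error rate

plt.semilogx(K, errTrain, 'r-', label='training')
plt.semilogx(K, errVa, 'g-', label='validation')
plt.xlabel('K'); plt.ylabel('error rate'); plt.legend()
plt.show()

For part 3, the same loop can be re-run with Xtr and Xva built from all the features.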

Problem 3: Naïve Bayes Classifiers (35 points)
In order to reduce my email load, I decide to implement a machine learning algorithm to decide whether or not I should read an email, or simply file it away instead. To train my model, I obtain the following data set of binary-valued features about each email, including whether I know the author or not, whether the email is long or short, and whether it has any of several key words, along with my final decision about whether to read it (y = +1 for “read”, y = −1 for “discard”).
x1            x2         x3              x4           x5             y
know author?  is long?   has ‘research’  has ‘grade’  has ‘lottery’  read?
0             0          1               1            0              -1
1             1          0               1            0              -1
0             1          1               1            1              -1
1             1          1               1            0              -1
0             1          0               0            0              -1
1             0          1               1            1              +1
0             0          1               0            0              +1
1             0          0               0            0              +1
1             0          1               1            0              +1
1             1          1               1            1              -1
I decide to try a naïve Bayes classifier to make my decisions and compute my uncertainty. In the case of any ties where both classes have equal probability, we will prefer to predict class +1.
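For reference, the prediction rule used throughout this problem is the standard naïve Bayes rule (nothing assignment-specific): for each class y, compare

p(y | x1, ..., x5)  ∝  p(y) p(x1|y) p(x2|y) p(x3|y) p(x4|y) p(x5|y),

where p(y) is estimated as the fraction of training rows with label y, and p(xi|y) as the fraction of class-y rows having that value of xi.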

1.    Compute all the probabilities necessary for a naïve Bayes classifier, i.e., the class probability p(y) and all the individual feature probabilities p(xi|y), for each class y and feature xi. (7 points)

2.    Which class would be predicted for x = (0 0 0 0 0)? What about for x = (1 1 0 1 0)? (7 points)

3.    Compute the posterior probability that y = +1 given the observation x = (0 0 0 0 0). Also compute the posterior probability that y = +1 given the observation x = (1 1 0 1 0). (7 points)

4.    Why should we probably not use a “joint” Bayes classifier (using the joint probability of the features x, as opposed to the conditional independencies assumed by naïve Bayes) for these data? (7 points)

5.    Suppose that before we make our predictions, we lose access to my address book, so that we cannot tell whether the email author is known. Do we need to re-train the model to classify based solely on the other four features? If so, how? If not, what changes about how our trained parameters are used? Hint: what parameters do I need for a naïve Bayes model over only features x2, . . . , x5? Do I need to re-calculate any new parameter values in our new setting? What, if anything, changes about the parameters or the way they are used? (7 points)

Problem 4: Gaussian Bayes Classifiers (15 points)
Now, using the NYC Housing data, we will explore a classifier based on Bayes rule. Again, we'll use only the first two features of NYC Housing, shuffled and split into training and validation sets as before.

1.    Splitting your training data by class, compute the empirical mean vector and covariance matrix of the data in each class. (You can use np.mean and np.cov for this.) (5 points)

2.    Plot a scatterplot of the data, coloring each data point by its class, and use plotGauss2D to plot contours on your scatterplot for each class, i.e., plot a Gaussian contour for each class using its empirical parameters, in the same color you used for those data points. (5 points)

3.    Visualize the classifier and its boundaries that result from applying Bayes rule, using

bc = ml.bayes.gaussClassify( Xtr, Ytr );
ml.plotClassify2D(bc, Xtr, Ytr);

Also compute the empirical error rate (number of misclassified points) on the training and validation data. (5 points)
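A minimal sketch of parts 1 and 3, assuming the two-feature training/validation split from Problem 2 (np.mean, np.cov, and the error computation are standard numpy; gaussClassify and plotClassify2D are used exactly as in the listing above; part 2's plotGauss2D call is omitted here because its exact signature is not shown in this text):

import numpy as np
import mltools as ml

# Part 1: empirical mean vector and covariance matrix for each class
for c in np.unique(Ytr):
    Xc = Xtr[Ytr == c]               # training points belonging to class c
    mu = np.mean(Xc, axis=0)         # empirical mean vector
    cov = np.cov(Xc, rowvar=False)   # empirical covariance matrix
    print("class", c, "mean:", mu)
    print("covariance:\n", cov)

# Part 3: Gaussian Bayes classifier, decision regions, and error rates
bc = ml.bayes.gaussClassify(Xtr, Ytr)
ml.plotClassify2D(bc, Xtr, Ytr)
errTr = np.mean(bc.predict(Xtr) != Ytr)   # training error rate
errVa = np.mean(bc.predict(Xva) != Yva)   # validation error rate
print("train error:", errTr, " validation error:", errVa)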
