Network-Programming- Homework 1 Solved

Cite all outside sources of information and ideas.

Written Problems
1.   (5 points) Show what the recursive decision tree learning algorithm would choose for the first split of the following dataset:



Assume that the criterion for deciding the best split is entropy reduction (i.e., information gain). If there are any ties, choose the first feature to split on tied for the best score. Show your calculations in your response.

(Hint: this dataset is one of the test cases in the programming assignment, so you should be able to use your answer here to debug your code.)
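For illustration only (the dataset table above is not reproduced here, so these numbers are hypothetical), the calculation for a single binary feature looks like this. Suppose four examples have labels y = (1, 1, 0, 0) and a candidate feature has values x = (1, 1, 1, 0). Then

H(Y) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit,
H(Y | X = 1) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.918,  H(Y | X = 0) = 0,
G(Y, X) = 1 − (3/4)(0.918) − (1/4)(0) ≈ 0.311 bits.

You would repeat this for every feature and split on the one with the largest gain.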

2.   A Bernoulli distribution has the following likelihood function for a data set D:

                                                                    p(D|θ) = θ^N1 (1 − θ)^N0,                                                                                   (1)

where N1 is the number of instances in data set D that have value 1 and N0 is the number in D that have value 0. The maximum likelihood estimate is

                                                                    θ̂ = N1 / (N1 + N0).                                                                                        (2)

(a)    (5 points) Derive the maximum likelihood estimate above by solving for the maximum of the likelihood. I.e., show the mathematics that get from Equation (1) to Equation (2).
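For reference, a minimal sketch of one standard route (take the log, differentiate, set the derivative to zero; other correct write-ups are possible):

log p(D|θ) = N1 log θ + N0 log(1 − θ)
d/dθ [log p(D|θ)] = N1/θ − N0/(1 − θ) = 0
⇒ N1 (1 − θ) = N0 θ  ⇒  θ̂ = N1 / (N1 + N0).

Because the logarithm is monotone, maximizing the log-likelihood maximizes the likelihood itself.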

(b)   (5 points) Suppose we now want to maximize a posterior likelihood

                                                                    p(θ|D) ∝ p(D|θ) p(θ),                                                                              (3)

where we use the Bernoulli likelihood and a (slight variant[1] of a) symmetric Beta prior over the Bernoulli parameter

                                                                    p(θ) ∝ θ^α (1 − θ)^α.                                                                                  (4)

Derive the maximum a posteriori (MAP) estimate.
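A sketch, assuming the intended quantity is the mode of the posterior under the prior in Equation (4):

p(θ|D) ∝ θ^(N1 + α) (1 − θ)^(N0 + α)
log p(θ|D) = (N1 + α) log θ + (N0 + α) log(1 − θ) + const
(N1 + α)/θ − (N0 + α)/(1 − θ) = 0  ⇒  θ̂ = (N1 + α) / (N1 + N0 + 2α).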

Programming Assignment
For this homework, you will build two text categorization classifiers: one using naive Bayes and the other using decision trees. You will write general code for cross-validation that will apply to either of your classifiers.

Data and starter code: In the HW1 archive, you should find the 20newsgroups data set (also available from the original source http://qwone.com/~jason/20Newsgroups/). This data set, whose origin is somewhat fuzzy, consists of newsgroup posts from an earlier era of the Internet. The posts are in different categories, and this data set has become a standard benchmark for text classification methods.

The data is represented in a bag-of-words format, where each post is represented by what words are present in it, without any consideration of the order of the words.

We have also provided a unit test class in tests.py, which contains unit tests for each type of learning model. These unit tests may be easier to use for debugging in an IDE like PyCharm than the iPython notebook. A successful implementation should pass all unit tests and run through the entire iPython notebook without issues. You can run the unit tests from a *nix command line with the command

python -m unittest -v tests

or you can use an IDE’s unit test interface. These tests are not foolproof, so it’s possible for code that does not meet the requirements for full credit to pass the tests (and, though it would be surprising, it may be possible for full credit code to fail the tests).

Before starting all the tasks, examine the entire codebase. Follow the code from the iPython notebook to see which methods it calls. Make sure you understand what all of the code does. Your required tasks follow.

1.   (0 points) Examine the iPython notebook test_predictors.ipynb. This notebook uses the learning algorithms and predictors you will implement in the first part of the assignment. Read through the data-loading code and the experiment code to make sure you understand how each piece works.

2.   (0 points) Examine the function calculate_information_gain in decision_tree.py. The function takes in training data and training labels and computes the information gain for each feature. That is, for each feature dimension, compute

G(Y, X_j) = H(Y) − H(Y | X_j)
          = −Σ_y Pr(Y = y) log Pr(Y = y)
            + Σ_{x_j} Pr(X_j = x_j) Σ_y Pr(Y = y | X_j = x_j) log Pr(Y = y | X_j = x_j).                                   (5)

Your function should return the vector

                                                                     [G(Y, X_1), ..., G(Y, X_d)].                                                                                     (6)



Algorithm 1 Recursive procedure to grow a classification tree

1: function FITTREE(D, depth)
2:     if not worth splitting (because D is all one class or max depth is reached) then
3:         node.prediction ← argmax_c Σ_{(x,y) ∈ D} I(y = c)
4:         return node
5:     w ← argmax_{w'} G(Y, X_{w'})                  ▷ see Equation (5)
6:     node.test ← w
7:     node.left ← FITTREE(D_L, depth + 1)           ▷ where D_L := {(x,y) ∈ D | x_w = 0}
8:     node.right ← FITTREE(D_R, depth + 1)          ▷ where D_R := {(x,y) ∈ D | x_w = 1}
9:     return node

You will use this function to do feature selection and as a subroutine for decision tree learning. Note how the function avoids loops over the dataset and only loops over the number of classes. Follow this style to avoid slow Python loops; use numpy array operations whenever possible.
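As a point of reference, here is a minimal vectorized sketch in the same spirit. This is not the provided calculate_information_gain; the function name, the d x n 0/1 data layout, and the use of base-2 logarithms are assumptions made for illustration.

import numpy as np

def information_gain_sketch(data, labels):
    # Illustrative sketch only. Assumes data is a d x n array of 0/1 word-presence
    # features and labels is a length-n integer vector of class labels.
    d, n = data.shape
    classes = np.unique(labels)

    def entropy(p):
        # Entropy of a probability vector/array, treating 0 log 0 as 0.
        with np.errstate(divide='ignore', invalid='ignore'):
            logs = np.where(p > 0, np.log2(p), 0.0)
        return -np.sum(p * logs, axis=0)

    # H(Y): loop only over the classes, never over the examples
    class_probs = np.array([np.mean(labels == c) for c in classes])
    h_y = entropy(class_probs)

    gains = np.full(d, h_y)
    for value in (0, 1):                        # binary features: X_j in {0, 1}
        mask = (data == value)                  # d x n boolean
        counts = mask.sum(axis=1)               # per-feature count of X_j = value
        # Pr(Y = c | X_j = value): one row per class, one column per feature
        cond = np.array([(mask & (labels == c)).sum(axis=1) for c in classes])
        with np.errstate(divide='ignore', invalid='ignore'):
            cond = np.where(counts > 0, cond / counts, 0.0)
        gains -= (counts / n) * entropy(cond)   # subtract Pr(X_j = value) H(Y | X_j = value)
    return gains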

3. (5 points) Finish the functions naive_bayes_train and naive_bayes_predict in naive_bayes.py. The training algorithm should find the maximum likelihood parameters for the probability distribution

                                                 Pr(y_i = c | x_i) = [ Pr(y_i = c) Π_{w ∈ W} Pr(x_iw | y_i = c) ] / Pr(x_i).

Make sure to use log-space representation for these probabilities, since they will become very small, and notice that you can accomplish the goal of naive Bayes learning without explicitly computing the marginal probability Pr(x_i) in the denominator. In other words, you can predict the most likely class label without explicitly computing that quantity.

Implement additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) for your naive Bayes learner. One natural way to do this is to let the input parameter params simply be the prior count for each word. For a parameter α, this would mean your maximum likelihood estimates for any Bernoulli variable X would be

                                                        Pr(X = 1) = ((# examples where X = 1) + α) / ((total # of examples) + 2α).

Notice that if α = 0, you get the standard maximum likelihood estimate. Also, make sure to multiply α by the total number of possible outcomes in the distribution. For the label variables in the 20newsgroups data, there are 20 possible outcomes, and for the word-presence features, there are two.
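The following is a compact sketch of how the smoothed, log-space estimates might fit together. The argument names, the d x n data layout, and the dictionary-based model object are assumptions for illustration, not the required naive_bayes_train/naive_bayes_predict interface.

import numpy as np

def naive_bayes_train_sketch(train_data, train_labels, alpha=1.0):
    # train_data: d x n binary word-presence array; train_labels: length-n class labels.
    d, n = train_data.shape
    classes = np.unique(train_labels)
    num_classes = len(classes)

    # Smoothed log prior over the class labels (denominator uses num_classes * alpha)
    class_counts = np.array([np.sum(train_labels == c) for c in classes])
    log_prior = np.log(class_counts + alpha) - np.log(n + num_classes * alpha)

    # Smoothed Bernoulli estimates Pr(word present | class), stored in log space
    word_counts = np.array([train_data[:, train_labels == c].sum(axis=1) for c in classes])
    log_present = np.log(word_counts + alpha) - np.log(class_counts[:, None] + 2 * alpha)
    log_absent = np.log(class_counts[:, None] - word_counts + alpha) - np.log(class_counts[:, None] + 2 * alpha)

    return {'classes': classes, 'log_prior': log_prior,
            'log_present': log_present, 'log_absent': log_absent}

def naive_bayes_predict_sketch(data, model):
    # Pick the class with the highest unnormalized log posterior; Pr(x_i) is never computed.
    scores = (model['log_prior'][:, None]
              + model['log_present'] @ data
              + model['log_absent'] @ (1 - data))
    return model['classes'][np.argmax(scores, axis=0)]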

4.   (5 points) Finish the functions recursive_tree_train and decision_tree_predict in decision_tree.py. Note that recursive_tree_train is a helper function used by decision_tree_train, which is already completed for you. You’ll have to design a way to represent the decision tree in the model object. Your training algorithm should take a parameter that is the maximum depth of the decision tree, and the learning algorithm should then greedily grow a tree of that depth. Use the information-gain measure to determine the branches (hint: you’re welcome to use the calculate_information_gain function). Algorithm 1 is abstract pseudocode describing one way to implement decision tree training. You are welcome to deviate from this somewhat; there are many ways to correctly implement such procedures.

The pseudocode suggests building a tree data structure that stores in each node either (1) a prediction or (2) a word to split on and child nodes. The pseudocode also includes the formula for the entropy criterion for selecting which word to split on.

The prediction function should have an analogous recursion, where it receives a data example and a node. If the node has children, the function should determine which child to recursively predict with. If it has no children, it should return the prediction stored at the node.
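One possible node representation and the matching recursive prediction, sketched under the assumption of a simple node class (this is only one of many acceptable designs; the class and function names are hypothetical):

class TreeNode:
    # A leaf stores `prediction`; an internal node stores the word index to test
    # plus `left` (word absent) and `right` (word present) children.
    def __init__(self):
        self.prediction = None
        self.test = None
        self.left = None
        self.right = None

def tree_predict_sketch(example, node):
    # Recursive prediction for one example, given as a length-d 0/1 feature vector.
    if node.left is None and node.right is None:
        return node.prediction                      # leaf: return the stored class
    if example[node.test] == 0:
        return tree_predict_sketch(example, node.left)
    return tree_predict_sketch(example, node.right)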

5.   (5 points) Finish the function cross_validate in crossval.py, which takes a training algorithm, a prediction algorithm, a data set, labels, parameters, and the number of folds as input and performs cross-fold validation using that many folds. For example, calling

params['alpha'] = 1.0
score = cross_validate(naive_bayes_train, naive_bayes_predict, train_data, train_labels, 10, params)

will compute the 10-fold cross-validation accuracy of naive Bayes using regularization parameter α = 1.0.

The cross-validation should split the input data set into folds subsets, where folds is the given number of folds. Then iteratively hold out each subset: train a model using all data except that subset and evaluate the accuracy on the held-out subset. The function should return the average accuracy over all folds.

Some code to manage the indexing of the splits is included. You are welcome to change it if you prefer a different way of organizing the indexing.
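For orientation, a minimal sketch of the overall loop, assuming a d x n data layout and assuming the training and prediction algorithms are called with (data, labels, params) and (data, model) respectively; the provided indexing code and the real argument order may differ:

import numpy as np

def cross_validate_sketch(train_alg, predict_alg, data, labels, folds, params):
    n = data.shape[1]
    fold_indices = np.array_split(np.arange(n), folds)   # roughly equal index groups

    accuracies = []
    for held_out in fold_indices:
        train_idx = np.setdiff1d(np.arange(n), held_out)  # everything except this fold
        model = train_alg(data[:, train_idx], labels[train_idx], params)
        predictions = predict_alg(data[:, held_out], model)
        accuracies.append(np.mean(predictions == labels[held_out]))
    return np.mean(accuracies)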

Once you complete this last step, you should be able to run the notebook cv_predictors.ipynb, which should use cross-validation to compare decision trees to naive Bayes on the 20-newsgroups task. Naive Bayes should be much more accurate than decision trees, but the cross-validation should find a decision tree depth that performs a bit better than the depth hard-coded into test_predictors.ipynb.



[1] For convenience, we are using the exponent of α instead of the standard α−1.
