CSCI567 Homework 4: Generative Models, Mixture Density Models, and Naive Bayes

1         Generative models
[Recommended maximum time spent: 1 hour]

Q1.1 Let X ∈ R be a random variable. We assume that it is uniformly distributed on some unknown interval (0,θ], where θ > 0. In particular,

P(X = x|θ) = 1/θ, if x ∈ (0,θ], and 0 otherwise,                                      (1)
           = (1/θ) · 1[x ∈ (0,θ]],                                                    (2)

where 1[·] is an indicator function that outputs 1 when the condition is true, and 0 otherwise.

Suppose x1,x2,...,xN are drawn i.i.d. from this distribution. Write down the likelihood of the observations P(x1,...,xN). What is the maximum likelihood (ML) estimate of θ? Give a sentence or two explaining why your equation corresponds to the maximum likelihood estimate.

What to submit: The equation for the likelihood (at most two lines), the final equation of the ML estimate for θ (one line), and at most two sentences of explanation.

Q1.2 Now suppose X is distributed according to a mixture of two uniform distributions: one on some unknown interval (0,θ1] and the other on (0,θ2], where 0 < θ1 ≤ θ2. In particular,

                                                           P(X = x) = ω1U(X = x|θ1) + ω2U(X = x|θ2),                               (3)

where U is the uniform distribution defined as in Eq. (1) or Eq. (2), and ω1, ω2 are mixture weights such that

                                                                      ω1 ≥ 0,ω2 ≥ 0, and ω1 + ω2 = 1.                                          (4)
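
As a concrete illustration of Eqs. (1)-(3) only (not part of any required answer), a minimal Python sketch for evaluating these densities at a point might look as follows; the function names and parameters are hypothetical.

def uniform_density(x, theta):
    # Eq. (1)/(2): U(X = x | theta) = (1/theta) * 1[x in (0, theta]]
    return (1.0 / theta) if 0 < x <= theta else 0.0

def mixture_density(x, theta1, theta2, w1, w2):
    # Eq. (3): P(X = x) = w1 * U(X = x | theta1) + w2 * U(X = x | theta2)
    return w1 * uniform_density(x, theta1) + w2 * uniform_density(x, theta2)

# Example: with theta1 = 1, theta2 = 2 and w1 = w2 = 0.5,
# mixture_density(0.5, 1.0, 2.0, 0.5, 0.5) = 0.5*1.0 + 0.5*0.5 = 0.75
# mixture_density(1.5, 1.0, 2.0, 0.5, 0.5) = 0.5*0.0 + 0.5*0.5 = 0.25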

Suppose x1,x2,...,xN are drawn i.i.d. from this mixture of uniform distributions. We will use an EM algorithm to derive the ML estimates of the parameters in this model. Answer the following three questions.

•    First, what is the form of P(k|xn,θ1,θ2,ω1,ω2), where k ∈ {1,2} indicates the corresponding mixture component? Your answer should be explicit. You may include the indicator function as in Eq. (2) and the U function as in Eq. (3).

•    Second, what is the form of the expected complete-data log-likelihood of the observations {x1,...,xN}, given that the parameters from the last EM iteration are {θ1^OLD, θ2^OLD, ω1^OLD ≥ 0, ω2^OLD ≥ 0}, where ω1^OLD + ω2^OLD = 1? Your answer should be explicit. You may include the indicator function as in Eq. (2) and the U function as in Eq. (3).

Hint: Please refer to page 34/44 and other related pages of lecture 14.

•    Third, what are the forms of the M-step updates for θ1 and θ2? You may use P^OLD(k|xn) = P(k|xn, θ1^OLD, θ2^OLD, ω1^OLD, ω2^OLD), where k ∈ {1,2}, to simplify your derivation/answer if needed. You may treat log 0 = −∞ and 0 × log 0 = 0 in this question.

What to submit: The form of P(k|xn,θ1,θ2,ω1,ω2) (at most two lines), the form of the expected complete-data log-likelihood (at most five lines), and the forms of the M-step updates (at most ten lines).

2         Mixture density models
[Recommended maximum time spent: 1 hour]

Consider a density model given by a mixture distribution

P(x) = Σ_{k=1}^K πk P(x|k),   where πk ≥ 0 ∀k ∈ [K] and Σ_{k=1}^K πk = 1,

and suppose that we partition the vector x into two parts so that x = (xa, xb). Show that the conditional density P(xb|xa) is itself a mixture distribution. That is,

P(xb|xa) = Σ_{k=1}^K λk P(xb|xa, k),   where λk ≥ 0 ∀k ∈ [K] and Σ_{k=1}^K λk = 1.
Q2.1      Find an expression for the mixing coefficients λk of the component densities in terms of πk and P(xa|k). Do not forget to verify that your answer obeys the constraints on λk mentioned above.

Hint: a) λk is a function of xa rather than a constant. b) You may use Bayes' rule in your derivation.

What to submit: No more than ten lines of derivation that leads to an expression for the mixing coefficients λk.

3         The connection between GMM and K-means
[Recommended maximum time spent: 1 hour]

Consider a Gaussian mixture model (GMM) in which all components share the (diagonal) covariance Σ = σ²I, together with the K-means algorithm introduced in lecture 14.

Q3.1 In the case where both the GMM and the K-means algorithm have K components and the parameters πk are pre-defined to be nonzero ∀k ∈ [K], show that in the limit σ → 0, maximizing the following expected complete-data log-likelihood w.r.t. {µ1,...,µK} for the GMM model

Σ_{n=1}^N Σ_{k=1}^K γ(znk) log p(xn, zn = k) = Σ_{n=1}^N Σ_{k=1}^K γ(znk) [log πk + log N(xn|µk, σ²I)],

where γ(znk) denotes the responsibility (posterior probability) of component k for xn, is equivalent (up to a scaling or constant factor) to minimizing the distortion measure J w.r.t. {µ1,...,µK} for the K-means algorithm, given by

J = Σ_{k=1}^K Σ_{n=1}^N rnk ||xn − µk||₂²,   where rnk = 1 if k = argmin_j ||xn − µj||₂², and rnk = 0 otherwise.

Hint: Start by showing that γ(znk) → rnk as σ → 0. Note that for this question the only set of parameters to learn for the GMM is {µ1,...,µK}. Any term independent of {µ1,...,µK} can be treated as a constant.

What to submit: Derivation and reasoning with less than ten lines.

4         Naive Bayes
[Recommended maximum time spent: 1 hour]

Recall the naive Bayes model we have seen in class. Given a random variable X ∈ RD and a dependent class variable Y ∈ [C], the joint distribution of features X and class Y is defined as

P(X = x, Y = c) = P(Y = c) P(X = x|Y = c)                                         (5)

                = P(Y = c) Π_{d=1}^D P(Xd = xd|Y = c).                            (6)

In this problem, we consider a naive Bayes model where each feature xd of each class c is modeled as a (different) Gaussian. That is,

P(Xd = xd|Y = c) = (1/√(2πσcd²)) exp(−(xd − µcd)²/(2σcd²)),                       (7)

where µcd and σcd are the mean and the standard deviation, respectively.

Moreover, we model Y as a multinomial distribution with parameter π (Note: this π is a parameter, whereas π in Eqn. (7) is a constant). That is,

P(Y = c; π) = πc,   where πc ≥ 0 ∀c ∈ [C] and Σ_{c=1}^C πc = 1.                   (8)

Q4.0        (Warm-up) What are the parameters to be learned in this model?

What to submit: Nothing.

Q4.1 Given the dataset {(xn ∈ R^D, yn ∈ [C])}_{n=1}^N, assumed to be drawn i.i.d. from this model, write down explicitly the expression for the joint log-likelihood.

What to submit: The expression for the joint log-likelihood (no more than ten lines). Parameters to be learned should all be in the expression.

Q4.2 Using the objective in Q4.1, derive the maximum likelihood (ML) estimates of the parameters. You should be able to compute these estimates from the training data {(xn ∈ R^D, yn ∈ [C])}_{n=1}^N.

What to submit: The final equation of ML estimate as well as your short (no more than five lines) derivation for each parameter.

Programming component

5         High-level descriptions
5.1        Datasets
For Sect. 6 and 7, we will use the 2-dimensional synthetic data hw4_circle.json and hw4_blob.json to perform clustering and density estimation. For Sect. 8, we will use the email dataset released during the 2001 Enron investigation for (binary) spam classification. See the text of the corresponding sections for more details.

5.2        Tasks
You will be asked to implement the K-means algorithm for clustering (Sect. 6), the Gaussian mixture model for density estimation (Sect. 7), and the naive Bayes model for binary classification (Sect. 8). Specifically, you will

•    For Sect. 6: finish implementing the function kmeans_clustering and related functions in kmeans.py. Run the script kmeans.sh, which will output kmeans.json. Add, commit, and push kmeans.json and kmeans.py.

•    For Sect. 7: finish implementing the function gmm_clustering in gmm.py. Run the script gmm.sh, which will output gmm.json. Add, commit, and push gmm.json and gmm.py.

•    For Sect. 8: finish implementing three functions in nb_spam.py: nb_train, nb_predict, and nb_train_smoothing. Run the scripts nb1.sh and nb2.sh, which will output nb1.json and nb2.json, respectively. Add, commit, and push nb_spam.py, nb1.json, and nb2.json.

As in the previous homework, you are not responsible for loading/pre-processing data; we have done that for you. For specific instructions, please refer to text in Sect. 6, 7, 8, and the instructions in their corresponding scripts.

5.3        Cautions
Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.

6         K-means
As introduced in class, the K-means algorithm tries to minimize the following distortion measure (or objective function):

J = Σ_{k=1}^K Σ_{n=1}^N rnk ||xn − µk||₂²,

where rnk is an indicator variable:

rnk = 1, if k = argmin_j ||xn − µj||₂², and rnk = 0 otherwise,

and {µ1,...,µK} are the cluster centers, each with the same dimensionality as the data points.
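
For orientation only, a rough Lloyd-style update loop that minimizes J is sketched below; the variable names and signature are assumptions, and the graded kmeans_clustering must follow the template's interface and random seeds instead.

import numpy as np

def kmeans_sketch(X, K, max_iter=100, seed=0):
    # X: (N, D) array of data points.  A rough outline of the K-means loop only.
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)  # random initialization
    assign = None
    for _ in range(max_iter):
        # Assignment step: r_nk = 1 for the closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # stop once no cluster assignment changes
        assign = new_assign
        # Update step: each center moves to the mean of its assigned points.
        for k in range(K):
            if np.any(assign == k):  # keep the old center if a cluster becomes empty
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign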



Q6.1  Implement K-means using random initialization for cluster centers. Run the algorithm on the two 2-dimensional synthetic datasets hw4_circle.json and hw4_blob.json (see Fig. 1 for an illustration) with different values of K ∈ {2,3,5}. The algorithm should run until none of the cluster assignments change.

What to do and submit: Finish the function kmeans_clustering and related functions in kmeans.py. Please do not change the random seeds set in our template code. Detailed implementation instructions are provided in the python script. Then, run kmeans.sh, which will generate a kmeans.json file containing the cluster centers and the cluster assignment of each data point for all values of K. Add, commit, and push the modified kmeans.py and kmeans.json before the due date.

Q6.2 You are encouraged to plot the clustering assignments with different colors and markers. Does the algorithm work well in separating the two circles in the hw4_circle.json dataset, and why?

What to submit: Nothing.

7         Gaussian Mixture Models
Q7.1 In this problem, you will implement an EM algorithm to fit a Gaussian mixture model on the hw4_blob.json dataset.

What to do and submit: Finish the implementation of the function gmm_clustering in gmm.py. Detailed implementation instructions are provided in the python script. Then, run gmm.sh. It will run 5 rounds of your EM algorithm with the number of components K = 3 and generate a gmm.json file containing the means and covariance matrices of all three Gaussian components. Add, commit, and push the modified gmm.py and gmm.json before the due date.
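
For reference only, one EM iteration for a GMM with full covariances (standard formulas; all shapes and names below are assumptions, not the required gmm_clustering interface or initialization) could be sketched as follows.

import numpy as np

def em_step(X, pi, mu, sigma):
    # One generic EM iteration for a GMM.
    # X: (N, D) data; pi: (K,) mixing weights; mu: (K, D) means; sigma: (K, D, D) covariances.
    N, D = X.shape
    K = len(pi)
    resp = np.zeros((N, K))
    # E-step: responsibilities gamma(z_nk) proportional to pi_k * N(x_n | mu_k, Sigma_k).
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(sigma[k])
        quad = np.einsum('nd,de,ne->n', diff, inv, diff)     # Mahalanobis terms
        norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(sigma[k]))
        resp[:, k] = pi[k] * norm * np.exp(-0.5 * quad)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights, means, and covariances from the responsibilities.
    Nk = resp.sum(axis=0)
    pi_new = Nk / N
    mu_new = (resp.T @ X) / Nk[:, None]
    sigma_new = np.zeros_like(sigma)
    for k in range(K):
        diff = X - mu_new[k]
        sigma_new[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return pi_new, mu_new, sigma_new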

Q7.2 You are encouraged to plot the most likely clustering assignments by different colors and markers. Are the results from those 5 rounds the same? Why or why not?

What to submit: Nothing.

8         Naive Bayes
We will build a (binary) spam classifier using a naive Bayes model that classifies each document as “Spam” (1) or “Ham” (0).

Dataset            We will use the dataset drawn from emails released during the 2001 Enron investigation.

In what follows, we will describe this dataset in detail. However, note that we have provided a python function that handles reading these files for you. In the directory enron, you will find

•    Word ID to word mapping in ind_to_vocab.txt. Specifically, each line consists of a word ID and its corresponding actual word (space-separated). There are 158,963 words in the vocabulary.

•    20,229 training emails in train_features.txt and their labels in train_labels.txt.

•    6,743 validation emails in val_features.txt and their labels in val_labels.txt.

•    6,744 sampled test emails in sampled_test_features.txt and their labels in sampled_test_labels.txt.

0
word_id_0,1 frequency_word_id_0,1
word_id_0,2 frequency_word_id_0,2
...
word_id_0,D0 frequency_word_id_0,D0
#
1
word_id_1,1 frequency_word_id_1,1
word_id_1,2 frequency_word_id_1,2
...
word_id_1,D1 frequency_word_id_1,D1
#
...
N
word_id_N,1 frequency_word_id_N,1
...
word_id_N,DN frequency_word_id_N,DN
#

Figure 2: Format of *_features.txt

The input format of *_features.txt is shown in Fig. 2. Each of these features files consists of multiple documents separated by #. Each document starts with its ID n, followed by Dn lines, each consisting of a word ID and its frequency (space-separated). For example, the word with ID word_id_3,5 appears in document 3 frequency_word_id_3,5 times. The labels of these documents are in *_labels.txt, in which each line consists of a document ID and its label (0 or 1).

Q8.1 Understand Data Format In nb_spam.py, we have provided the function load_docs, which reads documents (words and their corresponding frequencies) for you. This function takes in two file names (features and labels) and outputs two lists. The first list contains the documents, each of which is a list of (word, frequency) tuples. The second contains the labels of those documents. Your task is to read this function to understand the format of the data you will be working with.
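
For instance, a quick sanity check on this structure might look like the sketch below; the paths and variable names are assumptions for illustration, not part of the required code.

from nb_spam import load_docs  # load_docs is the provided reader in nb_spam.py

# Hypothetical paths; adjust to wherever the enron data actually lives.
docs, labels = load_docs('enron/train_features.txt', 'enron/train_labels.txt')

# docs[n] is a list of (word, frequency) tuples for document n; labels[n] is 0 (Ham) or 1 (Spam).
print(len(docs), len(labels))     # 20,229 training documents
print(docs[0][:3], labels[0])     # first few (word, frequency) pairs of document 0 and its label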



Q8.2 Train and Predict Please finish the implementation of the functions nb_train and nb_predict in nb_spam.py:

•    nb_train estimates the parameters of the naive Bayes model, including the prior probability of a document being Ham or Spam as well as the class-specific word probability distributions.

•    nb_predict uses the estimates from nb_train to predict which class each new document belongs to.

Please follow these implementation details. First, use the log-likelihood instead of the likelihood to avoid numerical underflow issues. Moreover, if your model decides that a document has an equal chance of being Ham (0) or Spam (1), it should predict Spam (1).

Finally, we will have to take care of unseen words, defined as words that do not appear in the training documents. In this question, we will ignore all unseen words during prediction. In addition, if you encounter computing the log of zero, replace the output with -1e50. This scenario can happen when a word appears in the training documents of one class but not in those of the other class.
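
As a rough, non-authoritative starting point, the training and prediction logic described above could be sketched as follows; the dictionary layout and helper names are assumptions, and the graded nb_train and nb_predict must match the interfaces in nb_spam.py.

import math
from collections import defaultdict

def nb_train_sketch(docs, labels):
    # Class priors and per-class word counts (no smoothing here).
    n_docs = len(labels)
    class_count = defaultdict(int)                              # number of documents per class
    word_count = {0: defaultdict(int), 1: defaultdict(int)}     # per-class word -> total count
    total_words = {0: 0, 1: 0}                                  # total word occurrences per class
    for doc, y in zip(docs, labels):
        class_count[y] += 1
        for word, freq in doc:
            word_count[y][word] += freq
            total_words[y] += freq
    log_prior = {c: math.log(class_count[c] / n_docs) for c in (0, 1)}
    return log_prior, word_count, total_words

def nb_predict_sketch(model, docs):
    log_prior, word_count, total_words = model
    preds = []
    for doc in docs:
        scores = {}
        for c in (0, 1):
            score = log_prior[c]
            for word, freq in doc:
                if word not in word_count[0] and word not in word_count[1]:
                    continue                                    # ignore words unseen in training
                p = word_count[c][word] / total_words[c]
                score += freq * (math.log(p) if p > 0 else -1e50)
            scores[c] = score
        preds.append(1 if scores[1] >= scores[0] else 0)        # ties go to Spam (1)
    return preds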

What to do and submit: Finish the implementation of the functions nb_train and nb_predict in nb_spam.py. Then, run the script nb1.sh, which will generate an nb1.json file containing training, validation, and test accuracies. You should see that all accuracies are above 97%. Add, commit, and push the modified nb_spam.py and nb1.json before the due date.

Q8.3 Train with smoothing We now take another approach to dealing with unseen words: a smoothing technique. Let α > 0 be a continuous-valued smoothing parameter. For each word in the vocabulary in each class, you will assume that you have seen it α times before seeing any training data. That is, you should add α to its count before you train your model. This assumption will affect the class-specific word distribution estimates. (Hint: you should see that α (and the adjusted counts) affects both the numerator and the denominator in the formula of your parameter estimates.)

Please finish the implementation of the function nb_train_smoothing in nb_spam.py with the assumption described above. nb_train_smoothing estimates the parameters of the naive Bayes model, similar to nb_train. The implementation details in the previous question apply to this question as well. Try α ∈ {1e-8, 1e-7, 1e-6, 1e-5, 1e-4}.
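
For intuition, the smoothed per-class word probability takes the usual add-α form; a minimal sketch (names assumed, with V denoting the vocabulary size) is shown below.

import math

def smoothed_word_log_prob(count_in_class, total_words_in_class, alpha, V):
    # Each of the V vocabulary words is assumed to have been seen alpha extra times in this
    # class, so alpha is added to the word's count and alpha * V to the class total.
    return math.log((count_in_class + alpha) / (total_words_in_class + alpha * V))

# e.g. alpha = 1e-6 with V = 158963 (the vocabulary size given above)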

What to do and submit: Finish the implementation of the function nb_train_smoothing in nb_spam.py. Then, run the script nb2.sh, which will generate an nb2.json file containing training, validation, and test accuracies for all values of α. You should see that, for the best α, all accuracies are above 98%. Add, commit, and push the modified nb_spam.py and nb2.json before the due date.
