1 Hidden Markov Models
[Recommended maximum time spent: 2 hours]
In this problem, we want to fit DNA sequence data with a generative model. In particular, we assume that they are generated by a hidden Markov model (HMM). Let X1:N = [X1X2 ...XN] be random variables corresponding to a DNA sequence of length N, controlled by hidden states Z1:N = [Z1Z2 ...ZN]. Each Xn takes a value in {A,C,G,T} and each Zn takes one of the two possible states {s1,s2}. This HMM has the following parameters Θ = {πi,aij,bik} for i ∈ {1,2}, j ∈ {1,2}, and k ∈ {A,C,G,T}:
• Initial state distribution πi for i = 1,2:
π1 = P(Z1 = s1) = 0.7; π2 = P(Z1 = s2) = 0.3. (1)
• Transition probabilities aij = P(Zn+1 = sj|Zn = si) for any n ∈ N+, i = 1,2, and j = 1,2:
a11 = 0.8, a12 = 0.2, a21 = 0.4, a22 = 0.6. (2)
• Emission probabilities bik = P(Xn = k|Zn = si) for any n ∈ N+,i = 1,2 and k ∈ {A,C,G,T}:
b1A = 0.4, b1C = 0.1, b1G = 0.4, b1T = 0.1; (3)
b2A = 0.2, b2C = 0.3, b2G = 0.2, b2T = 0.3. (4)
We observe a sequence O1:6 = [o1o2 ...o6] = [AGCGTA]. Please answer the following questions with step-by-step computations.
Q1.1 Probability of an observed sequence Compute P(X1:6 = O1:6;Θ).
Q1.2 Most likely explanation Compute z∗ = argmaxz1:6 P(Z1:6 = z1:6|X1:6 = O1:6;Θ).
Q1.3 Prediction Compute x∗ = argmaxx P(X7 = x|X1:6 = O1:6;Θ).
What to submit: Mathematical derivations and numerical results for each of the problems above. You should derive them manually.
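While the derivations themselves must be done by hand, a short script along the following lines can be used to sanity-check your final numbers. This is only a check written directly from the parameters in Eqs. (1)-(4); it is not part of the provided code, and the helper names are made up for illustration.

```python
import numpy as np

# Model parameters from Eqs. (1)-(4).
pi = np.array([0.7, 0.3])                      # initial distribution over {s1, s2}
A = np.array([[0.8, 0.2],                      # a_ij = P(Z_{n+1}=s_j | Z_n=s_i)
              [0.4, 0.6]])
B = np.array([[0.4, 0.1, 0.4, 0.1],            # b_ik = P(X_n=k | Z_n=s_i), k in A,C,G,T
              [0.2, 0.3, 0.2, 0.3]])
idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
O = [idx[c] for c in "AGCGTA"]

# Q1.1: forward algorithm, P(X_{1:6} = O) = sum_j alpha_6(j).
alpha = pi * B[:, O[0]]
for o in O[1:]:
    alpha = (alpha @ A) * B[:, o]
print("P(O) =", alpha.sum())

# Q1.2: Viterbi algorithm for the most likely state path.
delta = np.log(pi) + np.log(B[:, O[0]])
back = []
for o in O[1:]:
    scores = delta[:, None] + np.log(A)        # scores[i, j]: best path ending in i, then i -> j
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + np.log(B[:, o])
path = [delta.argmax()]
for bp in reversed(back):
    path.append(bp[path[-1]])
print("z* =", [f"s{i + 1}" for i in reversed(path)])

# Q1.3: predictive distribution of X_7 given the observation.
gamma6 = alpha / alpha.sum()                   # P(Z_6 | O)
p_x7 = (gamma6 @ A) @ B                        # P(X_7 = k | O) for each symbol k
print("x* =", "ACGT"[p_x7.argmax()])
```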
Programming component
2 High-level descriptions
2.1 Datasets
In Sect. 3, we will play with a small hidden Markov model. The parameters of the model are given in hmm_model.json. In Sect. 4, you will perform dimensionality reduction and visualize a subset of MNIST: 784-dimensional feature vectors in mnist2500_X.txt and their corresponding labels in mnist2500_labels.txt.
2.2 Tasks
You will be asked to implement hidden Markov model (HMM) inference procedures and the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Specifically, you will
• For Sect. 3: Finish implementing the functions forward, backward, seqprob_forward, seqprob_backward, and viterbi. Refer to hmm.py for more information.
• For Sect. 4: Finish implementing the functions pca, compute_Q, and compute_gradient. Refer to tsne.py for more information.
• Run the scripts hmm.sh, pca.sh, and tsne.sh after you finish your implementation. hmm.sh will output hmm.txt. pca.sh will output pca.npy. tsne.sh will show a visualization of the dataset and will also output Y.npy.
• Add, commit, and push hmm.py and tsne.py, and the files that you have generated.
As in the previous homework, you are not responsible for loading/pre-processing data.
2.3 Cautions
Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.
3 Hidden Markov Models
In this problem, you are given the parameters of a small hidden Markov model and you will implement three inference procedures, similar to what you have done manually in the first question. In hmm_model.json, you will find the following model parameters:
• π: the initial probabilities, πi = P(Z1 = si);
• A: the transition probabilities, with aij = P(Zt = sj|Zt−1 = si);
• B: the observation probabilities, with bik = P(Xt = ok|Zt = si).
Now we observe a sequence O of length L. Your task is to write code to compute probabilities and to infer the possible hidden states given this observation. In particular, in Q3.1 and Q3.2 we want to compute P(x1,...,xL = O), the probability of observing the sequence; you should use the forward algorithm and the backward algorithm to achieve that. Then in Q3.3, we infer the most likely state path. Note that your code might be tested against different parameter sets/observation sequences at grading time.
Q3.1 Please finish the implementation of the functions forward() and backward() in hmm.py:
• forward(π,A,B,O) takes the model parameters π,A,B and the observation sequence O as input and outputs a numpy array α, where α[j,t] = αt(j) = P(Zt = sj, x1:t).
• backward(π,A,B,O) takes the model parameters π,A,B and the observation sequence O as input and outputs a numpy array β, where β[j,t] = βt(j) = P(xt+1:L | Zt = sj).
Please follow slides 35 and 36 in lec18.pdf for your implementation; a rough sketch of the two recursions is given below.
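The sketch below assumes the observation sequence O has already been converted to integer symbol indices and that α and β are stored as S × L arrays indexed as α[j, t]; the exact argument names and conventions expected by the skeleton in hmm.py may differ.

```python
import numpy as np

def forward(pi, A, B, O):
    """alpha[j, t] = P(Z_t = s_j, x_{1:t}); shapes: pi (S,), A (S,S), B (S,K), O (L,)."""
    S, L = len(pi), len(O)
    alpha = np.zeros((S, L))
    alpha[:, 0] = pi * B[:, O[0]]                  # base case: alpha_1(j) = pi_j * b_j(x_1)
    for t in range(1, L):
        alpha[:, t] = (A.T @ alpha[:, t - 1]) * B[:, O[t]]
    return alpha

def backward(pi, A, B, O):
    """beta[j, t] = P(x_{t+1:L} | Z_t = s_j)."""
    S, L = len(pi), len(O)
    beta = np.zeros((S, L))
    beta[:, L - 1] = 1.0                           # base case: empty future has probability 1
    for t in range(L - 2, -1, -1):
        beta[:, t] = A @ (B[:, O[t + 1]] * beta[:, t + 1])
    return beta
```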
Q3.2 Now we can calculate P(x1,...,xL = O) from the output of your forward and backward algorithms. Please finish the implementation of the functions seqprob_forward() and seqprob_backward() in hmm.py. Both of them should return P(x1,...,xL = O).
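As a sketch, assuming the α and β arrays from Q3.1 are available (the actual signatures in hmm.py may differ), the two quantities reduce to simple sums:

```python
def seqprob_forward(alpha):
    # P(O) = sum_j alpha_L(j): sum the last column of the forward array.
    return alpha[:, -1].sum()

def seqprob_backward(beta, pi, B, O):
    # P(O) = sum_j pi_j * b_j(x_1) * beta_1(j): weight the first column of the backward array.
    return (pi * B[:, O[0]] * beta[:, 0]).sum()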
Q3.3 We want to compute the most likely state path that corresponds to the observation O.
Namely,
k∗ = argmaxk P(sk1, sk2, ··· , skL | x1, x2, ··· , xL = O).
Please implement the Viterbi algorithm in viterbi() in hmm.py. The function viterbi(π,A,B,O) takes the model parameters π,A,B and the observation sequence O as input and outputs a list path which contains the most likely state path k∗ (in terms of state indices) that you have found.
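A minimal sketch of the recursion, working in log space to avoid underflow; the skeleton in hmm.py may expect a different argument order or plain probabilities, so adapt accordingly.

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Return the most likely state path (list of state indices) for observation O."""
    S, L = len(pi), len(O)
    delta = np.log(pi) + np.log(B[:, O[0]])        # best log-probability of a path ending in each state
    backptr = np.zeros((S, L), dtype=int)          # backptr[j, t]: best predecessor of state j at time t
    for t in range(1, L):
        scores = delta[:, None] + np.log(A)        # scores[i, j]: come from state i, move to state j
        backptr[:, t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, O[t]])
    path = [int(delta.argmax())]                   # best final state
    for t in range(L - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))     # follow back-pointers to time 1
    return path[::-1]
```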
What to do and submit: After you finish each task in Q3.1/3.2 and Q3.3 in hmm.py, run the script hmm.sh. It will generate hmm.txt. Add, commit, and push both hmm.py and hmm.txt before the due date.
4 Dimensionality Reduction
In this question, you will implement t-Distributed Stochastic Neighbor Embedding (t-SNE) [1], a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. t-SNE maps high-dimensional data points x1,x2,...,xN into two- or three-dimensional embeddings y1,y2,...,yN that can be displayed in a scatter plot. Unlike PCA, which you learned in class, t-SNE is a non-linear dimensionality reduction technique. It is widely used (more than 3,300 citations and counting). If you are curious about how to use it effectively, check this out: https://distill.pub/2016/misread-tsne/.
Intuitively, we want the low-dimensional data points to reflect the similarities of their corresponding data points in the high-dimensional space. In other words, after applying the mapping, similar data points are still near each other and dissimilar data points are still far apart. In t-SNE, this is achieved in two steps. First, t-SNE constructs a joint probability distribution P over pairs of high-dimensional data points in such a way that similar points have a high probability of being picked, whilst dissimilar points have an extremely small probability. Second, t-SNE similarly defines a joint probability distribution Q over pairs of low-dimensional data points and then minimizes the Kullback-Leibler (KL) divergence between P and Q. You can think of the KL divergence as a measure of how one probability distribution diverges from another.
Formally, let pij = P(xi,xj) and qij = Q(xi,xj). Then, we minimize
C = KL(P ∥ Q) = Σi Σj pij log(pij / qij), (5)
where we define each of the above terms below. Please refer to the original paper [1] for justifications of their choice of probability distributions as well as their choice of objective.
We define the pairwise similarities pij between xi and xj in the high-dimensional space as
pij = (pj|i + pi|j) / (2N), where pj|i = exp(−∥xi − xj∥² / (2σi²)) / Σk≠i exp(−∥xi − xk∥² / (2σi²)), (6)
where σi2 is the variance of a Gaussian distribution that is centered at xi. The conditional probability pj|i could be interpreted as the probability of xi picking xj as its neighbor.
Moreover, we define the pairwise similarities qij between yi and yj in the low-dimensional space with a Student t-distribution with one degree of freedom:
qij = (1 + ∥yi − yj∥²)^−1 / Σk≠l (1 + ∥yk − yl∥²)^−1. (7)
Finally, we set both pii and qii to zero, as we are only interested in pairwise similarities.
We use gradient descent to minimize the cost function in Eq. (5). The gradient is given by
∂C/∂yi = 4 Σj (pij − qij)(yi − yj)(1 + ∥yi − yj∥²)^−1. (8)
(Can you verify that this is indeed the formula for the gradient?)
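A sketch of one way to verify it: write Eij = (1 + ∥yi − yj∥²)^−1 and Z = Σk≠l Ekl, so that qij = Eij/Z. Since the pij are constants that sum to one, and each pair appears twice (as (i,j) and (j,i)), one gets:

```latex
% Verification sketch for Eq. (8), with E_{ij} = (1 + \|y_i - y_j\|^2)^{-1}, Z = \sum_{k \neq l} E_{kl}.
\frac{\partial E_{ij}}{\partial y_i} = -2\,E_{ij}^{2}\,(y_i - y_j), \qquad
\frac{\partial \log Z}{\partial y_i} = \frac{1}{Z}\sum_{k \neq l}\frac{\partial E_{kl}}{\partial y_i}
   = -4\sum_{j} q_{ij}\,E_{ij}\,(y_i - y_j),
% and therefore (the factor 2 in the middle term counts the symmetric pairs (i,j) and (j,i)):
\frac{\partial C}{\partial y_i}
   = -\sum_{k \neq l} p_{kl}\,\frac{\partial \log q_{kl}}{\partial y_i}
   = -2\sum_{j} p_{ij}\,\frac{\partial \log E_{ij}}{\partial y_i} + \frac{\partial \log Z}{\partial y_i}
   = 4\sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)\,\bigl(1 + \|y_i - y_j\|^{2}\bigr)^{-1}.
```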
Q4.1 We first apply PCA to our very high-dimensional data points. In practice, this helps speed up the computation and reduces noise in the data points before we apply t-SNE. Finish the implementation of PCA in the function pca in tsne.py to map data points of dimensionality 784 down to 50.
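A minimal sketch of the kind of PCA step this calls for (center the data, take the top-d eigenvectors of the covariance matrix, and project), assuming an N × 784 data matrix; the function in tsne.py may prescribe a different signature or sign convention for the principal directions.

```python
import numpy as np

def pca(X, d=50):
    """Project an N x D data matrix X onto its top-d principal components (here D = 784, d = 50)."""
    X = X - X.mean(axis=0)                           # center each feature
    cov = X.T @ X / X.shape[0]                       # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues for symmetric cov
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # top-d eigenvectors, shape D x d
    return X @ top                                   # N x d projection
```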
What to do and submit: Finish the implementation of the function pca. Run the script pca.sh. It will output a file pca.npy. Add, commit, and push pca.npy and your modified tsne.py before the due date.
Q4.2 After PCA, finish the implementation of the t-SNE algorithm. In tsne.py, you are given the code that computes pij, and you need to finish the code that computes qij in the function compute_Q. You may need to refer to Eq. (7).
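A minimal sketch of the computation in Eq. (7), assuming Y is the N × 2 embedding; the actual compute_Q in tsne.py may use different argument and return conventions.

```python
import numpy as np

def compute_Q(Y):
    """Low-dimensional affinities q_ij from Eq. (7); Y is the N x 2 embedding."""
    sq_dists = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)  # ||y_i - y_j||^2, shape N x N
    E = 1.0 / (1.0 + sq_dists)                     # Student-t kernel with one degree of freedom
    np.fill_diagonal(E, 0.0)                       # enforce q_ii = 0
    Q = E / E.sum()                                # normalize over all pairs k != l
    return Q, E                                    # the unnormalized kernel E is reused in Eq. (8)
```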
What to do and submit: Finish the implementation of the function compute_Q. Add, commit, and push your modified tsne.py before the due date.
Q4.3 Compute the gradient using Eq. (8) in the function compute_gradient in tsne.py. Finally, running the script tsne.sh will show a cool t-SNE visualization of the data in the 2-dimensional space.
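A minimal sketch of Eq. (8) in vectorized form, assuming P, Q, the unnormalized kernel E = (1 + ∥yi − yj∥²)^−1, and the embedding Y are available as numpy arrays; the actual compute_gradient in tsne.py may expect different arguments.

```python
import numpy as np

def compute_gradient(P, Q, E, Y):
    """Gradient of Eq. (8): dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    W = (P - Q) * E                                # N x N weights (p_ij - q_ij) * E_ij
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y  # row i equals 4 * sum_j W_ij (y_i - y_j)
    return grad
```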
What to do and submit: Finish the implementation of the function compute_gradient. Run the script tsne.sh. It will output a file Y.npy. Add, commit, and push Y.npy and your modified tsne.py before the due date.