Starting from:

$30

CSCI3220-Assignment 3 Solved

In this assignment, you will use different ways to compute sequence similarity based on g-gapped kmers, to experience their relative. You will then perform some probability derivations using the three basic rules, and solve some generic problems. Finally, you will implement the forward algorithm for computing data likelihoods according to a first-order hidden Markov model, in a way that your program can handle any type of sequences. 

 

 

Questions:

 

1.       This question is about g-gapped k-mers. Suppose we want to compute a similarity score between DNA sequences s1=GCGTCCGAC and s2=CGACGCGAC based on 1-gapped 3-mers (i.e., g=1, k=3).

 

(a)            How many different 1-gapped 3-mer patterns are there for DNA sequences? Show the steps in your calculation.       

(b)   List all the 1-gapped 3-mers actually supported by s1 and s2 and their occurrence counts in the two sequences by filling in the following tables (add more rows if needed). The tables should be sorted in ascending order of the 1-gapped 3-mers, where the wildcard character * is ordered before the four nucleotides.    Table for s1         Table for s2

1-gapped 3-mer
No. of occurrence
 
1-gapped 3-mer
No. of occurrence
 
 
 
 
                                                       

 

(c)   Based on your tables in Part b, compute the similarity between the two sequences, defined as the inner product of the two 1-gap 3-mer occurrence vectors. Show clearly the common

                1-gap 3-mers of the two sequences in your calculations.                                     

 

(d)              Now complete the following table listing the number of mismatches among the 4-mers in the two sequences. The row and column headers should be the 4-mers present in s1 and s2, respectively, both sorted ascendingly. If a 4-mer appears multiple times in a sequence, repeat it that number of times in the row/column headers.

 

(e)            Based on your result in Part d, compute the similarity score between s1 and s2 again. Show the steps in your calculation. Do not forget to compare with your result in Part c.  

(f)            In practice, it is not easy to decide which values of g and k to use. One possible approach is to try multiple combinations of these values, and use some independent knowledge to evaluate which combination leads to the best similarity scores. Propose two ideas for reducing the total computational time as compared to repeating the above calculations for each combination of g and k values.        

2.       This question is about statistical modeling. Suppose Y is a binary variable, and we want to construct a model of Y based on binary features X1 and X2, which may not be conditionally independent.

 

(a)   Derive a formula for computing Pr(Y=0|X1=1,X2=1) in terms of Pr(X1=1), Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1). Show your derivations step by step. (6%)

 

(b)   [Optional] In general, with three variables X1, X2 and Y, there are various possible probabilities of the form Pr(VL=vl|VR=vr), where VL and VR are two disjoint ordered sets of variables (and VR can be empty), and vl and vr are their corresponding values. For example, in the case of Pr(X1=1), VL=(X1), vl=(1), and VR= Æ. In the case of Pr(Y=0|X1=1,X2=1), VL=(Y), vl=(0), VR=(X1,X2), and vr=(1,1).

 

Propose an algorithm that can take one of these probabilities as target (such as Pr(Y=0|X1=1,X2=1)) and some of these probabilities as input (such as Pr(X1=1), Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1)), and determine whether the input is sufficient for inferring the value of the target.           

 

(c)            Now assume X1 and X2 are independent when conditioned on Y. Give a formal definition of this assumption using mathematical symbols.

(d)           Depending on the values of X1, X2 and Y, there are 8 probabilities of the form Pr(X1,X2|Y). Given an example set of values for these 8 probabilities such that X1 and X2 are not independent when conditioned on Y. Prove that the two variables are not conditionally independent.            

(e)            If two variables X1 and X2 are independent when conditioned on Y, is it guaranteed that they are also unconditionally independent? Explain why or why not.

(f)  If two variables X1 and X2 are unconditionally independent, is it guaranteed that they are also independent when conditioned on Y? Explain why or why not

 

3.       Write a computer program called Forward in C, C++, Java or Python that implements the forward algorithm to compute the data likelihood of a given sequence based on a first-order hidden Markov model. This program should be able to handle any number of states and any number of emission symbols. Your program should take the following inputs in the specified order, each on a different line: • The number of states (an integer)

•        The initial probabilities of the states (floating-point numbers, one per line), in the order of p1, p2, …

•        The transition probabilities among the states (floating-point numbers, one per line), in the order of p11, p12, …, p21, p22, …

•        The number of emission symbols (an integer)

•        The emission symbols (characters, one per line), in the order of b1, b2, … 

•        The emission probabilities (floating-point numbers, one per line), in the order of e1(b1), e1(b2), …, e2(b1), e2(b2), …

•        The sequence the likelihood of which is to be computed (a string)

 

You can assume the input data are properly formatted and do not need to check for errors.

 

The output should be a single floating-point number of the likelihood value. Due to the inexact nature of floating-point numbers, when we check your answers a small amount of leeway may be considered.

 

The non-comment portion of your program is expected to contain no more than 200 lines of code.

 

Here is an expected screen shot when a Java program is run on an example from the lecture notes:

 

java Forward

2

0.5

0.5

0.9

0.1

0.8

0.2

4

A

C

G

T

0.5

0.5 0

0

0.25

0.75 0

0

CAC

0.15156250000000002
 

Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether it follows the input/output formats as specified above, iii) its accuracy on a number of test cases and iv) whether the program is well-documented with appropriate comments added to explain the meaning of the code.       

More products