Starting from:

$30

CSCI3220-Assignment 1 Optimal Sequence Alignment Solved

In this assignment, you will first use dynamic programming to solve optimal sequence alignment problems, both manually and by writing a computer program. You will then study ways to skip some entries in the dynamic programming table that would not affect the final results.

 

 

 

Questions:

 

In Questions 1-2, we will use a scoring scheme of having +2 scores for a match, -1 score for a mismatch, and -2 scores for an indel without applying affine gap penalty.

 

1.       Fill in the following dynamic programming table to perform optimal global alignment between the sequences r=GTACC and s=CCTCC based on the scoring scheme stated above. Draw all arrows that lead to the score of each cell (i.e., only the “red arrows”). You can use the PowerPoint table template in the file CSCI3220_2018Fall_Assignment1_template.pptx if you want. Report the optimal alignment score and all alignments with this optimal score. When showing an alignment, indicate the names of the sequences clearly.  



Copyright © 2018 Kevin Yip                                                                                                          Page 1 of 5

 

 

 

2.       Repeat Question 1 but perform a local alignment instead of a global alignment. When showing an alignment, indicate the sequence names and the starting and ending positions of the aligned regions clearly.            



 

 

3.       Write a computer program in C, C++, Java or Python for performing optimal pairwise global alignments of DNA sequences using dynamic programming, without affine gap penalty. Your program should take the following inputs from stdin (not from a file) in the specified order, each occupying a different line:

 

i.   Score for a match (a positive integer) ii. Score for a mismatch (a negative integer) iii. Score for an indel (a negative integer) iv. First DNA sequence (a text string)

v. Second DNA sequence (a text string)

 

You do not need to check for errors in the inputs.

 

Your program should output to the screen (i.e., stdout) the highest alignment score and all optimal alignments in the following format:

 

<Best alignment score<line break

[<line break

<first sequence with gaps filled<line break

<second sequence with gaps filled<line break]

 

The part in square brackets repeats for each optimal alignment. During the traceback, whenever a table entry has multiple incoming arrows, the horizontal one should be the first to traceback,

2

followed by the diagonal one, and finally the vertical one. The resulting optimal alignments in the output should also follow this order. 

 

The non-comment portion of your program is expected to contain no more than 150 lines of code.

 

Here is a screen shot when a sample Java program called DP was run on an example from the lecture notes:

 

java DP

1

-1

-1

ATGCGT

ACGGCGT

3

 

A_TGCGT

ACGGCGT

 

AT_GCGT

ACGGCGT

 

ATG_CGT

ACGGCGT
 

Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether it follows the input/output formats as specified above, iii) its accuracy on a number of test cases and iv) whether the program is well-documented with appropriate comments added to explain the meaning of the code. In view of the large size of our class, for easy grading, you may get zero score if your program cannot be compiled or it does not conform to the input/output formats specified.

 

Do not forget to use your program to check your answer to Question 1

 

4. In Question 1, we found the optimal global alignment(s) by filling in the whole dynamic programming table. However, given the specific scoring scheme we considered (+2 for a match, -1 for a mismatch and -2 for an indel), it is actually not necessary to fill in the whole table.

 

(a)   List all the entries in the dynamic programming table that can never be involved in an optimal global alignment between any two length-5 sequences given the above scoring scheme. Explain why these entries can never be involved in an optimal global alignment. (Hint: Consider what score each table entry needs to achieve in order to be involved in an optimal alignment, and whether the entry can actually achieve this score.)

 

(b)           Based on your answer in Part a, explain how the dynamic programming algorithm can be modified to skip the table entries that can never be involved in an optimal global alignment, assuming that the table is still a 6´6 two-dimensional array. Give as much specific detail as possible.           

(c)   [Optional] Now we generalize the results in Part a to two sequences of any lengths |r|=n and |s|=m³n, and any reasonable scoring scheme without affine gap penalty. Specifically,

3

 

 

2.       Repeat Question 1 but perform a local alignment instead of a global alignment. When showing an alignment, indicate the sequence names and the starting and ending positions of the aligned regions clearly.            



 

 

3.       Write a computer program in C, C++, Java or Python for performing optimal pairwise global alignments of DNA sequences using dynamic programming, without affine gap penalty. Your program should take the following inputs from stdin (not from a file) in the specified order, each occupying a different line:

 

i.   Score for a match (a positive integer) ii. Score for a mismatch (a negative integer) iii. Score for an indel (a negative integer) iv. First DNA sequence (a text string)

v. Second DNA sequence (a text string)

 

You do not need to check for errors in the inputs.

 

Your program should output to the screen (i.e., stdout) the highest alignment score and all optimal alignments in the following format:

 

<Best alignment score<line break

[<line break

<first sequence with gaps filled<line break

<second sequence with gaps filled<line break]

 

The part in square brackets repeats for each optimal alignment. During the traceback, whenever a table entry has multiple incoming arrows, the horizontal one should be the first to traceback,

2

consider a scoring scheme with a match score of smatch0, a mismatch score of smismatch<0, and an indel score of sindel<smismatch. Derive a formula that determines whether the entry on the i-th row and j-th column (first row and first column counted as 1) in the dynamic programming table can never be involved in any optimal global alignment. Don’t forget to use the formula you obtain for this general case to verify your answer to the special case in

                Part a.                                                                                                                    

 

 

5. This question is about FASTA.

 

(a)            Build a lookup table for the 2-mers in sequence r=ACGCGTCA. The table should be sorted by the 2-mers.           

(b)           Now you are given another sequence s=CCGTGCAC. List all maximal diagonal runs of length 2 or more in the dot plot of r and s by recording the starting and ending positions in r and s without actually making the plot. A diagonal run is maximal if it is not contained by another diagonal run.      

(c)   Suppose lookup(kmer) is a function that takes a k-mer as input and returns the list of all its occurrence locations in r as output. Give the pseudocode of an algorithm that can produce the output of Part b using this function

More products