$25
Many problems in bioinformatics amount to approximate string matching. For example, whenever a new gene is sequenced, searching known genes for close matches allows us to attempt to infer the new gene’s functions.
Part 1: Sequence Alignment
Specifically, a gene is a string over the alphabet Σ = {A,C,G,T}, and the closeness of two genes is measured by the extent to which they are aligned. An alignment of two strings, x and y, is an arrangement of their characters in columns, in the same orders as they originally appeared, potentially interspersed by spaces.
For example, given x = ACGAT and y = TACGCA, one possible alignment is:
A - - - C G A T T A C G C - - A
…where the ‘-’ character is used to indicate a space. Note that a typical further requirement is that no column may contain two spaces. Another possible alignment is:
- A C G - A T T A C G C A -
Of these two alignments, the second appears to be better — certainly, it contains more exact matches. The score of an alignment is specified by a (|Σ| +1) × (|Σ| +1) scoring matrix, δ. For example, the latter of the above alignments has score δ(-,T)+ δ(A,A)+ δ(C,C)+ δ(G,G)+ δ(-,C)+ δ(A,A)+ δ(T,-).
In your programming language of choice per Assignment 1, implement a dynamic programming algorithm to find the highest scoring alignment of two strings.
You may assume that both of the given strings will be non-empty and will contain only the characters ‘A’, ‘C’,
‘G’, or ‘T’. You may also assume that the scoring matrix will always contain exactly 5 rows and 5 columns, each in the order ‘A’, ‘C’, ‘G’, ‘T’, ‘-’, will always be symmetric about the diagonal, and will always provide integer scores.
For example, the above strings, along with a simple scoring matrix[1], would be represented as:
ACGAT TACGCA
x A C G T A 1 0 0 0 0 C 0 1 0 0 0 G 0 0 1 0 0
T 0 0 0 1 0
- 0 0 0 0 0
Since this scoring matrix only rewards exact matches, in this situation, the second of the above alignments would indeed be better than the first (and, in fact, is the highest scoring alignment overall).
Your program must accept as a command line argument the name of a file containing two strings and a scoring matrix as described above, then print to stdout the highest scoring alignment according to the following format:
· The first given string, assumed to be x, must be printed above the second given string, assumed to be y.
· The columns of the alignment must be space-separated, and ‘-’ characters[2] must be used to represent spaces within the aligned strings. For example:
$ ./compile.sh $ ./run.sh in1.txt
x: - A C G - A T y: T A C G C A -
Score: 4
You may further assume that the highest scoring alignment will be unique. Your program will be tested using diff, so its printed output must match exactly.
[1] of 2
[2] of 2