$30
II. Project Description
A. Algorithm Description symbolsConsider the sets๐ฅ1, ๐ฅ2 , . , ๐ฅ๐ and ๐ consistsand๐ of the sequence of symbols๐ as representing the different๐ ๐ฆ1, ๐ฆ2 , . , ๐ฆ๐.
Suppose we are given two strings and , where consists of the sequence of
positions in the strings{1, 2,๐ and. , ๐๐}, and consider{1, 2, . ,a matching of these sets; Recall that๐}
a matching is a set of ordered pairs with the property that each item occurs in at are no “crossing” pairs: if ๐ ฯต ๐ and ๐ < ๐' , then ๐ < ๐'. Intuitively, most one pair. We say that a matching of these two sets is an alignment if there an alignment gives a way of lining up the two strings, by telling us which pairs of(๐, ๐) (๐', ๐') positions will be lined up with one another.
Our definition of similarity will be based on finding the optimal alignment
between and , according to the following criteria. Suppose is a given alignment between and :
1. First,position๐ thereof๐ ๐isora parameterthat๐ is not matchedδ๐ > 0 thatin defines— it isaa gapgap —penalty.๐we incurForaeachcost
of δ. ๐ ๐ ๐
2. Second, for each pair of letters in our alphabet, there is a mismatch
costappropriateof α๐๐ formismatchlining upcost๐ with๐๐for,.๐Thus,liningforupeachwith(๐, ๐) ฯต. ๐One, wegenerallypay the
assumes that = 0 for eachα๐ฅletter๐๐ฆ๐ —there is no๐ฅ๐ mismatch๐ฆ๐ cost to line up a letter with anotherα๐๐ copy of itself—although๐ this will not be necessary in anything that follows. alignment of minimum cost.๐
3.The cost of is the sum of its gap and mismatch costs, and we seek an
B. Input string Generator
The input to the program would be a text file containing the following information:
1.2. First base string (Next lines consist of indices after which the๐ 1) copy of the previous string needs to be inserted in the cumulative string. (eg given below)๐ 3.4. Second base string (Next lines consist of indices after which the๐ 2) copy of the previous string needs to be inserted in the cumulative string. (eg given below)๐
This information would help generate 2 strings from the original 2 base strings. This file could be used as an input to your program and your program could use the base strings and the rules to generate the actual strings. Also note that the
numbers and correspond to the first and the second string respectively. Make length and similarly,2๐ * ๐๐๐(๐ 2) need not be equal to . ๐ ๐
sure you validate the length of the first and the second string to beand ๐ ๐ . Please note that the base strings need not have to be of equal2๐ * ๐๐๐(๐ 1)
ACTG
3
6
1
TACG
1
2
9
Using the above numbers, the generated strings would be
ACACTGACTACTGACTGGTGACTACTGACTGG and
TATTATACGCTATTATACGCGACGCGGACGCG
Following is the step by step process on how the above strings are generated.
ACTG
ACTGACTG
ACTGACTACTGACTGG
ACACTGACTACTGACTGGTGACTACTGACTGG
TACG
TATACGCG
TATTATACGCGACGCG
TATTATACGCTATTATACGCGACGCGGACGCG
C. Values for Delta and Alphas
Values for α’s are as follows. δ๐ is equal to 30.
A
C
G
T
A
0
110
48
94
C
110
0
118
48
G
48
118
0
110
T
94
48
110
0
D. Programming/Scripting Languages
Following are the list of languages which could be used:
1. C
2. C++
3. Java
4. Python
5. Python3
E. Bounds
1. Basic Algorithm
0 <= ๐, ๐ <= 10
2. 1Memory Efficient Algorithm<= ๐๐๐๐ *(๐ ๐๐๐1), (๐๐๐๐ 1)(,๐ 22)๐ *<=๐๐๐2000(๐ 2) <= 2000
1 <= 2
0 <= ๐, ๐ <= 20
1 <= ๐๐๐๐ *(๐ ๐๐๐1), (๐๐๐๐ 1)(,๐ 22)๐ *<=๐๐๐20000(๐ 2) <= 20000 1 <= 2
III. Goals
Following are the goals to achieve for your project
A. Your program should take 2 arguments
1. input file path
2. output file path (If path is valid and file not found, your program should create it)
`python2 basic_2.py input.txt output.txt` `java Basic input.txt output.txt`
`python3 basic_3.py input.txt output.txt`
Note: As mentioned in Part II-B input file will have data to generate strings. Since Gap penaltythem in your program.(δ๐) and Mismatch penalty (α๐๐) are FIXED, you have to hardcode You are not allowed to use any libraries.
B. Implement the Dynamic Programming algorithm. Your program should print the following information at the respective lines in output file:
1. Cost of the alignment (Integer)
2. First string alignment ( Consists of A, C, T, G, _ (gap) characters)
3. Second string alignment ( Consists of A, C, T, G, _ (gap) characters )
4. Time in Milliseconds (Float)
5. Memory in Kilobytes (Float)
Note: There can be multiple alignments which have the same cost. You can print ANY alignment generated by your program. The only condition is it should have a minimum cost.
e.g. For strings ๐ 1: A and ๐ 2: C, alignments A_, _C and _A, C_ both have alignment cost 60 which is minimum. You can print any one of them.
C. Implement the memory efficient version of this solution and repeat the tests in Part B.
D. Plot the results of Part B and Part C using a line gr