Assignment: Project 3 – DNA Sequencer
Value: 80 points
1. Overview
In this project you will:
· Implement a linked-list data structure,
· Use dynamic memory allocation to create new objects,
· Practice using C++ class syntax,
· Practice dynamically allocated arrays,
· Practice object-oriented thinking.
2. Background
Deoxyribonucleic acid (DNA) is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms. Most DNA molecules consist of two strands coiled around each other to form a double helix. The two DNA strands are termed polynucleotides since they are composed of simpler monomer units called nucleotides. The nucleotides for DNA are made up of four bases - adenine (A), guanine (G), cytosine (C), and thymine (T).
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. The four nucleotides (G, C, A, T) are paired. This means that if one strand of the DNA has a G, the other strand will have a C. If one strand has an A, the other strand will have a T (and vice-versa). If you have just one of the strands, you have the leading strand. If you have both strands, each pair of nucleotides is called a base pair. Base pairs will always be (A+T, T+A, G+C, or C+G).
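The pairing rule above can be sketched in a few lines of C++. This is only an illustration: the helper names Complement and BasePair are hypothetical and do not come from DNA.h.

```cpp
#include <string>

// Hypothetical helper (not part of DNA.h): returns the complementary base.
// Returns '?' for any character that is not A, T, G, or C.
char Complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'G': return 'C';
        case 'C': return 'G';
        default:  return '?';
    }
}

// Hypothetical helper: formats one base pair as "X-Y", e.g. "A-T".
std::string BasePair(char base) {
    std::string pair;
    pair += base;
    pair += '-';
    pair += Complement(base);
    return pair;
}
```

For example, BasePair('G') yields "G-C" and BasePair('T') yields "T-A".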
The genetic code is the set of rules by which information encoded within genetic material (DNA or mRNA sequences) is translated into proteins by living cells. The code defines how sequences of nucleotide triplets (trinucleotides), called codons, specify which amino acid will be added next during protein synthesis. With some exceptions, a three-nucleotide codon in a nucleic acid sequence specifies a single amino acid.
In simple terms, this means that we can look at three nucleotides (G, C, A, T) and identify the amino acid they encode. For example, the trinucleotide “CAA” equates to glutamine and “AAA” equates to lysine.
Note: In biology, you may have learned about mRNA and its function in the translation to an amino acid. However, for this project, we will not be dealing with the mRNA translation step (including using uracil instead of thymine).
3. Assignment Description
Your assignment is to build an application that can read a file of nucleotides into a linked list.
1. The file contains a sequence of nucleotides (G, C, A, T).
2. The number of nucleotides in the file will always be a multiple of 3 (because they are read as trinucleotides).
3. The file needs to be imported into a user made linked list.
4. The file is loaded via command line argument (included in the provided makefile and driver.cpp).
5. No static arrays or vectors are allowed in this project, although dynamically allocated arrays are allowed.
6. All user inputs will be assumed to be the correct data type. For example, if you ask the user for an integer, they will provide an integer.
7. Regardless of the sample output below, all user input must be validated for range. If you ask for a number between 1 and 5 and the user enters an 8, the user should be re-prompted.
8. Have a main menu that asks the user what they want to do:
a. What would you like to do?:
i. Display DNA (Leading Strand)
ii. Display DNA (Base Pairs)
iii. Inventory Basic Amino Acids
iv. Sequence Entire DNA Strand
v. Exit
1. Upon exit, nothing is saved
4. Requirements:
Initially, you will have to use the following files: DNA.h, Sequencer.h, makefile, driver.cpp, Translate_to_DNA.cpp, and a variety of test input files for the project. You can copy the files from Prof. Dixon’s folder at:
/afs/umbc.edu/users/j/d/jdixon/pub/cs202/proj3
To copy it into your project folder, just navigate to your project 3 folder in your home folder and use the command:
cp /afs/umbc.edu/users/j/d/jdixon/pub/cs202/proj3/*.* .
cp /afs/umbc.edu/users/j/d/jdixon/pub/cs202/proj3/makefile .
Notice the trailing period is required (it says copy it into this folder) and you need a second command to copy the makefile (*.* does not copy the makefile).
The Translate_to_DNA.cpp is a single function that belongs in DNA.cpp.
· The project must be completed in C++. You may not use any libraries or data structures that we have not learned in class. Libraries we have learned include <iostream>, <fstream>, <iomanip>, <vector>, <cstdlib>, <time.h>, <cmath> and <string>. You should only use namespace std.
· You must use the function prototypes as outlined in the DNA.h and Sequencer.h header files. Do not edit the header files.
· There are four test files available, named proj3_numSize.csv, where numSize is the number of nucleotides in the file. For example, proj3_60.csv has 60 nucleotides and proj3_15000.csv has 15,000 nucleotides. We provided test files with 9, 60, 3000, and 15000 nucleotides, but we should be able to test a file of any size (within reason).
· You need to write the functions for the class (DNA.cpp) based on the header file (DNA.h). The nucleotides (i.e. Nodes) for the linked list that you are implementing are structs that hold two pieces of information – a char and a pointer to the next node. Do not use the STL for this project.
o DNA() – The constructor creates a new empty linked list. m_head and m_tail are set to NULL and m_size is set to zero.
o ~DNA() – The destructor de-allocates any dynamically allocated memory. (May call clear)
o Clear() – Clears the linked list.
o InsertEnd() – Always inserts new nucleotides at the end of the linked list.
o Display() – Takes in a variable indicating how many strands you want to display. 1 shows just the nucleotides that were loaded. 2 shows the nucleotides and their complements: (G-C), (C-G), (T-A), or (A-T).
o IsEmpty() – Returns whether the linked list is empty.
o SizeOf() – Populates m_size of sequencer with how many nucleotides were loaded.
o NumAmino() – Takes in the name and trinucleotide codon of an amino acid (for example, Tryptophan and TGG, or Phenylalanine and TTT) and iterates over the structure to count the instances of that codon in just the provided strand. We never count overlapping codons: the sequence T-T-T-T-G-G has exactly 2 codons, (TTT) and (TGG), and a sequence that is 15,000 nucleotides long has exactly 5,000 trinucleotide codons. Run NumAmino on at least Tryptophan (TGG) and Phenylalanine (TTT).
o Sequence() – Iterates over entire structure and converts trinucleotides to amino acids for all nucleotides in the file. Stores the amino acid name in a dynamic array. Displays amino acid list.
o Translate() – Converts a trinucleotide string to an amino acid name. It is available for download in my folder above and is named: Translate_to_DNA.cpp. Provided below.
string DNA::Translate(const string trinucleotide){
  if((trinucleotide=="ATT")||(trinucleotide=="ATC")||
     (trinucleotide=="ATA"))
    return ("Isoleucine");
  else if((trinucleotide=="CTT")||(trinucleotide=="CTC")||
          (trinucleotide=="CTA")||(trinucleotide=="CTG")||
          (trinucleotide=="TTA")||(trinucleotide=="TTG"))
    return ("Leucine");
  else if((trinucleotide=="GTT")||(trinucleotide=="GTC")||
          (trinucleotide=="GTA")||(trinucleotide=="GTG"))
    return ("Valine");
  else if((trinucleotide=="TTT")||(trinucleotide=="TTC"))
    return ("Phenylalanine");
  else if((trinucleotide=="ATG"))
    return ("Methionine");
  else if((trinucleotide=="TGT")||(trinucleotide=="TGC"))
    return ("Cysteine");
  else if((trinucleotide=="GCT")||(trinucleotide=="GCC")||
          (trinucleotide=="GCA")||(trinucleotide=="GCG"))
    return ("Alanine");
  else if((trinucleotide=="GGT")||(trinucleotide=="GGC")||
          (trinucleotide=="GGA")||(trinucleotide=="GGG"))
    return ("Glycine");
  else if((trinucleotide=="CCT")||(trinucleotide=="CCC")||
          (trinucleotide=="CCA")||(trinucleotide=="CCG"))
    return ("Proline");
  else if((trinucleotide=="ACT")||(trinucleotide=="ACC")||
          (trinucleotide=="ACA")||(trinucleotide=="ACG"))
    return ("Threonine");
  else if((trinucleotide=="TCT")||(trinucleotide=="TCC")||
          (trinucleotide=="TCA")||(trinucleotide=="TCG")||
          (trinucleotide=="AGT")||(trinucleotide=="AGC"))
    return ("Serine");
  else if((trinucleotide=="TAT")||(trinucleotide=="TAC"))
    return ("Tyrosine");
  else if((trinucleotide=="TGG"))
    return ("Tryptophan");
  else if((trinucleotide=="CAA")||(trinucleotide=="CAG"))
    return ("Glutamine");
  else if((trinucleotide=="AAT")||(trinucleotide=="AAC"))
    return ("Asparagine");
  else if((trinucleotide=="CAT")||(trinucleotide=="CAC"))
    return ("Histidine");
  else if((trinucleotide=="GAA")||(trinucleotide=="GAG"))
    return ("Glutamic acid");
  else if((trinucleotide=="GAT")||(trinucleotide=="GAC"))
    return ("Aspartic acid");
  else if((trinucleotide=="AAA")||(trinucleotide=="AAG"))
    return ("Lysine");
  else if((trinucleotide=="CGT")||(trinucleotide=="CGC")||
          (trinucleotide=="CGA")||(trinucleotide=="CGG")||
          (trinucleotide=="AGA")||(trinucleotide=="AGG"))
    return ("Arginine");
  else if((trinucleotide=="TAA")||(trinucleotide=="TAG")||
          (trinucleotide=="TGA"))
    return ("Stop");
  else {
    // No codon matched: report it and fall back to "Unknown".
    cout << "returning unknown" << endl;
    return ("Unknown");
  }
}
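To make the linked-list behavior described for InsertEnd, Clear, and NumAmino more concrete, here is a minimal sketch. It is not the DNA class: the names Node, SketchList, Init, and CountCodon are hypothetical, the real declarations are in DNA.h, and your implementation must follow those prototypes. The sketch only shows the ideas of appending through a tail pointer, deleting every node, and counting codons three nodes at a time so that codons never overlap.

```cpp
#include <cstddef>
#include <string>

// Hypothetical node: one nucleotide plus a pointer to the next node.
struct Node {
    char payload;
    Node* next;
};

// Hypothetical stand-in for the list members (head, tail, size).
struct SketchList {
    Node* head;
    Node* tail;
    int size;
};

// Starts with an empty list.
void Init(SketchList& list) {
    list.head = NULL;
    list.tail = NULL;
    list.size = 0;
}

// Appends at the end in constant time by using the tail pointer.
void InsertEnd(SketchList& list, char nucleotide) {
    Node* node = new Node;
    node->payload = nucleotide;
    node->next = NULL;
    if (list.head == NULL) {
        list.head = node;        // first node is both head and tail
    } else {
        list.tail->next = node;  // link after the current tail
    }
    list.tail = node;
    ++list.size;
}

// Deletes every node and resets the list to empty.
void Clear(SketchList& list) {
    Node* current = list.head;
    while (current != NULL) {
        Node* next = current->next;  // save before deleting
        delete current;
        current = next;
    }
    Init(list);
}

// Counts non-overlapping occurrences of a codon by reading three nodes
// at a time from the start of the list.
int CountCodon(const SketchList& list, const std::string& codon) {
    int count = 0;
    const Node* current = list.head;
    while (current != NULL) {
        std::string triplet;
        for (int i = 0; i < 3 && current != NULL; ++i) {
            triplet += current->payload;
            current = current->next;
        }
        if (triplet == codon) {
            ++count;
        }
    }
    return count;
}
```

With the sequence T-T-T-T-G-G loaded, CountCodon finds one TTT and one TGG, matching the non-overlapping rule in the NumAmino description.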
· You need to code up the various functions that are called in the Sequencer.cpp file that are prototyped in Sequencer.h.
o Sequencer() – The constructor builds the DNA (linked list), reads the file, and calls MainMenu.
o ~Sequencer() – The destructor de-allocates any dynamically allocated memory.
o ReadFile()– The ReadFile function loads a file of nucleotides into the DNA (linked list). The file itself is passed to the ReadFile function from the command line (in driver.cpp which is provided). Also, calls SizeOf to populate m_size.
o MainMenu() – Calls the various functions in the DNA (linked list).
§ Choices 1 and 2 – call the DNA function Display.
§ Choice 3 – calls the DNA function NumAmino.
§ Choice 4 – calls the DNA function Sequence.
§ Choice 5 – exits.
5. Sample Input and Output
5.1. Sample Run
A normal run of the compiled code would look like this. User input is the number on the line following each menu:
-bash-4.1$ make run1
./proj3 proj3_9.csv
New Sequencer loaded
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
1
Base Pairs:
A
A
G
T
G
G
C
T
A
END
9 nucleotides listed.
3 trinucleotides listed.
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
2
Base Pairs:
A-T
A-T
G-C
T-A
G-C
G-C
C-G
T-A
A-T
END
9 base pairs listed.
3 trinucleotides listed.
What would you like to do?:
Here are the runs for Inventory Basic Amino Acids (3) and Sequence Entire DNA Strand (4):
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
3
Tryptophan: 1 identified
Phenylalanine: 0 identified
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
4
Amino Acid List:
Lysine
Tryptophan
Leucine
Total Amino Acids Identified: 3
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
5
DNA removed from memory
-bash-4.1$
Finally, this is what validating a menu entry looks like:
-bash-4.1$ make run1
./proj3 proj3_9.csv
New Sequencer loaded
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
0
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
6
What would you like to do?:
1. Display Sequencer (Leading Strand)
2. Display Sequencer (Base Pairs)
3. Inventory Basic Amino Acids
4. Sequence Entire DNA Strand
5. Exit
5
DNA removed from memory
-bash-4.1$
6. Compiling and Running
Because we are using a significant amount of dynamic memory for this project, you are required to manage any memory leaks that might be created. For a linked list, leaks are most commonly related to the dynamically allocated nodes. Remember that, in general, each item that is dynamically created should be deleted, typically in a destructor.
Test your program for memory leaks using the valgrind command:
valgrind ./proj3 proj3_60.csv
Note: If you accidentally use valgrind make run1, you may end up with some memory that is still reachable. Do not test this way; use the command above, which includes the input file.
If you have no memory leaks, you should see output like the following:
==5606==
==5606== HEAP SUMMARY:
==5606== in use at exit: 0 bytes in 0 blocks
==5606== total heap usage: 87 allocs, 87 frees, 10,684 bytes allocated
==5606==
==5606== All heap blocks were freed -- no leaks are possible
==5606==
==5606== For counts of detected and suppressed errors, rerun with: -v
==5606== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
The important part is “in use at exit: 0 bytes in 0 blocks,” which tells you that all the dynamic memory was deleted before the program exited. If you see anything other than “0 bytes in 0 blocks,” there is probably an error in one of your destructors. We will evaluate this as part of the grading for this project.
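As a rough illustration of why the destructor matters, the sketch below counts live nodes so you can see every new matched by a delete. TrackedNode, TrackedList, and s_live are hypothetical names used only for this example; they are not part of DNA.h, and the counter is a teaching aid rather than a replacement for valgrind.

```cpp
#include <cstddef>

// Hypothetical node that counts how many instances are currently alive.
struct TrackedNode {
    static int s_live;
    char payload;
    TrackedNode* next;
    explicit TrackedNode(char c) : payload(c), next(NULL) { ++s_live; }
    ~TrackedNode() { --s_live; }
};
int TrackedNode::s_live = 0;

// Hypothetical list whose destructor releases every node it allocated.
class TrackedList {
public:
    TrackedList() : m_head(NULL), m_tail(NULL) {}
    ~TrackedList() { Clear(); }

    void InsertEnd(char c) {
        TrackedNode* node = new TrackedNode(c);
        if (m_head == NULL) {
            m_head = node;
        } else {
            m_tail->next = node;
        }
        m_tail = node;
    }

    void Clear() {
        TrackedNode* current = m_head;
        while (current != NULL) {
            TrackedNode* next = current->next;  // save before deleting
            delete current;
            current = next;
        }
        m_head = NULL;
        m_tail = NULL;
    }

private:
    // Copying is disabled in this sketch so two lists never delete the same nodes.
    TrackedList(const TrackedList&);
    TrackedList& operator=(const TrackedList&);

    TrackedNode* m_head;
    TrackedNode* m_tail;
};
```

If a TrackedList is created inside a block and three nodes are inserted, TrackedNode::s_live is 3 inside the block and returns to 0 once the list goes out of scope. If the destructor did not call Clear, the counter would stay at 3, which is the kind of leak valgrind reports as bytes still in use at exit.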
Additional information on valgrind can be found here: http://valgrind.org/docs/manual/quick-start.html
Once you have compiled using your makefile, enter the command ./proj3 proj3_9.csv to run your program. You can use make run1, make run2, make run3, and make run4 to test each of the input files, which have differing sizes. If your executable is not named proj3, you will lose points. The output should look like the sample output provided above.
7. Completing your Project
When you have completed your project, you can copy it into the submission folder. You can copy your files into the submission folder as many times as you like (before the due date). We will only grade what is in your submission folder.
For this project, you should submit these files to the proj3 subdirectory:
driver.cpp – should be unchanged.
DNA.h – should be unchanged.
DNA.cpp – should include your implementations of the class functions.
Sequencer.h – should be unchanged.
Sequencer.cpp – should include your implementations of the class functions.
a. As you should have already set up your symbolic link for this class, you can just copy the files listed above to the submission folder.
b. cd to your project 3 folder. An example might be cd ~/202/projects/proj3
c. cp driver.cpp DNA.h DNA.cpp Sequencer.h Sequencer.cpp ~/cs202proj/proj3
You can check to make sure that your files were successfully copied over to the submission directory by entering the command
ls ~/cs202proj/proj3
You can check that your program compiles and runs in the proj3 directory, but please clean up any .o and executable files. Again, do not develop your code in this directory and you should not have the only copy of your program here.
For additional information about project submissions, there is a more complete document available in Blackboard under “Course Materials” and “Project Submission.”
IMPORTANT: If you want to submit the project late (after the due date), you will need to copy your files to the appropriate late folder. If you can no longer copy the files into the proj3 folder, it is because the due date has passed. You should be able to see your proj3 files, but you can no longer edit or copy files into your proj3 folder. (They will be read-only.)
· If it is 0-24 hours late, copy your files to ~/cs202proj/proj3-late1
· If it is 24-48 hours late, copy your files to ~/cs202proj/proj3-late2
· If it is more than 48 hours late, it is too late to be submitted.