CS262 Assignment 2 - DNA Sequencing



Introduction
For this assignment, you will be simulating the manipulation of the nucleotides in strands of DNA.  You do not need to know much about DNA to do this assignment, but if you would like to learn more about the process, this webpage has more information.

Background
A nucleotide base is a fundamental unit of the genetic code found in a DNA strand (RNA will be saved for a future project). These bases are represented by four letters:

A for adenine
C for cytosine
G for guanine
T for thymine
A chain of these nucleotide bases forms a nucleic acid, which contains information used by living cells to construct proteins, which in turn carry out the functions that living organisms need to survive and reproduce.  The representation of these chains can be quite long.  An example of such chains can be found here.  As part of genetic engineering, these chains are cut, and a sequence from another organism is inserted, or perhaps a section of the chain is excised (removed) from the larger chain.

The program you write will simulate the splicing of chains of nucleotide bases. Since there are four different letters, each letter can be represented by two bits:

A  00
C  01
G  10
T  11
So, for example, the nucleotide sequence GATTACA can be represented by the base-4 digit sequence 2033010 (which is decimal 9156, or binary 10 00 11 11 00 01 00).
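The arithmetic in the GATTACA example can be checked with a few lines of C. This is only a sketch: base_code and encode_small are hypothetical helper names, and packing a whole sequence into one integer is done here purely to reproduce the number 9156, not as the storage format used by the assignment.

```c
#include <stddef.h>

/* Map one nucleotide letter to its 2-bit code (A=0, C=1, G=2, T=3).
   Returns -1 for any other character. */
int base_code(char letter)
{
    switch (letter) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Hypothetical helper: pack a short sequence (at most 16 letters) into
   one unsigned long, first letter in the highest-order position.
   Stops at the first character that is not A, C, G, or T. */
unsigned long encode_small(const char *seq)
{
    unsigned long value = 0;
    size_t i;

    for (i = 0; i < 16 && seq[i] != '\0'; i++) {
        int code = base_code(seq[i]);
        if (code < 0)
            break;
        value = (value << 2) | (unsigned long)code;  /* append two bits */
    }
    return value;
}
```

Calling encode_small("GATTACA") yields 9156, matching the binary pattern 10 00 11 11 00 01 00 given above.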

You will be using bytes (unsigned char) to hold the nucleotide bases.  Each byte can hold up to four letters.  You will use bitwise operators (shifts, bitwise &, |, ^, ~) to manipulate the bits in a byte to add, move, or change nucleotide letters.



Specifications

Your program will use the following structure to hold a nucleotide chain:

#define MAX_CHAIN_BYTES  100

typedef struct _Chain
{
    size_t SeqLen;  // Number of letters in sequence
    unsigned char Sequence[MAX_CHAIN_BYTES];
} Chain;

With this structure, a sequence can hold up to 400 letters of a DNA sequence (four letters per byte times MAX_CHAIN_BYTES bytes).  Letters are stored in order from the lowest-indexed byte of the Sequence array to the highest.  Within a single byte, however, the letters are ordered from the Most Significant Bit (MSB, the leftmost bit) to the Least Significant Bit (LSB, the rightmost bit).  In other words, if a sequence contains just two letters, CT, then the representation of this sequence will be found in the first byte of the Sequence array (Sequence[0]) and will have the bit pattern 0111xxxx, where each x can be either 0 or 1 because those bits are not part of the sequence.  (Note: Even though the unused bits can be either 0 or 1, you may find it advantageous to set them to 0.)
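The byte layout described above can be sketched with two small accessors. The names set_code and get_code are hypothetical, and the sketch assumes the caller keeps idx below 4 * MAX_CHAIN_BYTES; it is one possible way to apply the shift and mask operators, not a required design.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHAIN_BYTES  100

typedef struct _Chain
{
    size_t SeqLen;  /* Number of letters in sequence */
    unsigned char Sequence[MAX_CHAIN_BYTES];
} Chain;

/* Hypothetical helper: store a 2-bit code (0-3) as letter number idx
   (0-based). Letter 0 of each byte occupies the two most significant
   bits, so the shift amounts are 6, 4, 2, 0. */
void set_code(Chain *c, size_t idx, unsigned code)
{
    size_t byte = idx / 4;
    unsigned shift = 6 - 2 * (unsigned)(idx % 4);

    c->Sequence[byte] &= (unsigned char)~(3u << shift);         /* clear slot */
    c->Sequence[byte] |= (unsigned char)((code & 3u) << shift); /* write slot */
}

/* Hypothetical helper: read back the 2-bit code of letter number idx. */
unsigned get_code(const Chain *c, size_t idx)
{
    unsigned shift = 6 - 2 * (unsigned)(idx % 4);

    return (c->Sequence[idx / 4] >> shift) & 3u;
}
```

Starting from a zeroed Chain, storing C (code 1) at index 0 and T (code 3) at index 1 leaves Sequence[0] equal to 0x70, which is the pattern 0111xxxx with the unused bits set to 0.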

You are to write a menu-driven program to alter (splice, excise, or replace) sequences. The menu will contain the following options:

1. Read a DNA sequence from a file
2. Save the current sequence to a file
3. Print the current sequence
4. Splice and insert a sub-sequence
5. Remove a sub-sequence
6. Replace a sub-sequence with another sub-sequence
7. Exit the program
User input to the menu will consist of the numbers 1-7.  Do not use any additional or different input values. The functionality of each option is described below, in menu order:

1. When this option is selected, prompt the user to enter the name of a file containing a DNA sequence.  This file will be in binary format and will contain the data for a single Chain structure.  If the file can be opened successfully, the data in the file is used to initialize a variable of type Chain.  Otherwise, an error message is written to the screen, and the program returns to the main menu.  Note: This option must be chosen before choosing options 2-6.  You should check that a file was successfully read before allowing options 2-6 to be executed.
2. When this option is selected, prompt the user to enter a filename in which to save the data in the Chain struct containing the current sequence.  The format of the file should be the same as described for option 1 (a single Chain structure in binary format).  If the output file cannot be opened successfully, an error message is written to the screen, and the program returns to the main menu.
3. When this option is selected, the sequence of letters is printed to the screen.
4. When this option is selected, your program will prompt the user for two things:
The sub-sequence to insert - This will be a string of Nucleotide letters, and will be read from the console (not a file).  Save this input in a Chain struct variable. (Note: Use strlen() to find the length of the input string, and DON'T forget to remove the trailing '\n').
The place to insert the sequence
The sub-sequence will be placed after ALL instances of the given place to insert. For example:
Suppose the current sequence is: CATAGGTACCAGGTACA
The sequence to insert is: ACATGA
The place to insert is:  GGT
Your program will search for all instances of GGT in the current sequence
CATAGGTACCAGGTACA
and insert the sub-sequence (shown in lower case for this example) after each of those instances:
CATAGGTacatgaACCAGGTacatgaACA
So, the result after this insertion will be:
CATAGGTACATGAACCAGGTACATGAACA

Note:  If the inserted sub-sequence itself contains the sub-sequence that marks the place to insert, the inserted text is NOT treated as another place to insert. (What could happen if it were?)
5. When this option is selected, your program will prompt the user for a sub-sequence.  This sub-sequence will be entered from the console (not a file).  Your program will then search for the given sub-sequence throughout the entire current sequence and remove ALL instances of the given sub-sequence.  For example, if the current sequence is:
CATAGGTACATGAACCAGGTACATGAACA
and the entered sub-sequence is:
GGTA
the resulting sequence is:
CATACATGAACCACATGAACA
 
6. When this option is selected, your program will prompt the user for a sub-sequence to remove.  This sub-sequence will be entered from the console (not a file).  The program will then prompt the user for a sub-sequence to replace the removed sub-sequence. This sub-sequence will also be entered from the console. Note that the two sequences do not necessarily have to be the same length.  Your program will then search for the first sub-sequence throughout the entire current sequence and replace ALL instances of this sub-sequence with the second sub-sequence.  For example, if the current sequence is:
CATAGGTACATGAACCAGGTACATGAACA
the sub-sequence to remove is:
GGTA
and the replacement sub-sequence is:
AACGTGA
the resulting sequence is:
CATAAACGTGACATGAACCAAACGTGACATGAACA
7. If this option is selected, print an appropriate closing message, and exit the program.
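The insert, remove, and replace options above share one pattern: find every occurrence of a sub-sequence and write something in its place. The following is a minimal sketch on C strings, assuming the sequence has already been converted from its packed form and lives in a buffer of at least 401 chars. The names append, replace_all, insert_after, and remove_all are hypothetical, and capping the result at 400 letters is an assumption based on the size of the Chain structure.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LETTERS 400   /* assumed cap: 4 * MAX_CHAIN_BYTES */

/* Copy up to n chars of src to out + out_len without exceeding
   MAX_LETTERS; returns the new output length. */
static size_t append(char *out, size_t out_len, const char *src, size_t n)
{
    size_t room = MAX_LETTERS - out_len;

    if (n > room)
        n = room;
    memcpy(out + out_len, src, n);
    return out_len + n;
}

/* Hypothetical helper: replace every occurrence of find in seq with
   repl. seq must have room for MAX_LETTERS + 1 chars. The search runs
   over the ORIGINAL text only, so replacement text is never rescanned.
   Returns the number of replacements (0 leaves seq unchanged). */
size_t replace_all(char *seq, const char *find, const char *repl)
{
    char out[MAX_LETTERS + 1];
    size_t out_len = 0, count = 0;
    size_t find_len = strlen(find), repl_len = strlen(repl);
    const char *cur = seq, *hit;

    if (find_len == 0)
        return 0;
    while ((hit = strstr(cur, find)) != NULL) {
        out_len = append(out, out_len, cur, (size_t)(hit - cur));
        out_len = append(out, out_len, repl, repl_len);
        cur = hit + find_len;
        count++;
    }
    if (count == 0)
        return 0;
    out_len = append(out, out_len, cur, strlen(cur));
    out[out_len] = '\0';
    strcpy(seq, out);   /* out and seq are distinct buffers */
    return count;
}

/* Insert ins after every occurrence of place (place -> place + ins). */
size_t insert_after(char *seq, const char *place, const char *ins)
{
    char repl[2 * MAX_LETTERS + 1];

    if (strlen(place) > MAX_LETTERS || strlen(ins) > MAX_LETTERS)
        return 0;
    strcpy(repl, place);
    strcat(repl, ins);
    return replace_all(seq, place, repl);
}

/* Remove every occurrence of find (replace with the empty string). */
size_t remove_all(char *seq, const char *find)
{
    return replace_all(seq, find, "");
}
```

With seq holding CATAGGTACCAGGTACA, insert_after(seq, "GGT", "ACATGA") produces CATAGGTACATGAACCAGGTACATGAACA, and a return value of 0 from any of the three functions can drive the "not found" message.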
As in past assignments, part of the grading of your code will be performed by a script (sample scripts will be provided), and any additional or missing prompts will cause your program to fail to run correctly.
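Because the script is sensitive to prompts, it may help to keep the menu rules (input limited to 1-7, and options 2-6 unavailable until option 1 has succeeded) in functions that print nothing. A minimal sketch; parse_choice and choice_allowed are hypothetical names, and ignoring trailing whitespace after the number is an assumption:

```c
#include <stdlib.h>

/* Hypothetical helper: parse one line of menu input.
   Returns the choice 1-7, or 0 if the line is not a valid choice. */
int parse_choice(const char *line)
{
    char *end;
    long value = strtol(line, &end, 10);

    if (end == line)
        return 0;                       /* no digits at all */
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;                          /* skip trailing whitespace */
    if (*end != '\0' || value < 1 || value > 7)
        return 0;
    return (int)value;
}

/* Hypothetical helper: options 2-6 need a sequence that was read
   successfully by option 1; options 1 and 7 are always allowed.
   loaded is nonzero once a file has been read. */
int choice_allowed(int choice, int loaded)
{
    if (choice < 1 || choice > 7)
        return 0;
    if (choice >= 2 && choice <= 6)
        return loaded != 0;
    return 1;
}
```

The caller decides what to print when either function returns 0, so the exact wording can follow the sample scripts.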

Other Specifications and Additional Information

You must include a Makefile to compile your project.
Portions of the project will be graded using a script (a sample will be provided).  It is important that your program works with any sample scripts.  Otherwise, your overall score for the project may be lowered considerably.
If a modification would make the sequence longer than the maximum length of a sequence, truncate the end of the sequence so that it is no longer than 4 * MAX_CHAIN_BYTES letters.
For options 4, 5, and 6, if the entered sub-sequence (for replacement, removal, or insertion after) cannot be found in the current sequence, the sequence should not be modified, and an informational message (such as "sub-sequence not found") should be printed to the screen.
If an input sequence that is read from the console contains any characters other than A, C, G, or T, print an error message and re-prompt for a new sequence until a correct sequence is entered.
Data in the binary files is assumed to be correct (by nature of the previous two bullets).
As always, the use of global variables and variable length arrays is forbidden.
All source and tarfiles should follow the standard CS262 naming conventions and contain appropriate comments.
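The console-input rule above (only A, C, G, and T are accepted, with a re-prompt otherwise) can be sketched as follows. The names strip_newline, is_valid_sequence, and read_sequence are hypothetical; rejecting an empty line, accepting upper case only, and the wording of the error message are all assumptions, and the prompt itself is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Cut the string at the first '\r' or '\n', removing the line ending
   that fgets() leaves in the buffer. */
void strip_newline(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}

/* Return 1 if s is non-empty and contains only the letters A, C, G, T.
   Assumption: an empty line counts as invalid. */
int is_valid_sequence(const char *s)
{
    return s[0] != '\0' && s[strspn(s, "ACGT")] == '\0';
}

/* Hypothetical helper: read lines from in until one is a valid
   sequence. Returns 1 with the sequence in buf, or 0 at end of input.
   A line longer than the buffer is read in pieces, so size should
   exceed the longest expected input. */
int read_sequence(FILE *in, char *buf, int size)
{
    while (fgets(buf, size, in) != NULL) {
        strip_newline(buf);
        if (is_valid_sequence(buf))
            return 1;
        printf("Invalid sequence: use only A, C, G, and T\n");
    }
    return 0;
}
```

For example, read_sequence(stdin, buf, sizeof buf) keeps reading until the user types a line such as GATTACA.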
Some Helpful Hints:

Although the data must be read and written from/to the files in binary format, you can perform most of the other operations using cstrings.  However, you will have to convert the data from or to binary when reading or writing the files.
You may find the following String Library functions useful (check man pages for proper usage):
strstr()
strcpy()
strlen()
strcat()
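You can index within portions of a string by using the & (address-of) operator.  For example, to remove the 7th and 8th characters from a string named str, copy the tail of the string over them. Because the source and destination overlap, use memmove() rather than strcpy(), whose behavior is undefined for overlapping strings:
memmove(&str[6], &str[8], strlen(&str[8]) + 1);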
Strategic use of the null character ('\0') after copying sequences to temporary cstrings may help with insertion and deletion of sub-sequences.
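The conversion between the packed form and a cstring, together with the binary file I/O, might look like the sketch below. The function names are hypothetical. It assumes the file holds the raw bytes of one Chain structure as written by fwrite() on the same platform, and that strings passed to string_to_chain have already been validated (any other character is stored as A).

```c
#include <stdio.h>
#include <string.h>

#define MAX_CHAIN_BYTES  100

typedef struct _Chain
{
    size_t SeqLen;  /* Number of letters in sequence */
    unsigned char Sequence[MAX_CHAIN_BYTES];
} Chain;

/* Pack a cstring of A/C/G/T into c, truncating to 400 letters.
   Unused bits are left as 0. */
void string_to_chain(const char *s, Chain *c)
{
    static const char letters[] = "ACGT";
    size_t i, n = strlen(s);

    if (n > 4 * MAX_CHAIN_BYTES)
        n = 4 * MAX_CHAIN_BYTES;
    memset(c, 0, sizeof *c);
    for (i = 0; i < n; i++) {
        const char *p = strchr(letters, s[i]);
        unsigned code = p ? (unsigned)(p - letters) : 0u;
        c->Sequence[i / 4] |= (unsigned char)(code << (6 - 2 * (i % 4)));
    }
    c->SeqLen = n;
}

/* Unpack c into out, which needs room for 4 * MAX_CHAIN_BYTES + 1 chars. */
void chain_to_string(const Chain *c, char *out)
{
    size_t i, n = c->SeqLen;

    if (n > 4 * MAX_CHAIN_BYTES)
        n = 4 * MAX_CHAIN_BYTES;
    for (i = 0; i < n; i++)
        out[i] = "ACGT"[(c->Sequence[i / 4] >> (6 - 2 * (i % 4))) & 3];
    out[n] = '\0';
}

/* Write one Chain in binary; returns 1 on success, 0 on failure. */
int save_chain(const char *path, const Chain *c)
{
    FILE *f = fopen(path, "wb");
    int ok;

    if (f == NULL)
        return 0;
    ok = fwrite(c, sizeof *c, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    return ok;
}

/* Read one Chain in binary; returns 1 on success, 0 on failure. */
int load_chain(const char *path, Chain *c)
{
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return 0;
    ok = fread(c, sizeof *c, 1, f) == 1;
    fclose(f);
    return ok;
}
```

Packing GATTACA this way gives Sequence[0] equal to 0x8F (10 00 11 11) and Sequence[1] equal to 0x10 (00 01 00 followed by two unused zero bits).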
