CSCI3003 Lab Assignment 5: Analyzing SNP Data

Part I: Practice with Functions  
Write a function to check if a string represents a valid Homo sapiens genome locus identifier, e.g. 6p21.3, 11q1.4, 22p11.2. A human locus name consists of a chromosome (a number between 1 and 22, or an X or Y), p (short) or q (long) denoting the chromosomal arm, a band number, a period, and a sub-band number. Your function name should be: is_valid_humanlocus

Assume that the function would be used in the following context, where we’d like to check if a string has a particular pattern, and if so, execute a set of statements:  

 

if is_valid_humanlocus(string):
    # <do something here>

 

Your function should return a boolean value.

 

Once you have defined your function, write code that uses assert statements to test your function. Specifically, do each of the following: 

a.     Add assertion statements to your script to check that your function returns True for the following examples:  '6p21.3', '11q1.4', and '22p11.2'

b.     Add assertions to check that your function returns False for the following examples:

'chr1:1000', 'nonsense', and '2a11p'

c.      Write two additional assertion statements that check invalid examples that you come up with on your own, and explain why you chose them (e.g. is an element out of range? did you expect a number here?)
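
One possible way to approach Part I is sketched below, using a regular expression from Python's standard re module. This is only an illustration of the locus description given above, not the required solution; the name _LOCUS_PATTERN and the two extra invalid examples are choices made for this sketch.

```python
import re

# Chromosome 1-22, X or Y; arm p or q; band digits; a period; sub-band digits.
# (_LOCUS_PATTERN is a name chosen for this sketch.)
_LOCUS_PATTERN = re.compile(r"(?:[1-9]|1[0-9]|2[0-2]|X|Y)[pq][0-9]+\.[0-9]+")


def is_valid_humanlocus(text):
    """Return True if text looks like a human locus identifier such as '6p21.3'."""
    # fullmatch requires the whole string to fit the pattern, not just a prefix.
    return _LOCUS_PATTERN.fullmatch(text) is not None


# (a) valid examples from the handout
assert is_valid_humanlocus('6p21.3')
assert is_valid_humanlocus('11q1.4')
assert is_valid_humanlocus('22p11.2')

# (b) invalid examples from the handout
assert not is_valid_humanlocus('chr1:1000')
assert not is_valid_humanlocus('nonsense')
assert not is_valid_humanlocus('2a11p')

# (c) two extra invalid examples picked for this sketch
assert not is_valid_humanlocus('23p11.2')  # chromosome number out of range
assert not is_valid_humanlocus('6p21')     # period and sub-band are missing
```

Because the function returns a plain boolean, it can be used directly in the `if is_valid_humanlocus(string):` context shown earlier.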

 

Part II: Analyzing SNP Data
  

For this problem you will analyze data from sets of single nucleotide polymorphisms (SNPs) that commonly vary in the human population. There are two datasets, extracted from http://23andme.com, one from the fictitious male, Greg Mendel, and the other from his wife, Lilly Mendel.

a.     The data in these files are poorly formatted; you will need a set of Python string expressions to properly extract all of the information. Parse out the SNP id, chromosome, position and SNPs for each row. For example the first row,

rs3094315chr1-742429(A,G) could be parsed to:

 

               id               Chr              Position         SNP1             SNP2

               rs3094315        1                742429           A                G
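
As a rough illustration of part (a), the example row can be taken apart with ordinary string methods. The sketch assumes every row has exactly the same layout as the example (id, then "chr", chromosome, a dash, position, and two bases in parentheses); the real files may contain rows that need extra handling. The function name parse_snp_row is made up for this sketch.

```python
def parse_snp_row(row):
    """Split one row such as 'rs3094315chr1-742429(A,G)' into its five fields.

    Assumes the row follows the layout of the handout's example exactly.
    """
    row = row.strip()
    snp_id, rest = row.split('chr', 1)        # 'rs3094315', '1-742429(A,G)'
    chrom, rest = rest.split('-', 1)          # '1', '742429(A,G)'
    position, bases = rest.split('(', 1)      # '742429', 'A,G)'
    snp1, snp2 = bases.rstrip(')').split(',')
    return snp_id, chrom, int(position), snp1, snp2


print(parse_snp_row('rs3094315chr1-742429(A,G)'))
# prints ('rs3094315', '1', 742429, 'A', 'G')
```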

 

 

b.     Once you’ve finished part (a), use your code to define a function called read_SNP_file, which you then call from your main script to process both Greg and Lilly’s data.  The function should accept a string with the file name as an argument and return a data structure with all of the individual’s SNP information.  Also, add an assert statement inside this function to guarantee that the chromosome number is valid (we’ve only given you the data from the autosomes, so all SNPs should be on chromosomes 1-22).

            Hint: a dictionary for each person, each one containing 4 parallel lists (e.g. the key “Chr” is associated with a list with the chromosome values, the key “Position” is associated with a list of the position values) is a reasonable data structure for this type of data.
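
A minimal sketch of read_SNP_file along the lines of the hint might look like the following. It assumes each non-empty line has the same layout as the example row in part (a) and that the file has no header line. The keys "Chr" and "Position" come from the hint; the keys "id" and "SNPs" (the two bases joined into one string, e.g. "AG") are choices made for this sketch.

```python
def read_SNP_file(file_name):
    """Read one person's SNP file into a dictionary of four parallel lists."""
    data = {'id': [], 'Chr': [], 'Position': [], 'SNPs': []}
    with open(file_name) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Same string splitting as in part (a); assumes the example layout.
            snp_id, rest = line.split('chr', 1)
            chrom, rest = rest.split('-', 1)
            position, bases = rest.split('(', 1)
            snp1, snp2 = bases.rstrip(')').split(',')

            chrom = int(chrom)
            # Only autosomes are expected in the provided data.
            assert 1 <= chrom <= 22, f"unexpected chromosome {chrom} for {snp_id}"

            data['id'].append(snp_id)
            data['Chr'].append(chrom)
            data['Position'].append(int(position))
            data['SNPs'].append(snp1 + snp2)
    return data
```

Keeping the rows in file order means index i in every list refers to the same SNP, which is what makes the lists "parallel".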

  

c.      On Chromosome 10, find the largest region of shared SNPs between Lilly and Greg. The answer will be in the form of a pair of genomic coordinates (Position1, Position2). Below is an example of a region of shared SNPs (the rows at positions 31,123 through 31,625, where Lilly and Greg have identical genotypes). In this case, report the shared region as (31123, 31625).

 

               Chromosome       Position         Lilly             Greg 

               10               31,000           AA                AT 

               10               31,123           TT                TT 

               10               31,319           AT                AT 

               10               31,625           CC                CC 

               10               31,779           GA                CC 

 

(Hint: if you’ve left your SNPs in genome position order in your lists, you can iterate through the list to find stretches of SNPs that are identical)
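
The hint can be sketched as a single pass that tracks the current run of identical SNPs and remembers the longest one seen. Two assumptions are made here that the handout does not settle: "largest" is taken to mean the most consecutive shared SNPs (not the widest span in base pairs), and genotypes are compared without regard to base order, so "AT" and "TA" count as the same. The function name and argument names are made up for this sketch, and the three lists are assumed to be already restricted to chromosome 10 and aligned row by row.

```python
def longest_shared_region(positions, genotypes_a, genotypes_b):
    """Return (start, end) positions of the longest run of matching genotypes.

    Returns None if no SNP is shared. Ties keep the first run found.
    """
    best = None       # (start, end) of the longest run so far
    best_len = 0
    run_start = None  # position where the current run began
    run_len = 0
    for pos, a, b in zip(positions, genotypes_a, genotypes_b):
        if sorted(a) == sorted(b):      # order-insensitive comparison (assumption)
            if run_len == 0:
                run_start = pos
            run_len += 1
            if run_len > best_len:
                best_len = run_len
                best = (run_start, pos)
        else:
            run_len = 0                 # a mismatch ends the current run
    return best


# The example table from part (c)
positions = [31000, 31123, 31319, 31625, 31779]
lilly = ['AA', 'TT', 'AT', 'CC', 'GA']
greg = ['AT', 'TT', 'AT', 'CC', 'CC']
print(longest_shared_region(positions, lilly, greg))
# prints (31123, 31625)
```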

 

d.     The SNP_Definitions.txt file contains information about the effects of various SNPs. Load the SNP definitions into a data structure so that you can look up a description given a SNP id and the bases. (HINT: use a dictionary with the SNP id as the key)
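
One way to organize the lookup in part (d) is a dictionary of dictionaries: the outer key is the SNP id and the inner key is the bases. The sketch below assumes SNP_Definitions.txt is tab-delimited with a header row; only the "Description" column name appears in this handout, so the other column names ("SNP" and "Genotype") are placeholders that should be replaced with whatever the real file uses.

```python
import csv


def load_snp_definitions(handle, id_col='SNP', bases_col='Genotype',
                         desc_col='Description'):
    """Build {snp_id: {bases: description}} from an open, tab-delimited file.

    The default column names for id and bases are assumptions for this sketch.
    """
    definitions = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        # setdefault creates the inner dictionary the first time an id is seen.
        definitions.setdefault(row[id_col], {})[row[bases_col]] = row[desc_col]
    return definitions
```

With a structure like this, a description could be looked up as definitions[snp_id][bases] after calling the function on a file opened with open('SNP_Definitions.txt').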

 

e.     Use the information you read in from SNP_Definitions.txt to determine what the region between 22070000 and 22106000 on chromosome 9 suggests about Greg’s chance of heart disease. What about Lilly’s chance of heart disease? (Hint: find the SNPs from this region, and use the information from the ‘Description’ column to guide your reasoning)
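
Finding the SNPs in a region, as the hint for part (e) suggests, can be sketched as a filter over the parallel lists. The sketch assumes each person's data is a dictionary with the keys "id", "Chr", "Position" and "SNPs" ("Chr" and "Position" follow the earlier hint; the other two key names are assumptions), with chromosomes and positions stored as integers. The function name snps_in_region is made up for this sketch.

```python
def snps_in_region(person, chrom, start, end):
    """Return (id, position, bases) for each SNP on chrom within [start, end]."""
    hits = []
    rows = zip(person['id'], person['Chr'], person['Position'], person['SNPs'])
    for snp_id, snp_chrom, position, bases in rows:
        if snp_chrom == chrom and start <= position <= end:
            hits.append((snp_id, position, bases))
    return hits
```

Calling snps_in_region(greg, 9, 22070000, 22106000) on Greg's data would give the ids and bases to look up in the SNP definitions; the same call on Lilly's data gives hers.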

 

 

f.       Find a SNP locus that interests you at SNPedia.com. Describe what is known about the locus. Also, check what the SNP status is in both Lilly and Greg. What does the SNP suggest about their health?

 
