EECS730-Homework 1 Solved

Project 1: Retrieving the Protein Sequences from the Human Genome 

 

Objectives: 

1.       Be familiar with the human genome and major genome annotation databases 

2.       Be familiar with the central dogma 

3.       Be familiar with alternative splicing and the codon table 

4.       Be familiar with the FASTA format for storing biological sequences 

 

Task: 

       Retrieve the sequences of all proteins encoded in the human genome. 

 

Hints:

(1): Explore the UCSC (U. California Santa Cruz) Genome Browser website (genome.ucsc.edu). Try to find where to download the human genome. (If you can’t, here is the link: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz)
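Once hg38.fa.gz is downloaded, the script needs the chromosome sequences in memory. The following is a minimal sketch of one way to read them, assuming the file is stored locally; the function names read_fasta and load_genome are hypothetical, not part of the assignment. Note that the uncompressed genome is several gigabytes, and that UCSC writes repeat-masked bases in lower case, so sequences should be upper-cased before translation.

```python
import gzip
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence name: sequence}."""
    sequences: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the first word of the header, e.g. "chr1".
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def load_genome(path: str) -> Dict[str, str]:
    """Read a gzip-compressed FASTA file (hypothetical local path)."""
    with gzip.open(path, "rt") as handle:
        return read_fasta(handle)
```

For example, load_genome("hg38.fa.gz") would return a dictionary keyed by chromosome name ("chr1", "chr2", and so on), assuming the file sits in the working directory.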

 

(2): Use the Table Browser of the website to obtain the human genome annotation. (From the top bar, under “Tools”, select “Table Browser”).

 

(3): Make the following selection:

  

            clade: Mammal
            genome: Human
            assembly: Dec. 2013 (GRCh38/hg38)
            group: Genes and Gene Predictions
            track: NCBI RefSeq
            table: RefSeq All (ncbiRefSeq)
            region: genome
            output file: [make your own selection]

Then click “get output”. (I strongly recommend clicking “describe table schema” to understand the meaning of the table. This is where I will direct you if you ask me what each field of the table means.)
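The downloaded table can be read one transcript per row. The sketch below assumes the output format was “all fields from selected table”, which gives a tab-separated file whose first line is a header starting with “#”, and it assumes the column names shown by “describe table schema” (name, chrom, strand, cdsStart, cdsEnd, exonStarts, exonEnds, name2). The Transcript type and function names are hypothetical. UCSC table coordinates are 0-based with an exclusive end, which matches Python slicing.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Transcript(NamedTuple):
    name: str      # RefSeq accession, e.g. NM_...
    name2: str     # gene symbol
    chrom: str
    strand: str    # "+" or "-"
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]  # (start, end) pairs, 0-based, end exclusive


def parse_int_list(field: str) -> List[int]:
    """Turn a comma-separated field such as '100,150,' into [100, 150]."""
    return [int(x) for x in field.split(",") if x]


def parse_refseq_table(lines: Iterable[str]) -> Iterator[Transcript]:
    """Yield one Transcript per data row of the Table Browser output."""
    columns = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header row: remember the column order instead of hard-coding it.
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("expected a '#'-prefixed header before data rows")
        row = dict(zip(columns, line.split("\t")))
        starts = parse_int_list(row["exonStarts"])
        ends = parse_int_list(row["exonEnds"])
        yield Transcript(
            name=row["name"],
            name2=row["name2"],
            chrom=row["chrom"],
            strand=row["strand"],
            cds_start=int(row["cdsStart"]),
            cds_end=int(row["cdsEnd"]),
            exons=list(zip(starts, ends)),
        )
```

Because each row is a transcript rather than a gene, a gene with several splice variants appears as several rows that share the same name2 value.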

 

(4): Obtain the human codon table from https://www.genscript.com/tools/codon-frequency-table. Note that you need to select “Human” from “Expression Host Organism”.
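For translation only the codon-to-amino-acid mapping is needed, not the usage frequencies. As a minimal sketch, the standard genetic code (which applies to human nuclear genes) can be built compactly instead of being typed in from the web page. The names CODON_TABLE and translate are hypothetical, and the choices to stop at the first stop codon and to emit “X” for codons containing unknown bases are assumptions, not requirements stated in the assignment.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

The input is expected to be the spliced coding sequence read in the 5' to 3' direction of the transcript, starting at the start codon.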

 

(5): Write a script to obtain all protein sequences encoded in the human genome. Your output should be in the multi-FASTA format, which looks like:

 

>ID1
Sequence 1…
>ID2
Sequence 2…

 

The ID field describes what the sequence is. You should use the concatenation (with colon “:” as the delimiter) of the RefSeq table name and name2 fields as the ID. For example, for the first record in the RefSeq table, the corresponding ID should be “>NM_001276352.2:C1orf141”.

 

The sequence field simply records the corresponding sequence, all in one line. For example:

MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
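Before a transcript can be translated, its coding sequence has to be assembled from the genome: the exons are clipped to the coding region, joined in order, and reverse-complemented for genes on the minus strand. The following is a rough sketch of that step and of the record format described above, assuming 0-based, end-exclusive coordinates as used in UCSC tables. The function names are hypothetical.

```python
from typing import List, Tuple

# Translation table for complementing bases, keeping case and N as-is.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cds_sequence(
    chrom_seq: str,
    strand: str,
    cds_start: int,
    cds_end: int,
    exons: List[Tuple[int, int]],
) -> str:
    """Splice together the coding part of each exon, 5' to 3' on the transcript."""
    pieces = []
    for exon_start, exon_end in exons:
        # Clip the exon to the coding region; UTR-only exons drop out.
        start = max(exon_start, cds_start)
        end = min(exon_end, cds_end)
        if start < end:
            pieces.append(chrom_seq[start:end])
    # Upper-case because the genome file uses lower case for masked repeats.
    cds = "".join(pieces).upper()
    return reverse_complement(cds) if strand == "-" else cds


def fasta_record(name: str, name2: str, protein: str) -> str:
    """Format one record: '>name:name2' header, then the sequence on one line."""
    return f">{name}:{name2}\n{protein}\n"
```

Rows with no coding region (cds_start equal to cds_end) produce an empty string here, so a script would likely want to skip them rather than write an empty record.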
