$25
This lab will focus on statistical analysis of gene expression data using both microarray and RNA-seq datasets. Specifically, we will investigate genes whose expression is different between estrogen receptor (ER) positive and ER-negative breast cancer tumors. The ER status reflects whether a tumor will be responsive to hormone therapies and is often used as an important diagnostic in determining the course of treatment. In this lab, we will examine gene expression differences between breast cancer tumor samples using both microarray and RNA-Seq expression data.
Part I: Understanding the breast cancer study
This week’s lab will focus on understanding two expression datasets, loading the data into R, and performing some manipulations on the data. Before we begin doing statistical analyses of the gene expression data, it is important that you take some time to understand the study. We will be examining expression data generated from microarrays as well as from RNA-Seq. These datasets were generated from two studies (cited at the end of this document), and the data have been downloaded from the NCBI Gene Expression Omnibus and the NCBI Short Read Archive.
Answer the following questions:
a) What was the purpose of each study and what experiments did the authors do to answer the question they were interested in?
b) How many gene expression samples are in each dataset?
c) For each study, what were the two groups of samples being contrasted to measure differential gene expression?
Part II: Loading and manipulating expression data in R
To save you time, we have already downloaded the data from NCBI and prepared data files for each of the studies mentioned above. These data files can be downloaded from the Canvas Website.
Write a script that will do the following:
a) Load the microarray gene expression files into R matrices using the read.table() function. We will want to use a matrix to interact with the data, so create a matrix from the data frame you just made using the as.matrix() function. In the matrix, we want each column name to be a tumor ID and each row name to be a gene name.
● BC_MicroArray.txt - Microarray expression data: 32,864 probes (genes) x 90 samples (the values are log-transformed intensities from an Affymetrix array)
● BC_MicroArray_status.txt - The status labels for the samples (ER+ or ER-)
b) Write a short series of statements that will check the structure of the data matrices you’ve just loaded. The microarray data should contain values for 32,864 probes (genes) by 90 samples. Hint: use the “str” function we discussed during lectures.
c) Print the data for the BRCA1 gene for all samples.
d) Print the values for the first 10 genes for the first ER+ sample (GSM519791) in the microarray dataset.
e) Compute the mean and standard deviation of the expression data for the BRCA1 gene across all samples.
f) Compute the mean and standard deviation of the expression of all genes in the first ER+ sample (GSM519791).
g) Print the list of 10 genes with the highest expression values in order of decreasing expression based on the first ER+ sample (GSM519791).
h) Create a vector called ERpos_samples that contains the GSM names of all ER+ samples.
i) Create a vector called ERneg_samples that contains the names of all ER- samples.
j) Use the vectors created in Parts (h) and (i) to compute a new vector expr_difference, which contains the difference in the mean expression for each gene between the ER+ and ER- sample groups. Positive values in this vector should indicate where mean ER+ expression is greater than mean ER- expression, and negative values should indicate where ER- < ER+ expression (Hint: matrices can be indexed by name and there are functions that compute means on matrix rows and columns).
k) Using the expr_difference vector you just created, print the gene that has the largest positive difference (i.e. ER+ ER-) and the largest negative difference (i.e. ER+ < ER-) in mean expression between the ER+ and ER- groups. (Hint: you may find the names function useful here)
Part III: Statistical analysis of microarray gene expression data
For this part of the lab, we will find a subset of the genes that are differentially expressed using the t-test. The purpose of the t-test is to statistically assess a gene’s difference in expression across two groups of samples (in this case, ER+ and ER-), and answer the yes/no question “is this difference significant?”
To perform a t-test to determine if one gene’s differential expression is significant, you must first calculate a t-statistic. For a single gene, this is:
𝑡
where 𝑋1 is the mean of the ER+ expression levels for that gene, 𝑋2 is the mean ER- expression level, 𝑁1 and 𝑁2 are the number of samples per group (ER+/-), 𝑠𝑥21 is the variance of the ER+ expression values for that gene and 𝑠𝑥22 is the variance of the gene’s ER- expression.
a) Create a vector of t-statistics measuring the difference in expression for each gene between the ER+ and ER- samples (this vector should be 32864 x 1 or 1 x 32864). (Hint:
the functions mean(), var(), and length() or dim() will be useful).
b) Use the hist() function to plot a histogram of the t-statistics you just computed. (Hint:
hist(myvect,breaks=100)) will create a histogram of the values in myvect with 100 bins.
c) Your goal is to answer the yes/no question of whether or not each gene’s expression value is significantly different from the random expectation in each sample. To do this, you will need to compute a p-value that reflects the significance of the t-statistics you just measured. A p-value represents the probability that the observed t-statistic was derived from samples where there was no significant difference in the means of the two groups. Thus, a small pvalue reflects a case in which the gene is very likely differentially expressed between the two sets of samples. Typically, a cutoff of 0.05 is used; genes with less than a 0.05 p-value are called significantly differentially expressed.
Modify the code you created above to also compute a vector of p-values, one for each gene. (Hint: you will need to use the pt() function here, which computes the integral under the tdistribution for you; N = N1+N2) Because we want to test for positive and negative differences between the two groups at the same time, we will use a two-sided t-test, which computes the integral under the t-distribution for the ranges |𝑡_𝑠𝑡𝑎𝑡| and |𝑡_𝑠𝑡𝑎𝑡| →
. Remember, this is how we compute the probability of observing, under the null
hypothesis that the gene is not differentially expressed, a value at least as extreme as the one you observed (on either side of the t-distribution). You can compute this probability using built-in R functions as follows:
pval = 2 * pt(-abs(tstat),N-2) where N is the total number of tumor samples (both ER+ and ER-).
(pval = pt(tstat,N-2) is appropriate for performing a one-sided t-test)
d) Extra credit (2pts):
● Explain in detail why this works for computing the p-value for a one-sided test (1pt): pval = pt(-abs(tstat),N-2);
● Explain in detail why this works for computing the p-value for a two-sided test (1pt): pval = 2 * pt(-abs(tstat),N-2);
e) Using the results you computed for parts 1-3, answer the following questions by writing short sections of R code:
i.) Print a list of significantly differentially expressed genes (p-value < 0.05), sorted in order of their significance (from smallest to largest). Include the gene name, the tstatistic, and the corresponding p-value in your list. Hint: either the order function, or the sort function with the index.return=TRUE option, will be useful. ii.) How many genes are significantly differentially expressed at a p-value < 0.05? How many genes are significantly differentially expressed at a Bonferroni-corrected pvalue < 0.05?
iii.) How many genes are significantly differentially expressed and expressed more highly in ER+ tumors relative to ER- tumors? iv.) How many genes are significantly differentially expressed and expressed more lowly in ER+ tumors relative to ER- tumors?
v.) Select a gene from the top 10 most significantly differentially expressed and describe what’s known about its function (http://www.genecards.org will be useful).
Part IV: Using R Packages to Process RNA-Seq Data
As we discussed in class, RNA-Seq technology is the current technology of choice for measuring gene expression profiles. Typical RNA-Seq datasets contain hundreds of millions of raw sequencing reads that need to be processed with specialized methods in order to perform statistical analysis. Here, we have processed an RNA-Seq breast cancer data set comparing ER+ and ER- gene expression differences into raw read counts per gene. This allows you to run a comparable differential expression test based on RNA-Seq data. In this portion of the lab, you will download and run an R package that will identify differentially expressed genes related to ER+/- status.
Open the provided script, Part4.R, and run the code. You will need to modify/add code for parts b and f. For the remaining parts, just study the code that we’ve already written for you.
a) Run the code to download and install DESeq2
b) Add code to read in RNA-Seq files:
● BC_RNAseq.txt – RNA-Seq expression count data (18517 genes X 7 samples)
● BC_RNAseq_status.txt – RNA-Seq expression sample status labels (ER+ or ER-)
c) Convert status table into a factor and create DESeq2 Count Data Set
d) Normalize for size effects and estimate dispersion of the data
e) Calculate differential expression
f) Print the most significant 100 genes based on their adjusted p-value (adjusted p-value is equivalent to the False Discovery Rate (FDR)). What is the corresponding FDR for these 100 genes?
g) Extra credit : how many of the top 500 most significantly differentially expressed genes overlap between the microarray and RNA-Seq analyses? Comment on the degree of overlap—is this what you expected? Note: you will need to clean the RNAseqderived gene names before matching them to the microarray-derived gene names.