The Normal Distribution
Recall from your probability class that a random variable X is normally distributed with mean µ and variance σ² (denoted X ∼ N(µ, σ²)) if it has a probability density function, or pdf, equal to
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
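As a quick numerical check of this density, the sketch below codes the usual N(µ, σ²) pdf by hand and compares it with R's built-in dnorm(). The helper name normal_pdf and the choice µ = 10, σ = 3 are illustrative assumptions, not part of the assignment.

```r
# Hand-coded N(mu, sigma^2) density: 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2)).
# normal_pdf is a hypothetical helper name used only for this illustration.
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(0, 20, by = 0.5)                      # grid of evaluation points
by_hand  <- normal_pdf(x, mu = 10, sigma = 3)  # formula evaluated directly
built_in <- dnorm(x, mean = 10, sd = 3)        # base R normal density

# TRUE when the two agree up to floating-point tolerance
all.equal(by_hand, built_in)
```

Note that dnorm(), like rnorm(), is parameterized by the standard deviation σ rather than the variance σ².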
In R we can simulate N(µ,σ2) random variables using the rnorm() function. For example,
rnorm(n = 5, mean = 10, sd = 3)
## [1] 8.120639 10.550930 7.493114 14.785842 10.988523
outputs 5 normally distributed random draws with mean equal to 10 and standard deviation (this is σ, not σ²) equal to 3. If the second and third arguments are omitted, the default values are mean = 0 and sd = 1, which gives the “standard normal distribution”.
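Because the draws are random, rerunning rnorm() gives different numbers each time unless the random seed is fixed. The sketch below fixes a seed (the value 1 is an arbitrary choice for illustration) and shows that omitting mean and sd yields standard normal draws, of which the N(10, 3²) draws are a shifted and rescaled version.

```r
# Fix the seed so the draws are reproducible; the seed value is arbitrary.
set.seed(1)
draws <- rnorm(n = 5, mean = 10, sd = 3)
draws

# Reset to the same seed and draw from the standard normal (defaults mean = 0, sd = 1).
set.seed(1)
z <- rnorm(5)
z

# With the same seed, the N(10, 3^2) draws equal 10 + 3 * (standard normal draws).
all.equal(draws, 10 + 3 * z)
```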
Tasks
Sample means as sample size increases
1) Generate 100 random draws from the standard normal distribution and save them in a vector named normal100. Calculate the mean and standard deviation of normal100. In words explain why these values aren’t exactly equal to 0 and 1.
# You'll want to type your response here. Your response should look like:
# normal100 <-
# Of course, your answer should not be commented out.
2) The function hist() is a base R graphing function that plots a histogram of its input. Use hist() with your vector of standard normal random variables from question (1) to produce a histogram of draws from the standard normal distribution. Remember that typing ?hist in your console will bring up the help page for the hist() function. If coded properly, these plots will be automatically embedded in your output file.
3) Repeat question (1), changing the number of draws to 10, 1,000, 10,000, and 100,000, and store the results in vectors called normal10, normal1000, normal10000, and normal100000.
4) We want to compare the means of our five sets of random draws. Create a vector called sample_means that has as its first element the mean of normal10, its second element the mean of normal100, its third element the mean of normal1000, its fourth element the mean of normal10000, and its fifth element the mean of normal100000. After you have created the sample_means vector, print its contents and use the length() function to find its length (it should be five). There are, of course, multiple ways to create this vector. Finally, explain in words the pattern we are seeing with the means in the sample_means vector.
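One of the multiple ways to build such a vector of means is to loop over the sample sizes instead of storing each sample separately. The sketch below is only an illustration under assumed names (sizes, means_by_size) and an arbitrary seed; it does not create the vectors the tasks ask for.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Sample sizes to compare (hypothetical name, not required by the tasks)
sizes <- c(10, 100, 1000, 10000, 100000)

# For each size n, draw n standard normal values and keep only their mean.
# vapply() guarantees that each result is a single numeric value.
means_by_size <- vapply(sizes, function(n) mean(rnorm(n)), numeric(1))
names(means_by_size) <- sizes

means_by_size          # one sample mean per sample size
length(means_by_size)  # number of means computed

# Theoretical standard deviation of the sample mean, 1 / sqrt(n), for comparison
1 / sqrt(sizes)
```

The last line gives a yardstick for how far from 0 each sample mean can be expected to fall.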
Sample distribution of the sample mean
5) Let’s push this a little farther. Generate 1 million random draws from a normal distribution with µ = 3 and σ² = 4 and save them in a vector named normal1mil. Calculate the mean and standard deviation of normal1mil.
6) Find the mean of all the entries in normal1mil that are greater than 3. You may want to generate a new vector first which identifies the elements that fit the criteria.
7) Create a matrix normal1mil_mat from the vector normal1mil that has 10,000 columns (and therefore should have 100 rows).
8) Calculate the mean of the 1234th column.
9) Use the colMeans() function to calculate the mean of each column of normal1mil_mat. Remember, ?colMeans will give you help documents about this function. Save the vector of column means with an appropriate name, as it will be used in the next task.
10) Finally, produce a histogram of the column means you calculated in task (9). What is the distribution that this histogram approximates (i.e. what is the distribution of the sample mean in this case)?
11) Let’s push this even farther. Generate 10 million random draws from an exponential distribution with rate parameter λ = 3 (Hint: ?rexp). Save the simulated draws in a vector named exp_10mil. Calculate the mean and standard deviation of exp_10mil. How do these numbers compare to E(X) = 1/3 and sd(X) = 1/3?
12) Create a matrix exp10mil_mat from the vector exp_10mil that has 10,000 columns (and therefore should have 1,000 rows). Use the colMeans() function to calculate the mean of each column of exp10mil_mat and save the result in a vector (the code in task (13) assumes it is named exp_means). Show the first 10 computed means.
13) Finally, produce a histogram of the column means you calculated in task (12). What is the approximate distribution that this histogram displays (i.e. what is the distribution of the sample mean in this case)? Overlay the density function of that approximating distribution on the histogram. Note: the correct code is displayed below.
# hist(exp_means,
#      main = "Histogram of Exponential Means",
#      xlab = expression(bar(X)),
#      probability = TRUE,
#      breaks = 20)
# n <- nrow(exp10mil_mat)
# mean_exp <- 1/3
# mean_exp
# sd_exp <- 1/(3 * sqrt(n))
# sd_exp
# x <- seq(0, 1, by = 0.0001)
# my_density <- dnorm(x, mean = mean_exp, sd = sd_exp)
# lines(x, my_density, col = "purple")
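The values 1/3 and 1/(3*sqrt(n)) used for the overlaid density can be checked numerically on a much smaller simulation. The sketch below is a scaled-down illustration, not the assignment's solution: the names rate, n_rows, n_cols, small_draws, small_mat and small_means, the sizes, and the seed are all assumptions chosen for this example.

```r
set.seed(2024)   # arbitrary seed, for reproducibility only

rate   <- 3      # exponential rate parameter (lambda)
n_rows <- 50     # observations per sample (illustrative, smaller than in the tasks)
n_cols <- 2000   # number of samples (illustrative)

small_draws <- rexp(n_rows * n_cols, rate = rate)

# matrix() fills column by column, so each column holds one sample of size n_rows.
small_mat <- matrix(small_draws, nrow = n_rows, ncol = n_cols)
dim(small_mat)

# One sample mean per column
small_means <- colMeans(small_mat)

# colMeans() agrees with computing the mean of a single column directly.
all.equal(small_means[7], mean(small_mat[, 7]))

# Compare the simulated mean and sd of the sample means with the values
# suggested by the central limit theorem: 1/rate and 1/(rate * sqrt(n_rows)).
c(simulated = mean(small_means), theory = 1 / rate)
c(simulated = sd(small_means),   theory = 1 / (rate * sqrt(n_rows)))
```

The number of rows plays the role of the sample size n here, which is why the code above takes n from nrow() of the matrix rather than from the number of columns.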