Introduction
This document describes your sixth assignment for Statistical Computing and Empirical Methods (Unit EMATM0061) on the MSc in Data Science. Before starting the assignment it is recommended that you first watch video lectures 13 and 14.
Begin by creating an Rmarkdown document with html output. You are not expected to hand in this piece of work, but it is a good idea to get used to using Rmarkdown.
1 A Gaussian model for Red-Tailed hawks
In this question we will fit a Gaussian model to a Red-Tailed hawk data set.
First load the Hawks data set as follows:
library(Stat2Data)
data("Hawks")
Now use your data wrangling skills to extract a subset of the Hawks data set in which every hawk belongs to the “Red-Tailed” species, keeping only the “Weight”, “Tail” and “Wing” columns. The returned output should be a data frame called “RedTailedDf” with three numerical columns and 577 examples.
Display the first five rows of the “RedTailedDf”. The resulting subset of the data frame should look as follows:
## Weight Tail Wing
## 1 920 219 385
## 2 930 221 376
## 3 990 235 381
## 4 1090 230 412
## 5 960 212 370
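One possible approach to constructing this subset, assuming the dplyr package is available and that the Red-Tailed species is coded as "RT" in the Species column:

```r
library(Stat2Data)
library(dplyr)
data("Hawks")

# keep only Red-Tailed hawks and the three columns of interest
RedTailedDf <- Hawks %>%
  filter(Species == "RT") %>%
  select(Weight, Tail, Wing)

head(RedTailedDf, 5)
```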
We now model the vector of tail lengths from “RedTailedDf” as a sequence X1, ···, Xn of independent and identically distributed Gaussian random variables with unknown population mean µ0 and unknown population variance σ0².
The maximum likelihood estimate for µ0 is given by µ̂_MLE = (1/n) ∑_{i=1}^n X_i and the maximum likelihood estimate for σ0² is given by σ̂²_MLE = (1/n) ∑_{i=1}^n (X_i − µ̂_MLE)².
Apply the maximum likelihood method to compute estimates µ̂_MLE for µ0 and σ̂²_MLE for σ0².
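Under this model the estimates can be computed directly from the “Tail” column; a minimal sketch, assuming RedTailedDf has been constructed as described above:

```r
tail_RT <- RedTailedDf$Tail   # tail lengths of the Red-Tailed hawks
n <- length(tail_RT)

mu_mle <- mean(tail_RT)                     # MLE of mu_0: the sample mean
sigma_sq_mle <- mean((tail_RT - mu_mle)^2)  # MLE of sigma_0^2: divides by n,
                                            # unlike var(), which divides by n - 1
```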
Next generate a plot which compares the probability density function for your fitted Gaussian model for the tail length of the Red-Tailed hawks with a kernel density plot. Your plot should look as follows:
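One way to build such a plot, assuming ggplot2 is loaded and RedTailedDf is as described above (the estimates are recomputed here so the snippet is self-contained):

```r
library(ggplot2)

tails <- RedTailedDf$Tail
mu_mle <- mean(tails)
sigma_mle <- sqrt(mean((tails - mu_mle)^2))

# evaluate the fitted Gaussian density over a grid of tail lengths
tail_grid <- seq(min(tails), max(tails), length.out = 200)
density_df <- data.frame(tail = tail_grid,
                         density = dnorm(tail_grid, mean = mu_mle, sd = sigma_mle))

ggplot() +
  geom_line(data = density_df, aes(x = tail, y = density, colour = "MLE fit")) +
  geom_density(data = data.frame(tail = tails), aes(x = tail, colour = "Kernel density")) +
  labs(x = "Tail length", y = "Density", colour = "Estimator")
```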
2 Location estimators with Gaussian data
In this question we compare two estimators for the population mean µ0 in a Gaussian setting in which we have independent and identically distributed data X1,··· ,Xn .
What is the population median of a Gaussian random variable Xi ?
The following code generates a data frame which estimates the mean squared error of the sample median as an estimator of µ0.
set.seed(0)
num_trials_per_sample_size <- 100
min_sample_size <- 5
max_sample_size <- 1000
sample_size_inc <- 5
mu_0 <- 1
sigma_0 <- 3

# create a data frame of all pairs of sample_size and trial
simulation_df <- crossing(trial = seq(num_trials_per_sample_size),
                          sample_size = seq(min_sample_size,
                                            max_sample_size, sample_size_inc)) %>%
  # simulate sequences of Gaussian random variables
  mutate(simulation = pmap(.l = list(trial, sample_size),
                           .f = ~ rnorm(.y, mean = mu_0, sd = sigma_0))) %>%
  # compute the sample medians
  mutate(sample_md = map_dbl(.x = simulation, .f = median)) %>%
  group_by(sample_size) %>%
  # estimate the mean squared error of the sample median
  summarise(msq_error_md = mean((sample_md - mu_0)^2))
Modify the above code to include estimates of the mean square error of the sample mean.
Generate a plot which includes both the mean square error of the sample mean and the sample median as a function of the sample size. Your plot might look something like the following:
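One possible modification, assuming the tidyverse packages used in the original snippet (tidyr, purrr, dplyr) plus ggplot2 are loaded, and reusing the simulation settings defined above:

```r
simulation_df <- crossing(trial = seq(num_trials_per_sample_size),
                          sample_size = seq(min_sample_size,
                                            max_sample_size, sample_size_inc)) %>%
  mutate(simulation = pmap(.l = list(trial, sample_size),
                           .f = ~ rnorm(.y, mean = mu_0, sd = sigma_0))) %>%
  # compute both the sample median and the sample mean for each trial
  mutate(sample_md = map_dbl(.x = simulation, .f = median),
         sample_mn = map_dbl(.x = simulation, .f = mean)) %>%
  group_by(sample_size) %>%
  summarise(msq_error_md = mean((sample_md - mu_0)^2),
            msq_error_mn = mean((sample_mn - mu_0)^2))

# reshape to long format so both estimators appear on one plot
simulation_df %>%
  pivot_longer(cols = c(msq_error_md, msq_error_mn),
               names_to = "estimator", values_to = "msq_error") %>%
  ggplot(aes(x = sample_size, y = msq_error, colour = estimator)) +
  geom_smooth() +
  labs(x = "Sample size", y = "Mean squared error")
```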
3 Unbiased estimation of the population variance
In this question we consider samples X1, ···, Xn consisting of independent and identically distributed random variables with unknown population mean µ0 and unknown population variance σ0².
Let X̄ be the sample mean, let V̂_MLE := (1/n) ∑_{i=1}^n (X_i − X̄)² and let V̂_U := (1/(n−1)) ∑_{i=1}^n (X_i − X̄)².
Conduct a simulation study which compares the bias of VˆMLE as an estimator of the population variance σ02 with the bias of VˆU as an estimator for the population variance σ02.
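A minimal sketch of one such study, using base R and illustrative parameter choices (mu_0 = 1, sigma_0 = 3, samples of size 10):

```r
# repeatedly draw samples, compute both variance estimates,
# and average the deviation from the true variance
set.seed(0)
mu_0 <- 1
sigma_0 <- 3
num_trials <- 1000
sample_size <- 10

v_mle <- replicate(num_trials, {
  x <- rnorm(sample_size, mean = mu_0, sd = sigma_0)
  mean((x - mean(x))^2)                 # divides by n
})
v_u <- v_mle * sample_size / (sample_size - 1)  # rescale to the unbiased estimator

mean(v_mle) - sigma_0^2   # estimated bias of V_MLE (negative in expectation)
mean(v_u) - sigma_0^2     # estimated bias of V_U (close to zero)
```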
Is √V̂_U = √( (1/(n−1)) ∑_{i=1}^n (X_i − X̄)² ) an unbiased estimator for σ0?
As an optional extra, give an analytic formula for the bias of VˆMLE and VˆU as estimators of σ02.
4 Maximum likelihood estimators for the Gaussian distribution
Suppose that X1,··· ,Xn are independent and identically distributed with unknown population mean µ0 and unknown population standard deviation σ0.
Give an expression for the likelihood function ℓ(µ, σ²) based upon a sample X1, ..., Xn.
Derive formulae for the partial derivatives of the log-likelihood, ∂/∂µ log ℓ(µ, σ²) and ∂/∂σ² log ℓ(µ, σ²).
Show that X̄ := (1/n) ∑_{i=1}^n X_i is the maximum likelihood estimator for µ0 and S² := (1/n) ∑_{i=1}^n (X_i − X̄)² is the maximum likelihood estimator for σ0².
What is the maximum likelihood estimator for σ0?
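One way to organise the calculation behind the two estimators above is to set the partial derivatives of the log-likelihood to zero; a sketch in LaTeX:

```latex
\log \ell(\mu, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2,
\qquad
\frac{\partial}{\partial \mu}\log \ell
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) = 0
  \;\Rightarrow\; \hat{\mu} = \bar{X},
\qquad
\frac{\partial}{\partial \sigma^2}\log \ell
  = -\frac{n}{2\sigma^2}
    + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i - \mu)^2 = 0
  \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2.
```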
5 Maximum likelihood estimation with the Poisson distribution
In this question we shall consider the topic of maximum likelihood estimation for an independent and identically distributed sample from a Poisson random variable. Recall that Poisson random variables are a family of discrete random variables with distributions supported on N0 := {0, 1, 2, 3, ···}. Poisson random variables are frequently used to model the number of events which occur at a constant rate in situations where the occurrences of individual events are independent. For example, we might use the Poisson distribution to model the number of mutations of a given strand of DNA per time unit, or the number of customers who arrive at a store over the course of a day. A classic example of statistical modelling based on a Poisson distribution is due to the statistician Ladislaus Josephovich Bortkiewicz. Bortkiewicz used the Poisson distribution to model the number of fatalities due to horse-kick per year for each group of cavalry. We shall apply maximum likelihood estimation to Bortkiewicz’s data. First let’s explore maximum likelihood estimation for Poisson random variables.
A Poisson random variable has a probability mass function pλ : R → [0,∞) with a single parameter λ > 0. The probability mass function pλ is defined for x ∈ R by

pλ(x) = λ^x e^{−λ} / x!  for x ∈ N0, and pλ(x) = 0 for x ∉ N0.
Suppose that you have a sample of independent and identically distributed random variables X1,...,Xn ∼ pλ0 i.e. X1,...,Xn are independent and each has probability mass function pλ0.
Show that for a sample X1, ..., Xn, the likelihood function ℓ : (0,∞) → (0,∞) is given by

ℓ(λ) := e^{−n·λ} · λ^{n·X̄} · ∏_{i=1}^n (1 / X_i!) ,

where X̄ = (1/n) ∑_{i=1}^n X_i is the sample mean.
Derive a formula for the derivative of the log-likelihood ∂/∂λ log ℓ(λ).
Show that λ ↦ log ℓ(λ) reaches its maximum at the single point λ = X̄. Hence, the maximum likelihood estimate for the true parameter λ0 is λ̂_MLE = X̄.
Now conduct a simulation experiment which explores the behavior of λˆMLE on simulated data. You may wish to consider a setting in which λ0 = 0.5 and generate a plot of the mean squared error as a function of the sample size.
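A minimal sketch of such an experiment, assuming the tidyverse packages (tidyr, purrr, dplyr) and ggplot2 are loaded and taking λ0 = 0.5 as suggested:

```r
set.seed(0)
lambda_0 <- 0.5
num_trials_per_sample_size <- 100

poisson_mse_df <- crossing(trial = seq(num_trials_per_sample_size),
                           sample_size = seq(10, 1000, 10)) %>%
  mutate(sample = pmap(.l = list(trial, sample_size),
                       .f = ~ rpois(.y, lambda = lambda_0))) %>%
  mutate(lambda_mle = map_dbl(.x = sample, .f = mean)) %>%  # the MLE is the sample mean
  group_by(sample_size) %>%
  summarise(msq_error = mean((lambda_mle - lambda_0)^2))

ggplot(poisson_mse_df, aes(x = sample_size, y = msq_error)) +
  geom_smooth() +
  labs(x = "Sample size", y = "Mean squared error")
```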
Now that we have explored maximum likelihood estimation with a Poisson distribution for simulated data we shall return to Poisson modelling with real data. Let’s take a look at the famous horse-kick fatality data set explored by Ladislaus Josephovich Bortkiewicz. A csv file containing this data is available within Blackboard.
Download the csv file and load the file into an R data frame. You may wish to use the read.csv() function.
The count data for horse fatalities per year, per cavalry corps are given in the “fatalities” column. Model the values in this column as independent random variables X1, ..., Xn from a Poisson distribution with parameter λ0 and compute the maximum likelihood estimate λ̂_MLE for λ0.
Use your fitted Poisson model to give an estimate for the probability that a single cavalry corps has no fatalities due to horse kicks in a single year. You may want to use the dpois function.
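One way to carry out these two steps, assuming the csv has been downloaded locally (the file name "HorseKicks.csv" below is hypothetical; substitute whatever your download is called):

```r
horse_kick_df <- read.csv("HorseKicks.csv")  # hypothetical file name

# the MLE for a Poisson parameter is the sample mean of the counts
lambda_mle <- mean(horse_kick_df$fatalities)

# fitted probability of zero fatalities for a single corps in a single year
prob_no_fatalities <- dpois(0, lambda = lambda_mle)
```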
As an optional extra, give a formula for the Fisher information I(λ). Next generate a simulation involving random samples of size 1000 from a Poisson random variable with parameter λ0 = 0.5. Give a kernel density plot of

√(n · I(λ0)) · (λ̂_MLE − λ0).
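A sketch of such a simulation, assuming ggplot2 is loaded and using the fact that the Fisher information of the Poisson family is I(λ) = 1/λ:

```r
set.seed(0)
lambda_0 <- 0.5
sample_size <- 1000
num_trials <- 10000

fisher_info <- 1 / lambda_0   # Fisher information I(lambda) = 1/lambda for Poisson

# rescaled estimation errors across many independent samples
scaled_errors <- replicate(num_trials, {
  lambda_mle <- mean(rpois(sample_size, lambda = lambda_0))
  sqrt(sample_size * fisher_info) * (lambda_mle - lambda_0)
})

# the kernel density should resemble a standard Gaussian
ggplot(data.frame(z = scaled_errors), aes(x = z)) + geom_density()
```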
6 Maximum likelihood estimation for the exponential distribution
Recall from Assignment 5 that given a positive real number λ > 0, an exponential random variable X with parameter λ is a continuous random variable with density pλ : R → [0,∞) defined by

pλ(x) := 0 if x < 0, and pλ(x) := λe^{−λx} if x ≥ 0.
Suppose that X1, ···, Xn is an i.i.d. sample from the exponential distribution with an unknown parameter λ0 > 0. What is the maximum likelihood estimate for λ0?
We shall now use the exponential distribution to model the differences in purchase times between customers at a large supermarket. In Blackboard you will find the “CustomerPurchase” csv file.
Download the “CustomerPurchase” csv file and load the file into an R data frame. You may wish to use the read.csv() function. The first column is the purchase time given in seconds since the store opens.
Add a new column in your data frame called “time_diffs” which gives the time in seconds until the next customer’s purchase. That is, letting Y1, Y2, ..., Yn+1 denote the sequence of arrival times in seconds, the time_diffs column contains X1, ..., Xn where Xi = Yi+1 − Yi for each i = 1, ..., n. You may want to use the lead() function.
Model the sequence of differences in purchase times X1,...,Xn as independent and identically distributed exponential random variables. Compute the maximum likelihood estimate of the rate parameter λˆMLE.
Use your fitted exponential model to give an estimate of the probability of an arrival time in excess of one minute. You may wish to make use of the pexp() function.
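A sketch of this whole pipeline, assuming dplyr is loaded; the column name "Time" below is a placeholder for whatever the csv actually calls its first column:

```r
customer_df <- read.csv("CustomerPurchase.csv")  # adjust the path as needed

# time until the next customer's purchase; the final row has no successor, so it is NA
customer_df <- customer_df %>%
  mutate(time_diffs = lead(Time) - Time)   # "Time" is a placeholder column name

# the MLE of the exponential rate parameter is 1 / sample mean
lambda_mle <- 1 / mean(customer_df$time_diffs, na.rm = TRUE)

# probability that the gap to the next purchase exceeds one minute (60 seconds)
prob_exceeds_minute <- 1 - pexp(60, rate = lambda_mle)
```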
7 (**) MLE for the capture and recapture model
As an optional extra we return to the capture and recapture model discussed in lecture 9.
Recall that in this example there are n0 squirrels living on an island. A conservationist captures t squirrels at random before tagging and releasing them. A week later, the conservationist captures k squirrels at random again, and counts how many have already been tagged. For simplicity we assume that the population of n0 squirrels is constant over the time period, and on both occasions the squirrels are selected purely at random.
We let Z be the random variable corresponding to the number of recaptured squirrels.
In lecture 9 we showed that for q ≤ min{t, k} we have

P(Z = q) = C(t, q) · C(n0 − t, k − q) / C(n0, k),

where C(a, b) denotes the binomial coefficient “a choose b”, and P(Z = q) = 0 for q > min{t, k}.
Hence, the likelihood function ℓ : N → [0,1] is given by

ℓ(n) = C(t, Z) · C(n − t, k − Z) / C(n, k),

for all n ∈ N, where C(a, b) denotes the binomial coefficient. Give a formula for the maximum likelihood estimate n̂_MLE of n0.
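A candidate formula can also be checked numerically by maximising the log-likelihood over a grid of population sizes; a base-R sketch with hypothetical values t = 30, k = 40, Z = 7 chosen purely for illustration:

```r
t <- 30; k <- 40; z <- 7   # hypothetical observed values

# log-likelihood of population size n, using lchoose for numerical stability
log_likelihood <- function(n) {
  lchoose(t, z) + lchoose(n - t, k - z) - lchoose(n, k)
}

candidates <- seq(t + k - z, 1000)   # n must be at least t + (k - z)
n_mle <- candidates[which.max(log_likelihood(candidates))]
n_mle   # maximiser of the likelihood over the search grid
```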