1. MSE in terms of bias
For some estimator θ̂ of a parameter θ, show that MSE(θ̂) = bias²(θ̂) + Var(θ̂). Show your steps clearly.
2. Practice with the empirical CDF (eCDF) (Total 5 points)
Using the first 10 samples from the collisions.csv file on the class website, carefully draw the eCDF by hand. Make sure the x‐ and y‐axes clearly indicate the sample points and their corresponding eCDF values. Your plot must have y‐limits from 0 to 1, and x‐limits from the smallest sample to the largest.
3. Programming fun with F̂
For this question, we require some programming; you should only use Python. You may use the scripts provided on the class website as templates. Do not use any libraries or functions to bypass the programming effort. Please submit your code as usual in your zip/tar file repo on BB. Provide sufficient documentation so the code can be evaluated. Also attach each plot as a separate sheet (or image) to your submission upload. All plots must be neat, legible (large fonts), with appropriate legends, axis labels, titles, etc.
(a) Write a program to plot F̂ (the empirical CDF, or eCDF) given a list of samples as input. Your plot must have y‐limits from 0 to 1, and x‐limits from 0 to the largest sample. Show the input points as crosses on the x‐axis. (2 points)
(b) Use an integer random number generator with range [1, 99] to draw n = 10, 100, and 1000 samples. Feed these as input to (a) to draw three plots. What do you observe?
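A minimal sketch of how parts (a) and (b) might be structured is shown below. The function names (`ecdf`, `plot_ecdf`) and the seed are illustrative choices, not requirements of the assignment:

```python
import numpy as np

def ecdf(samples):
    """Return sorted sample values and eCDF heights F_hat(x) = (# samples <= x) / n."""
    xs = np.sort(np.asarray(samples, dtype=float))
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

def plot_ecdf(samples, fname="ecdf.png"):
    """Part (a): step plot of the eCDF, samples shown as crosses on the x-axis."""
    import matplotlib
    matplotlib.use("Agg")  # render to file; no display needed
    import matplotlib.pyplot as plt
    xs, ys = ecdf(samples)
    plt.figure()
    plt.step(xs, ys, where="post", label="eCDF")
    plt.plot(xs, np.zeros_like(xs), "x", label="samples")  # crosses at y = 0
    plt.xlim(0, xs.max())
    plt.ylim(0, 1)
    plt.xlabel("x"); plt.ylabel("F_hat(x)"); plt.title("eCDF"); plt.legend()
    plt.savefig(fname)

if __name__ == "__main__":
    # Part (b): integer samples in [1, 99] (high endpoint of integers() is exclusive)
    rng = np.random.default_rng(0)
    for n in (10, 100, 1000):
        plot_ecdf(rng.integers(1, 100, size=n), fname=f"ecdf_n{n}.png")
```

Note that `where="post"` gives the right-continuous step shape of a CDF: the jump at each sample point takes effect at that point.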
(c) Modify (a) above so that it takes as input a collection of lists of samples; that is, a 2‐D array of sorts where each row is a list of samples (as in (a)). The program should now compute the average F̂ across the rows and plot it. That is, for a given point x, first compute F̂(x) for each row (student), then average these values across rows, and plot the average as the value at x. Repeat for all input points x. Show all input points as crosses on the x‐axis.
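The averaging step in (c) can be sketched as follows; `avg_ecdf` and `grid` are illustrative names for the per-x computation described above:

```python
import numpy as np

def avg_ecdf(rows, grid):
    """For each x in grid, compute F_hat_row(x) = (# samples in row <= x) / len(row)
    for every row, then average those values across rows."""
    rows = [np.asarray(r, dtype=float) for r in rows]
    averaged = []
    for x in grid:
        per_row = [np.mean(r <= x) for r in rows]  # one eCDF value per row
        averaged.append(np.mean(per_row))
    return np.array(averaged)
```

A natural choice of `grid` is the union of all sample points across rows, so every input point appears on the plot.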
(d) Use the same integer random number generator from (b) to draw n = 10 samples for m = 10, 100, and 1000 rows. Feed these as input to (c) to draw three plots. What do you observe?
(e) Modify the program from (a) to now also add 95% Normal‐based CI lines for F̂, given a list of samples as input. Draw a plot showing F̂ and the CI lines for the a3_q3.dat data file (799 samples) on the class website. Use x‐limits of 0 to 2, and y‐limits of 0 to 1. (2 points)
(f) Modify the program from (e) to also add 95% DKW‐based CI lines for F̂. Draw a single plot showing F̂ and both sets of CI lines (Normal and DKW) for the a3_q3.dat data. Which CI is tighter?
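The two confidence bands in (e) and (f) could be computed as below. The Normal-based band uses F̂(x) ± z·se, with se = √(F̂(x)(1 − F̂(x))/n); the DKW band uses the constant half-width ε = √(ln(2/α)/(2n)). The function name `ecdf_with_cis` is illustrative:

```python
import numpy as np

def ecdf_with_cis(samples, grid, alpha=0.05):
    """Return F_hat on grid plus (lo, hi) Normal-based and DKW-based 95% bands."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    F = np.array([np.mean(x <= g) for g in grid])

    z = 1.96  # standard Normal quantile for a two-sided 95% interval
    se = np.sqrt(F * (1 - F) / n)
    normal_lo = np.clip(F - z * se, 0, 1)
    normal_hi = np.clip(F + z * se, 0, 1)

    eps = np.sqrt(np.log(2 / alpha) / (2 * n))  # DKW inequality half-width
    dkw_lo = np.clip(F - eps, 0, 1)
    dkw_hi = np.clip(F + eps, 0, 1)
    return F, (normal_lo, normal_hi), (dkw_lo, dkw_hi)
```

Note the qualitative difference: the Normal band's width varies with F̂(x) and pinches to zero where F̂ is 0 or 1, while the DKW band has constant width everywhere, which is what part (f) asks you to compare.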
4. Plug‐in estimates
(a) Show that the plug‐in estimator of the variance of X is σ̂² = (1/n) Σᵢ (Xᵢ − X̄)², where X̄ is the sample mean, X̄ = (1/n) Σᵢ Xᵢ.
(b) Show that the bias of σ̂² is −σ²/n, where σ² is the true variance. (3 points)
(c) The kurtosis for a RV X with mean μ and variance σ² is defined as Kurt(X) = E[(X − μ)⁴] / σ⁴.
Derive the plug‐in estimate of the kurtosis in terms of the sample data.
(d) The plug‐in estimator idea also extends to two RVs. Consider ρ = (E[XY] − E[X]E[Y]) / (σ_X σ_Y), where σ_X is the standard deviation of RV X. Assuming n i.i.d. observations for X and Y that appear in pairs as {(X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)}, derive the plug‐in estimator for ρ. (Hint: What is the ePMF for the event X = X₁ AND Y = Y₁? What about for the event X = X₁ AND Y = Y₂?)
5. Consistency of eCDF
Let D = {X₁, X₂, …, Xₙ} be a set of i.i.d. samples with true CDF F. Let F̂ be the eCDF for D, as defined in class.
(a) Derive E(F̂) in terms of F. Start by writing the expression for F̂ at some point α.
(b) Show that bias(F̂) = 0. (2 points)
(c) Derive se(F̂) in terms of F and n. (3 points)
(d) Show that F̂ is a consistent estimator.
6. Properties of estimators (Total 10 points)
(a) Find the bias, se, and MSE in terms of θ for θ̂ = (1/n) Σᵢ Xᵢ, where the Xᵢ are i.i.d. ~ Bernoulli(θ). Hint: Follow the same steps as in class, assuming the true distribution is unknown. Only at the end use the fact that the unknown distribution is Bernoulli(θ) to get the final answers in terms of θ. (5 points)
(b) Derive the Normal‐based (1 − α) CI for θ̂. Explain why Normal‐based CIs are applicable here.
7. Kernel density estimation
This question asks you to implement a kernel density estimator (KDE) from scratch and evaluate it on a sample dataset, a3_q7.csv. Do not use inbuilt KDE functions; you may, however, use inbuilt pdf functions to evaluate a pdf at a point. The formal definition of the KDE, which estimates the pdf, is:
f̂(x) = (1/(nh)) Σᵢ K((x − Xᵢ)/h)    (1)
where K(·) is called the kernel function, which should be a smooth, symmetric, and valid density function. The parameter h > 0 is called the smoothing bandwidth, which controls the amount of smoothing.
(a) For the a3_q7.csv dataset, the true distribution is Normal(0.5, 0.01) (the mean μ is 0.5, the variance σ² is 0.01). The task here is to implement a KDE function using the Normal distribution as the kernel, normal_kde(x,h,D) in Python, where x is the point at which the pdf is to be estimated, h is the bandwidth, and D is the list of data points. Implement the function as normal_kde.py by first computing K((x − xᵢ)/h) for all data points xᵢ in the given dataset, where K(u) is the pdf of the standard Normal at the point u = (x − xᵢ)/h, and then summing up all K(·) values and dividing by nh, where n is the number of data points, as in Equation (1) above. Submit your code. (3 points)
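The steps described in (a) map directly onto Equation (1); a minimal sketch of normal_kde (the function name and signature come from the assignment) might look like:

```python
import math

def normal_kde(x, h, D):
    """Estimate the pdf at x per Equation (1), using the standard Normal pdf as K."""
    def K(u):
        # standard Normal pdf evaluated at u
        return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    n = len(D)
    # sum K((x - xi)/h) over all data points, then divide by n*h
    return sum(K((x - xi) / h) for xi in D) / (n * h)
```

A quick sanity check: with a single data point at 0 and h = 1, the estimate at x = 0 should equal the standard Normal pdf at 0, i.e. 1/√(2π) ≈ 0.3989.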
(b) Obtain the pdf estimates for x = {0, 0.01, 0.02, …, 1} and compute the sample mean and sample variance (use the result of Q4(a) as needed). Report the deviation of the estimates from the original distribution (Normal(0.5, 0.01)), as a percentage difference with respect to the true mean or variance, in each of the 5 bandwidth cases. Show on a single plot the pdf of the original Normal and the KDE estimates of the pdf for all 5 bandwidths. Include this plot in your submission. Which of the h values performs best?
(c) Repeat (a) and (b) above using the uniform kernel (implement as uniform_kde(x,h,D) in uniform_kde.py), with K(u) = 1/2 for −1 ≤ u ≤ 1 and K(u) = 0 otherwise, where u = (x − xᵢ)/h; and the triangular kernel (implement as triangular_kde(x,h,D) in triangular_kde.py), with K(u) = 1 − |u| for |u| ≤ 1 and K(u) = 0 otherwise, where u = (x − xᵢ)/h. Repeat all parts of (b) for these two kernels for all 5 bandwidth values: report the percentage deviation from the original mean and variance, plot the KDE estimates, and report the best bandwidth for each kernel choice.
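The two additional kernels in (c) differ from (a) only in the choice of K(u); a sketch, again using the function names the assignment specifies:

```python
def uniform_kde(x, h, D):
    """KDE per Equation (1) with the uniform kernel K(u) = 1/2 on [-1, 1]."""
    def K(u):
        return 0.5 if -1.0 <= u <= 1.0 else 0.0
    return sum(K((x - xi) / h) for xi in D) / (len(D) * h)

def triangular_kde(x, h, D):
    """KDE per Equation (1) with the triangular kernel K(u) = 1 - |u| for |u| <= 1."""
    def K(u):
        return 1.0 - abs(u) if abs(u) <= 1.0 else 0.0
    return sum(K((x - xi) / h) for xi in D) / (len(D) * h)
```

Both kernels integrate to 1 and are symmetric, so they are valid densities as required; unlike the Normal kernel they have bounded support, which makes the resulting estimates less smooth for small h.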