CSDS313 - Assignment 2: Data and Distributions

Problem 1
The purpose of this exercise is to investigate how different distributions can have similar statistics and/or visualizations. Suppose you are given a normal distribution N(µ,σ). We would like to find a uniform distribution U(a,b) (i.e., the range of the distribution is [a,b]) whose statistics match those of the given normal distribution. These statistics are specified as follows:

(i)     Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the mean and standard deviation of the uniform distribution are the same as those of the given normal distribution.

(ii)   Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the 25th and 75th percentile points of the uniform distribution and the given normal distribution are the same. Assume you can compute the inverse cumulative distribution function Φ−1(p,µ,σ) of a normal distribution N(µ,σ) for any 0 ≤ p ≤ 1. See probit function for more information. Hint: You should estimate the parameters a and b of the uniform distribution by simply using Φ−1(p,µ,σ).

For parts (i) and (ii) separately, obtain a uniform distribution U(a,b) as a function of µ and σ, i.e., find a = fa(µ,σ) and b = fb(µ,σ). Then, estimate the parameters of the uniform distributions U1(a1,b1) and U2(a2,b2) corresponding to parts (i) and (ii) for the normal distribution N(µ = 2,σ = 5). Simulate 10000 data points from each of the U1(a1,b1), U2(a2,b2) and N(2,5) distributions separately. Visualize the 3 simulated distributions using histograms, error bars, and boxplots. Compare and comment on how the obtained uniform distributions are similar or dissimilar to the given normal distribution. Also, compare and comment on how they are similar or dissimilar to each other.

Note that you can compute the probit function Φ−1(p,µ,σ) as follows:

MATLAB: norminv function.

Python: norm.ppf function in scipy.stats package.

R: qnorm function.
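
A minimal sketch of how both parameter pairs could be computed and simulated for N(µ = 2, σ = 5), assuming σ denotes the standard deviation. It relies on the fact that U(a,b) has mean (a + b)/2, standard deviation (b − a)/√12, and quartiles a + (b − a)/4 and a + 3(b − a)/4. The helper names are illustrative, the seed is arbitrary, and statistics.NormalDist.inv_cdf from the Python standard library stands in for norm.ppf.

```python
import math
import random
from statistics import NormalDist

def uniform_from_moments(mu, sigma):
    """Part (i): match mean (a+b)/2 = mu and std (b-a)/sqrt(12) = sigma."""
    half_width = math.sqrt(3) * sigma
    return mu - half_width, mu + half_width

def uniform_from_quartiles(mu, sigma):
    """Part (ii): match the 25th and 75th percentiles of N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    q1, q3 = dist.inv_cdf(0.25), dist.inv_cdf(0.75)
    # Uniform quartiles: q1 = a + (b-a)/4 and q3 = a + 3(b-a)/4,
    # so the width is b - a = 2 * (q3 - q1).
    width = 2 * (q3 - q1)
    a = q1 - width / 4
    return a, a + width

mu, sigma, n = 2, 5, 10_000
a1, b1 = uniform_from_moments(mu, sigma)
a2, b2 = uniform_from_quartiles(mu, sigma)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
u1 = [rng.uniform(a1, b1) for _ in range(n)]
u2 = [rng.uniform(a2, b2) for _ in range(n)]
normal = [rng.gauss(mu, sigma) for _ in range(n)]

print(f"U1: a1={a1:.3f}, b1={b1:.3f}")
print(f"U2: a2={a2:.3f}, b2={b2:.3f}")
```

The three lists u1, u2 and normal can then be passed to any plotting library to draw the histograms, error bars, and boxplots.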

Problem 2
For this exercise, we will use two datasets that are provided with the assignment:

The file “airport routes.csv” contains the number of available routes of 3409 airports all around the world (as of February 2017). Each row indicates an airport (identified with a 3-letter code) and the number of routes. For example, “CLE, 81” indicates that Cleveland Hopkins International Airport has outgoing flights to 81 different airports. See data source for more information.

The file “movie votes.csv” contains the average rating (between 1 and 10) of 4392 movies in the TMDb database, sorted in descending order. Each row contains a movie name and the average TMDb vote of that movie. For example, “The Godfather, 8.4”, “Interstellar, 8.1”, etc. See data source for more information.

For each of these datasets, consider the following models:

(a)    Suppose the given data points follow a power law distribution. Estimate the corresponding α parameter. You can use the maximum likelihood estimation in Newman’s notes on power-law.

(b)   Suppose the given data points follow an exponential distribution. Estimate the corresponding λ parameter.

(c)    Suppose the given data points follow a uniform distribution. Estimate the corresponding range parameters [a, b] of the uniform distribution.

(d)   Suppose the given data points follow a normal distribution. Estimate the corresponding µ and σ parameters.
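
The four estimates in (a) to (d) can be sketched with maximum likelihood formulas as below. This is one possible approach, not a prescribed solution: it assumes the continuous power-law estimator from Newman's notes, α = 1 + n / Σ ln(xi / xmin), with xmin defaulting to the smallest observation, and it assumes all data values are positive. The function names and the toy list are hypothetical.

```python
import math

def fit_power_law(data, x_min=None):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min))."""
    x_min = min(data) if x_min is None else x_min
    tail = [x for x in data if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1 + len(tail) / log_sum  # undefined if every value equals x_min

def fit_exponential(data):
    """Exponential MLE: lambda is the reciprocal of the sample mean."""
    return len(data) / sum(data)

def fit_uniform(data):
    """Uniform MLE: the range is [min, max] of the observations."""
    return min(data), max(data)

def fit_normal(data):
    """Normal MLE: sample mean and the (biased, 1/n) standard deviation."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

# Made-up values for illustration only, not taken from either dataset.
toy = [1.0, 2.0, 3.0, 4.0, 10.0]
print("alpha :", fit_power_law(toy))
print("lambda:", fit_exponential(toy))
print("[a, b]:", fit_uniform(toy))
print("mu, sigma:", fit_normal(toy))
```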

For each of these datasets separately, compare the models you estimated in parts (a) to (d). Which distribution do you think the data follows and why? Explain. For each model, generate random data samples drawn from the respective distribution. Use visualizations of the empirical data and the data you generate to support your conclusions.
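
Drawing samples from each fitted model could look like the sketch below. The power-law sampler assumes a continuous power law with α > 1 and uses inverse-transform sampling, x = xmin · (1 − u)^(−1/(α − 1)) for u uniform on [0, 1). The parameter values are placeholders for illustration, not estimates from either dataset, and the function names are hypothetical.

```python
import random

def sample_power_law(alpha, x_min, size, rng):
    """Inverse-transform sampling for a continuous power law with alpha > 1."""
    exponent = -1 / (alpha - 1)
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and never hits zero.
    return [x_min * (1 - rng.random()) ** exponent for _ in range(size)]

def sample_exponential(lam, size, rng):
    return [rng.expovariate(lam) for _ in range(size)]

def sample_uniform(a, b, size, rng):
    return [rng.uniform(a, b) for _ in range(size)]

def sample_normal(mu, sigma, size, rng):
    return [rng.gauss(mu, sigma) for _ in range(size)]

rng = random.Random(0)  # arbitrary seed for reproducibility
size = 1000
# Placeholder parameters; substitute the values estimated from the data.
samples = {
    "power law": sample_power_law(alpha=2.5, x_min=1.0, size=size, rng=rng),
    "exponential": sample_exponential(lam=0.1, size=size, rng=rng),
    "uniform": sample_uniform(a=1.0, b=10.0, size=size, rng=rng),
    "normal": sample_normal(mu=6.0, sigma=1.0, size=size, rng=rng),
}
for name, values in samples.items():
    print(f"{name}: min={min(values):.2f}, max={max(values):.2f}")
```

Plotting each generated sample next to the empirical data, for instance as histograms on linear and log-log axes, makes it easier to see which model tracks the data.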

Problem 3
Recall the rocket problem from exercise 3: You are working as chief data scientist at a rocket production company. You know that your company’s competitor is assigning integer IDs to their rockets. In other words, if the competitor produced M rockets, there is a rocket with ID i for all 1 ≤ i ≤ M. Your company’s intelligence was able to collect the IDs of n rockets produced by the competitor, and these IDs are 1 ≤ x1 ≤ x2 ≤...≤ xn. You can assume that the IDs collected by the intelligence represent a uniform sampling of the M IDs.

(i)     What is the maximum likelihood estimator for M? Simulate the rockets and intelligence reports to show whether the maximum likelihood estimator is an unbiased estimator. (Hint: make sure to choose a large M and a large number of trials for your simulation.)
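
One way to set up such a simulation is sketched below. Because the likelihood of the observed IDs decreases in M for every M ≥ xn, the maximum likelihood estimate is the largest observed ID, xn. The sketch assumes the IDs are drawn without replacement; the values of M, n, the trial count, and the seed are arbitrary choices.

```python
import random

def simulate_mle(M, n, trials, rng):
    """Average the maximum likelihood estimate (the sample maximum) over many trials."""
    total = 0
    for _ in range(trials):
        ids = rng.sample(range(1, M + 1), n)  # assumes sampling without replacement
        total += max(ids)
    return total / trials

rng = random.Random(0)  # arbitrary seed for reproducibility
M, n, trials = 10_000, 20, 20_000
mean_mle = simulate_mle(M, n, trials, rng)

# For sampling without replacement, E[max] = n * (M + 1) / (n + 1).
expected_max = n * (M + 1) / (n + 1)
print(f"true M            : {M}")
print(f"average MLE       : {mean_mle:.1f}")
print(f"theoretical E[max]: {expected_max:.1f}")
```

If the average estimate settles below the true M as the number of trials grows, the estimator is biased downward.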

(ii)   Let MˆMVU = xn(1 + 1/n) − 1. Let MˆMEAN = (2/n)(x1 + x2 + ... + xn) − 1. Simulate the rockets and intelligence reports to show which of the above unbiased estimators (MˆMVU or MˆMEAN) has the lower variance.
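
A minimal sketch of the variance comparison follows. It assumes the standard forms of the two estimators, MˆMVU = xn(1 + 1/n) − 1 and MˆMEAN = 2 · (sample mean) − 1, and sampling without replacement. The function names, M, n, the trial count, and the seed are illustrative choices.

```python
import random
from statistics import fmean, pvariance

def mvu_estimate(ids):
    """Assumed form: largest ID scaled by (1 + 1/n), minus 1."""
    n = len(ids)
    return max(ids) * (1 + 1 / n) - 1

def mean_estimate(ids):
    """Assumed form: twice the sample mean, minus 1."""
    return 2 * sum(ids) / len(ids) - 1

rng = random.Random(1)  # arbitrary seed for reproducibility
M, n, trials = 10_000, 20, 20_000
mvu_values, mean_values = [], []
for _ in range(trials):
    ids = rng.sample(range(1, M + 1), n)  # assumes sampling without replacement
    mvu_values.append(mvu_estimate(ids))
    mean_values.append(mean_estimate(ids))

avg_mvu, avg_mean = fmean(mvu_values), fmean(mean_values)
var_mvu, var_mean = pvariance(mvu_values), pvariance(mean_values)
print(f"MVU : average={avg_mvu:.1f}, variance={var_mvu:.0f}")
print(f"MEAN: average={avg_mean:.1f}, variance={var_mean:.0f}")
```

Both averages should land close to the true M, so the comparison comes down to the two empirical variances.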
