
CSE512 -  Machine Learning - Homework 1 - Solved

The last two questions require programming. The maximum score is 100 points. For this homework, you should review some material on probability and linear regression.

1 Question 1 – Probability (30 points)
Let X1 and X2 be independent continuous random variables uniformly distributed from 0 to 1. Let X = max(X1,2X2). Compute:

1.    The expectation E(X)

2.    The variance Var(X)

3.    The covariance: Cov(X,X1).
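A closed-form answer to the three parts can be sanity-checked numerically. The following is a minimal Monte Carlo sketch, not a derivation; the function name simulate_moments, the sample size, and the seed are illustrative choices that are not part of the assignment.

```python
import numpy as np


def simulate_moments(num_samples=1_000_000, seed=0):
    """Estimate E(X), Var(X) and Cov(X, X1) for X = max(X1, 2*X2) by sampling."""
    rng = np.random.default_rng(seed)
    # X1 and X2 are independent and uniform on [0, 1].
    x1 = rng.uniform(0.0, 1.0, num_samples)
    x2 = rng.uniform(0.0, 1.0, num_samples)
    x = np.maximum(x1, 2.0 * x2)

    mean_x = x.mean()
    var_x = x.var()  # population variance of the samples
    # Sample covariance: mean of the product of centered variables.
    cov_x_x1 = np.mean((x - mean_x) * (x1 - x1.mean()))
    return mean_x, var_x, cov_x_x1
```

Calling simulate_moments() returns three estimates that should agree with the analytic results up to sampling noise, which shrinks roughly as one over the square root of the sample size.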

2 Question 2 – Feature examination (25 points)
Analyzing the range and distribution of feature values is important in machine learning. In this question, you are asked to write a function that approximates the distribution of each individual feature by a univariate Gaussian distribution.

2.1 Question 2.1 (10 points)

Write a Python file hw1.py that contains a function with the following signature:

[mu0,var0,mu1,var1] = get_mean_and_variance(X,y)

where

Inputs:

•   X: a two dimensional Numpy array of size n × d, where n is the number of data points, and d the dimension of the feature vectors.

•   y: a Numpy vector of length n. y[i] is a binary label corresponding to the data point X[i,:].

Outputs:

•   mu0: a Numpy vector of length d, mu0[j] is the mean of X[i,j] for all i where y[i] = 0. Basically, mu0[j] is the mean of the jth feature for all the negative data points.

•   var0: a Numpy vector of length d, var0[j] is the variance of X[i,j] for all i where y[i] = 0.

•   mu1: a Numpy vector of length d, mu1[j] is the mean of X[i,j] for all i where y[i] = 1.

•   var1: a Numpy vector of length d, var1[j] is the variance of X[i,j] for all i where y[i] = 1.
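One possible shape for this function is sketched below. It assumes that "variance" means the population variance (NumPy's default, ddof=0); the assignment does not say which estimator is expected, so that choice is an assumption.

```python
import numpy as np


def get_mean_and_variance(X, y):
    """Per-feature mean and variance of X, computed separately for y == 0 and y == 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    negatives = X[y == 0]  # rows with label 0
    positives = X[y == 1]  # rows with label 1

    mu0 = negatives.mean(axis=0)
    var0 = negatives.var(axis=0)  # ddof=0: population variance (assumption)
    mu1 = positives.mean(axis=0)
    var1 = positives.var(axis=0)
    return mu0, var0, mu1, var1
```

Boolean masking keeps the code free of explicit loops, and axis=0 reduces over data points so that each output has length d.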

2.2 Question 2.2 (15 points)

For this question, you will use the data provided in the file covid19_metadata.csv. This is a subset of the COVID-19 image data collection (github.com/ieee8023/covid-chestxray-dataset).

Inspect your data carefully: each row corresponds to a patient who was suspected positive for COVID-19. The first column corresponds to their age (continuous value) and the second one to their gender (F for female or M for male). The last column indicates whether they recovered (Y) or not (N).

The first step is to load the data and pre-process it. (Tip: to load the data from the provided csv file, you can use numpy.genfromtxt or csv.reader.) In this case, the feature matrix X should include the first two columns, namely the age and gender of the patients. You will need to convert the gender to integer values (e.g. 1 for female and 0 for male). The label vector y is the last column. You will need to convert it to integer values, namely 1 for Y (survived) and 0 for N (not survived).

Then, run the function from Question 2.1 on this data.
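The loading and encoding step could look roughly like the sketch below, which uses csv.reader. The helper name load_metadata is hypothetical, and the handling of a header row and of incomplete rows (both are simply skipped) is an assumption, since the exact layout of the file is not described beyond the column order.

```python
import csv
import io
import math

import numpy as np


def load_metadata(csv_file):
    """Read (age, gender, ..., label) rows from an open text file into X and y."""
    features, labels = [], []
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue
        try:
            age = float(row[0])
        except ValueError:
            continue  # header row or missing age (assumption: skip it)
        gender = row[1].strip().upper()
        label = row[-1].strip().upper()
        if math.isnan(age) or gender not in ("F", "M") or label not in ("Y", "N"):
            continue  # incomplete row (assumption: skip it)
        features.append([age, 1.0 if gender == "F" else 0.0])  # 1 = female, 0 = male
        labels.append(1 if label == "Y" else 0)                # 1 = Y, 0 = N
    return np.array(features), np.array(labels)


# Made-up rows, only to show the expected shapes; not taken from the real file.
demo = io.StringIO("age,gender,survival\n65,M,Y\n40,F,N\n")
X_demo, y_demo = load_metadata(demo)
```

With the real file, the same function would be called as load_metadata(open("covid19_metadata.csv", newline="")), and the result passed on to the function from Question 2.1.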

(a)    (5 points) Report the values of mu0, var0, mu1, var1.

(b)    (5 points) For each feature j, plot the Gaussian distribution with mean mu0[j] and variance var0[j] in black. On the same graph, plot the Gaussian distribution with mean mu1[j] and variance var1[j] in blue. You can use the Python packages matplotlib and scipy.

(c)     (5 points) Is it a good idea to approximate gender by a Gaussian distribution? Why or why not?
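For part (b), one way to draw the two curves per feature is sketched here. The density is written out with NumPy rather than scipy.stats.norm.pdf to keep the example small; the helper names gaussian_pdf and plot_class_gaussians, the plotting range of four standard deviations, and the assumption that all variances are strictly positive are choices made for this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np


def gaussian_pdf(values, mean, variance):
    """Density of a univariate Gaussian with the given mean and (positive) variance."""
    return np.exp(-(values - mean) ** 2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)


def plot_class_gaussians(mu0, var0, mu1, var1, feature_names):
    """Draw one panel per feature: class 0 in black, class 1 in blue."""
    num_features = len(feature_names)
    fig, axes = plt.subplots(1, num_features, figsize=(5 * num_features, 4), squeeze=False)
    for j, name in enumerate(feature_names):
        ax = axes[0, j]
        # Cover both curves: four standard deviations around the two means.
        spread = 4.0 * np.sqrt(max(var0[j], var1[j]))
        grid = np.linspace(min(mu0[j], mu1[j]) - spread, max(mu0[j], mu1[j]) + spread, 400)
        ax.plot(grid, gaussian_pdf(grid, mu0[j], var0[j]), color="black", label="y = 0")
        ax.plot(grid, gaussian_pdf(grid, mu1[j], var1[j]), color="blue", label="y = 1")
        ax.set_xlabel(name)
        ax.set_ylabel("density")
        ax.legend()
    fig.tight_layout()
    return fig
```

A call such as plot_class_gaussians(mu0, var0, mu1, var1, ["age", "gender"]) followed by plt.show() would display both panels side by side.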

3 Question 3 – Linear Regression (45 points)
In this question, you will use Linear Regression to forecast the number of COVID-19 deaths for the next day based on the total numbers of cases and deaths in the last seven days. Suppose you have collected the data for the past n days (x1,y1),(x2,y2),...,(xn,yn), where xi and yi are the aggregated numbers of cases and deaths for day i, respectively. You hypothesize that the number of deaths on day t depends linearly on the data from the previous seven days, and you derive a model that predicts the number of deaths for day t as:

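$$\hat{y}_t = b + \sum_{i=1}^{7} w_i\, x_{t-i} + \sum_{i=1}^{7} w_{7+i}\, y_{t-i}, \quad \text{for } 8 \le t \le n. \tag{1}$$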

3.1 Question 3.1 (25 points)

Write a Python file hw1.py that contains a function with the following signature:

[w,b] = learn_reg_params(x,y)

where

Inputs:

•   x: a Numpy vector of length n, where n is the number of days

•   y: a Numpy vector of length n

Outputs:

•   w: a Numpy vector of length 14

•   b: a scalar value for the intercept of Linear Regression model.

Tip: you can use sklearn.linear_model.LinearRegression from the scikit-learn library.
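A rough sketch of this function follows. It solves the ordinary least-squares problem with numpy.linalg.lstsq as a stand-in for sklearn.linear_model.LinearRegression, which fits the same kind of model. The ordering of the 14 weights is an assumption: here the first seven multiply the case counts of days t-1, ..., t-7 and the last seven multiply the death counts of the same days. The helper build_lagged_features is hypothetical.

```python
import numpy as np


def build_lagged_features(x, y, lags=7):
    """Stack, for each day t, the previous `lags` values of x and of y (most recent first)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rows = [
        np.concatenate((x[t - lags:t][::-1], y[t - lags:t][::-1]))
        for t in range(lags, len(y))  # 0-based t, i.e. days 8..n in 1-based terms
    ]
    return np.array(rows), y[lags:]


def learn_reg_params(x, y):
    """Least-squares fit of a 14-weight linear model with an intercept."""
    features, targets = build_lagged_features(x, y)
    # Append a column of ones so the intercept is estimated with the weights.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    w, b = solution[:-1], solution[-1]
    return w, b
```

If scikit-learn is preferred, the same feature matrix can be passed to LinearRegression().fit(features, targets), and coef_ and intercept_ then play the roles of w and b.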

3.2 Question 3.2 (20 points)

The file covid19_time_series.csv contains the dataset that you will use. The first row is the csv header. The second row contains the aggregated counts of the confirmed positive cases in the US. The third row contains the aggregated counts of deaths in the US. These data are a subset of “JHU CSSE COVID-19 Data” (https://github.com/CSSEGISandData/COVID-19). The first step is to load the data into x and y. Use the function that you wrote in Question 3.1 to calculate the parameters w and b of the Linear Regression model.

(a)    (5 points) Report the learned parameters: the weights w and the intercept term b.

(b)    (5 points) Visualize the actual and predicted death values y_t and ŷ_t (for 8 ≤ t ≤ n). Display y_t as a function of t and ŷ_t as a function of t on the same graph. You can use the library matplotlib.pyplot to plot.

(c)     (5 points) Use a Gaussian to approximate the distribution of the errors y_t − ŷ_t (for 8 ≤ t ≤ n). Report the mean and variance of this Gaussian.

(d)    (5 points) Use matplotlib.pyplot.hist to plot the distribution of y_t − ŷ_t (for 8 ≤ t ≤ n). On the same plot, plot the Gaussian function that approximates this distribution. Is a Gaussian a good approximation for the distribution of the errors?
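The predictions and the Gaussian fit of the errors used in parts (b) to (d) might be computed along the lines below. The helper names predict_deaths and fit_error_gaussian are hypothetical, the weight ordering (cases first, then deaths, most recent day first) is an assumption, and the variance is the population variance (ddof=0), which is also an assumption.

```python
import numpy as np


def predict_deaths(x, y, w, b, lags=7):
    """Predicted deaths for every day that has `lags` previous days of data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    predictions = []
    for t in range(lags, len(y)):
        recent_cases = x[t - lags:t][::-1]   # x_{t-1}, ..., x_{t-7}
        recent_deaths = y[t - lags:t][::-1]  # y_{t-1}, ..., y_{t-7}
        predictions.append(b + w[:lags] @ recent_cases + w[lags:] @ recent_deaths)
    return np.array(predictions)


def fit_error_gaussian(y_true, y_pred):
    """Mean and variance of the errors y_true - y_pred (maximum-likelihood Gaussian fit)."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return errors.mean(), errors.var()
```

The actual values to compare against are y[7:] in 0-based indexing, so fit_error_gaussian(y[7:], predict_deaths(x, y, w, b)) gives the two numbers asked for in part (c), and the same error vector can be passed to matplotlib.pyplot.hist with density=True for part (d).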
