$25
You will need to work with the three datasets attached to this assignment:
• poverty.csv
• poverty_2.csv
• real_estate.csv
1 Problem 1: Univariate Linear Regression
1.1 1) import the libraries you will need:
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
1.2 2) Import the date poverty.csv dataset
<IPython.core.display.HTML object>
Saving real_estate.csv to real_estate (4).csv
1.3 3) Print the dataset indexed upon the location column.
[ ]: PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
Location
Alabama 20.1 31.5 88.7 11.2 54.5
Alaska 7.1 18.9 73.7 9.1 39.5
Arizona 16.1 35.0 102.5 10.4 61.2
Arkansas 14.9 31.6 101.7 10.4 59.9
California 16.7 22.6 69.1 11.2 41.1
Colorado 8.8 26.2 79.1 5.8 47.0
Connecticut 9.7 14.1 45.1 4.6 25.8
Delaware 10.3 24.7 77.8 3.5 46.3
District_of_Columbia 22.0 44.8 101.5 65.0 69.1
Florida 16.2 23.2 78.4 7.3 44.5
Georgia 12.1 31.4 92.8 9.5 55.7
Hawaii 10.3 17.7 66.4 4.7 38.2
Idaho 14.5 18.4 69.1 4.1 39.1
Illinois 12.4 23.4 70.5 10.3 42.2
Indiana 9.6 22.6 78.5 8.0 44.6
Iowa 12.2 16.4 55.4 1.8 32.5
Kansas 10.8 21.4 74.2 6.2 43.0
Kentucky 14.7 26.5 84.8 7.2 51.0
Louisiana 19.7 31.7 96.1 17.0 58.1
Maine 11.2 11.9 45.2 2.0 25.4
Maryland 10.1 20.0 59.6 11.8 35.4
Massachusetts 11.0 12.5 39.6 3.6 23.3
Michigan 12.2 18.0 60.8 8.5 34.8
Minnesota 9.2 14.2 47.3 3.9 27.5
Mississippi 23.5 37.6 103.3 12.9 64.7
Missouri 9.4 22.2 76.6 8.8 44.1
Montana 15.3 17.8 63.3 3.0 36.4
Nebraska 9.6 18.3 64.2 2.9 37.0
Nevada 11.1 28.0 96.7 10.7 53.9
New_Hampshire 5.3 8.1 39.0 1.8 20.0
New_Jersey 7.8 14.7 46.1 5.1 26.8
New_Mexico 25.3 37.8 99.5 8.8 62.4
New_York 16.5 15.7 50.1 8.5 29.5
North_Carolina 12.6 28.6 89.3 9.4 52.2
North_Dakota 12.0 11.7 48.7 0.9 27.2
Ohio 11.5 20.1 69.4 5.4 39.5
Oklahoma 17.1 30.1 97.6 12.2 58.0
Oregon 11.2 18.2 64.8 4.1 36.8
Pennsylvania 12.2 17.2 53.7 6.3 31.6
Rhode_Island 10.6 19.6 59.0 3.3 35.6
South_Carolina 19.9 29.2 87.2 7.9 53.0
South_Dakota 14.5 17.3 67.8 1.8 38.0
Tennessee 15.5 28.2 94.2 10.6 54.3
Texas 17.4 38.2 104.3 9.0 64.4
Utah 8.4 17.8 62.4 3.9 36.8
Vermont 10.3 10.4 44.4 2.2 24.2
Virginia 10.2 19.0 66.0 7.6 37.6
Washington 12.5 16.8 57.6 5.1 33.0
West_Virginia 16.7 21.5 80.7 4.9 45.5
Wisconsin 8.5 15.9 57.1 4.3 32.3
Wyoming 12.2 17.7 72.1 2.1 39.9
1.4 4) Get useful descriptive statistial data on the dataset.
Hint: this is a single line, data._____
[ ]: poverty.describe()
[ ]: PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth
count 51.000000 51.000000 51.000000 51.000000 51.000000
mean 13.117647 22.282353 72.019608 7.854902 42.243137
std 4.277228 8.043499 18.975563 8.914131 12.318511
min 5.300000 8.100000 39.000000 0.900000 20.000000
25% 10.250000 17.250000 58.300000 3.900000 33.900000
50% 12.200000 20.000000 69.400000 6.300000 39.500000
75% 15.800000 28.100000 87.950000 9.450000 52.600000
max 25.300000 44.800000 104.300000 65.000000 69.100000
1.5 5) Print the columns
Index(['PovPct', 'Brth15to17', 'Brth18to19', 'ViolCrime', 'TeenBrth'], dtype='object')
1.6 6) Create a regression line based upon the dependent and independent variables:
PovPct Brth18to19
In this step only create a scatterplot of the two variables, simply plotting the data.
Note: The variable PovPct is the percent of a state’s population in 2000 living in households with incomes below the federally defined poverty level.
[ ]: plt.scatter(poverty.PovPct, poverty.Brth18to19)
[ ]: <matplotlib.collections.PathCollection at 0x7efc09f48d50>
1.7 7) Lets create a new variable, x1, as well as the results variable:
Example would be 1. x1 = sm.add_constant(x) 2. results = sm.OLS(y, x1).fit() 3. results.summary() This gives you the OLS Regression results, the coefficients table, and some additional tests. The data that you are interested in is the coefficient values. This is the value for the constant you created is b0, and birth19to19 is b1 in the regression equation.
[ ]: x1 = sm.add_constant(poverty.PovPct)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117:
FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
x = pd.concat(x[::order], 1)
[ ]: <class 'statsmodels.iolib.summary.Summary'> """
OLS Regression Results
==============================================================================
Dep. Variable: Brth18to19 R-squared: 0.422
Model: OLS Adj. R-squared: 0.410
Method: Least Squares F-statistic: 35.78
Date: Thu, 17 Feb 2022 Prob (F-statistic): 2.50e-07
Time: 02:35:21 Log-Likelihood: -207.98
No. Observations: 51 AIC: 420.0
Df Residuals: 49 BIC: 423.8
Df Model: 1
Covariance Type: nonrobust
============================================================================== coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------const 34.2124 6.641 5.151 0.000 20.866 47.559 PovPct 2.8822 0.482 5.982 0.000 1.914 3.850
==============================================================================
Omnibus: 1.175 Durbin-Watson: 2.161
Prob(Omnibus): 0.556 Jarque-Bera (JB): 0.988
Skew: 0.088 Prob(JB): 0.610
Kurtosis: 2.341 Cond. No. 45.1
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. """
1.8 8) Taking the coeffient values for the new constant and the Y variable, create a scatterplot:
e.g. yhat = 0.1464*x + 0.25712 fig = plt.plot(x, yhat, lw=4, c=’red’, label = ’regression line’)
[ ]: plt.scatter(poverty.PovPct, poverty.Brth18to19) plt.plot([0,25],[34.2124,2.8822*25+34.2124], color="magenta")
[ ]: [<matplotlib.lines.Line2D at 0x7fb63c15ab10>]
2 Problem 2: Implement code from lecture
2.1 1) Perform linear regression using the normal equation, as done in slides.
[ ]: [<matplotlib.lines.Line2D at 0x7fb641086710>]
[2. 93813808]])
[ ]: plt.plot(X,y,"b.") plt.plot([0,2],[theta_best[0],theta_best[1]*2+theta_best[0]], color="magenta")
[ ]: [<matplotlib.lines.Line2D at 0x7fb63be61190>]
2.2 2) Perform linear regression using Scikit-Learn, as done in the slides.
[ ]: (array([4.52769162]), array([[2.93813808]]))
/usr/local/lib/python3.7/dist-packages/numpy/core/shape_base.py:65:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. ary = asanyarray(ary)
[ ]: [<matplotlib.lines.Line2D at 0x7fb63bf92d50>]
3 Problem 3: Multivariate Linear Regression
In this problem we will continue using the poverty dataset. Do poverty and violent crimes affect teen pregnancy?
3.1 1) import the libraries you will need: numpy pandas matplotlab.pyplot statsmodels.api
3.2 2) Import the dataset, poverty_2.csv, and print it.
[ ]: PovPct ViolCrime TeenBrth
0 20.1 11.2 54.5
1 7.1 9.1 39.5
2 16.1 10.4 61.2
3 14.9 10.4 59.9
4 16.7 11.2 41.1
5 8.8 5.8 47.0
6 9.7 4.6 25.8
7 10.3 3.5 46.3
8 22.0 65.0 69.1
9 16.2 7.3 44.5
10 12.1 9.5 55.7
11 10.3 4.7 38.2
12 14.5 4.1 39.1
13 12.4 10.3 42.2
14 9.6 8.0 44.6
15 12.2 1.8 32.5
16 10.8 6.2 43.0
17 14.7 7.2 51.0
18 19.7 17.0 58.1
19 11.2 2.0 25.4
20 10.1 11.8 35.4
21 11.0 3.6 23.3
22 12.2 8.5 34.8
23 9.2 3.9 27.5
24 23.5 12.9 64.7
25 9.4 8.8 44.1
26 15.3 3.0 36.4
27 9.6 2.9 37.0
28 11.1 10.7 53.9
29 5.3 1.8 20.0
30 7.8 5.1 26.8
31 25.3 8.8 62.4
32 16.5 8.5 29.5
33 12.6 9.4 52.2
34 12.0 0.9 27.2
35 11.5 5.4 39.5
36 17.1 12.2 58.0
37 11.2 4.1 36.8
38 12.2 6.3 31.6
39 10.6 3.3 35.6
40 19.9 7.9 53.0
41 14.5 1.8 38.0
42 15.5 10.6 54.3
43 17.4 9.0 64.4
44 8.4 3.9 36.8
45 10.3 2.2 24.2
46 10.2 7.6 37.6
47 12.5 5.1 33.0
48 16.7 4.9 45.5
49 8.5 4.3 32.3
50 12.2 2.1 39.9
3.3 3) We need to normalize the input variables.
3.4 4) Split the data into input variables, X, and the output variable, Y.
3.5 5) Graph the dataset with a seed of 42.
3.6 6) Implement Gradient Descent.
This section has be provided. Please run and understand the code.
iteration : 0 loss : 0.14248615838353396 iteration : 100 loss : 0.005841062708110685 iteration : 200 loss : 0.005374637890296291 iteration : 300 loss : 0.005059239296919674 iteration : 400 loss : 0.0048406904218634165
[ ]: array([[0.41375839],
[0.26569717],
3.7 7) Implement Stochastic Gradient Descent. Please run.
iteration : 0 loss : 0.0066577245043739405 iteration : 100 loss : 0.003102327706993443 iteration : 200 loss : 0.002532377208293092 iteration : 300 loss : 0.0023333911770596814 iteration : 400 loss : 0.0022626837845736957
4 Problem 4, predict house price.
• import real_estate.csv
• Are there any null values in the dataset? Drop any missing data if exist.
• Create X as a 1-D array of the distance to the nearest MRT station, and y as the housing price
• What is the number of samples in the data set? To do this, you can look at the "shape" of X and y
• Split the data into train and test sets using sklearn’s train_test_split, with test_size = 1/3
• Find the line of best fit using a Linear Regression and show the result of coefficients and intercept (you can use sklearn’s linear regression)
• Using the predict method, make predictions for the test set and evaluate the performance (e.g., MSE or other metrics).
[4]: X1 transaction date ... Y house price of unit area
No ...
1 2012.917 ... 37.9
2 2012.917 ... 42.2
3 2013.583 ... 47.3
4 2013.500 ... 54.8
5 2012.833 ... 43.1
.. ... ... ...
410 2013.000 ... 15.4
411 2012.667 ... 50.0
412 2013.250 ... 40.6
413 2013.000 ... 52.5
414 2013.500 ... 63.9
[414 rows x 7 columns]
[5]: X1 transaction date ... Y house price of unit area
No ...
1 2012.917 ... 37.9
2 2012.917 ... 42.2
3 2013.583 ... 47.3
4 2013.500 ... 54.8
5 2012.833 ... 43.1
.. ... ... ...
410 2013.000 ... 15.4
411 2012.667 ... 50.0
412 2013.250 ... 40.6
413 2013.000 ... 52.5
414 2013.500 ... 63.9
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names
"X does not have valid feature names, but"
[109]: <matplotlib.collections.PathCollection at 0x7f578280b050>
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names
"X does not have valid feature names, but"
[112]: 73.4172288261123
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names
"X does not have valid feature names, but"
[113]: 72.16393207172295
Quartic has the least squared error, got larger for 5 or 6 powered, MSE is 72.1639
Mounted at /content/drive/
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts. Extracting templates from packages: 100%
[ ]: