CS156 Homework Assignment 2 - Regression Solved

The objective of this homework assignment is to predict house prices using several predictive models whose inputs are the variables that significantly influence the price. We will use four models and compare their predictive accuracy. The models are:

1.      Simple Linear Regression

2.      Multiple Linear Regression

3.      Decision Tree Regression

4.      Random Forest Regression

 

The dataset for this project contains house sale prices. There are 16 column headers:

1.      Waterfront      Dummy variable indicating if the house was overlooking a waterfront

2.      Renovated       If the house was renovated

3.      View                An index from 0 to 4 indicating how good the view was. Higher is better

4.      Condition         An index from 1 to 5 on the condition of the house. Higher is better

5.      Grade              An index from 1 to 4. Higher is better

6.      Bedrooms        Number of Bedrooms

7.      Bathrooms       Number of Bathrooms (can have 0.5 to indicate half bathroom)

8.      Sqft_living       Square footage of Interior living space

9.      Sqft_lot           Square footage of the land space

10.  Floors              Number of floors

11.  Sqft_above      Square footage of the interior living space that is above ground level

12.  Sqft_basement            Square footage of the interior living space that is below ground level

13.  Yr_built                       The year the house was initially built

14.  Sqft_living15   Square footage of the living area of the nearest 15 neighbors

15.  Sqft_lot15        Square footage of the land lots of the nearest 15 neighbors

16.  Price                Price of sale

            

Part (A): Data Import, Data Pre-processing

a.       Read the file Housing-Data-one-zip-3.csv

b.       Convert categorical data: Waterfront, Renovated, View, Condition, Grade

c.       Transform some data. For example, you may transform the column Yr_built to reflect the age of the building by subtracting Yr_built from 2020.

d.      Divide the data set into Training set and Test set be
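Steps a to d could be sketched as follows with pandas and scikit-learn. This is a minimal sketch, not the assignment's reference solution. The helper name `preprocess`, the 80/20 split, the `random_state`, and the decision to keep the five categorical fields as integer codes are all assumptions. The sample rows are made up and cover only a subset of the columns, so that the block runs without the CSV file.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

CATEGORICAL = ["Waterfront", "Renovated", "View", "Condition", "Grade"]


def preprocess(df, reference_year=2020):
    """Return (X, y): integer-coded categoricals, Age instead of Yr_built."""
    out = df.copy()
    # Assumption: the categorical fields are already numeric indices, so
    # converting them here only means storing them as integers.
    for col in CATEGORICAL:
        out[col] = out[col].astype(int)
    # Step c: replace the build year with the age of the building.
    out["Yr_built"] = reference_year - out["Yr_built"]
    out = out.rename(columns={"Yr_built": "Age"})
    y = out.pop("Price")  # target column
    return out, y


# Made-up rows standing in for Housing-Data-one-zip-3.csv. With the real
# file this would be: df = pd.read_csv("Housing-Data-one-zip-3.csv")
sample_csv = io.StringIO(
    "Waterfront,Renovated,View,Condition,Grade,Bedrooms,Sqft_living,Yr_built,Price\n"
    "0,0,0,3,3,3,2510,1961,350000\n"
    "0,0,0,3,3,4,1500,1999,230000\n"
    "0,1,2,4,3,4,2870,1999,400000\n"
    "1,0,4,3,4,4,4083,2005,550000\n"
    "0,0,0,5,2,2,750,2005,140000\n"
)
df = pd.read_csv(sample_csv)
X, y = preprocess(df)

# Step d: hold out part of the data as the test set (split size assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

Keeping each categorical field as a single integer column gives 5 + 10 = 15 input columns. One-hot encoding would be an alternative, but it changes the number of columns.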

 

We will use the same data set for all four prediction algorithms in this assignment. Below are the assumed values for the first five fields of the data set and the inputs your program should use to predict house prices. Predict the house price for the following cases (note: Age = 2020 - Yr_built).

 

Assume:

waterfront   renovated   view   condition   grade
0            0           0      3           3
 

[Bedrooms, Bathrooms, Sqft_living, Sqft_lot, Floors, Sqft_above, Sqft_basement, Age, Sqft_living15, Sqft_lot15]

    i.      [3, 0.75, 2510, 20000, 2.0, 2510, 0, 59, 2130, 20000]
   ii.      [4, 2.25, 1500,  5393, 2.0, 1500, 0, 21, 1500,  5952]
  iii.      [4, 2.25, 2870,  5393, 2.0, 2870, 0, 21, 1500,  5952]
   iv.      [4, 3.50, 4083, 68377, 2.0, 4083, 0, 15, 2430, 41382]
    v.      [4, 3.50, 4500, 68377, 2.0, 4500, 0, 15, 2430, 41382]
   vi.      [4, 3.50, 2870, 68377, 2.0, 2870, 0, 15, 2430, 41382]
  vii.      [4, 3.50,  750, 68377, 2.0,  750, 0, 15, 2430, 41382]
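One way to turn these cases into model inputs is to prepend the five assumed values to each 10-element vector. The sketch below assumes the model expects the columns in the order waterfront, renovated, view, condition, grade, followed by the ten fields listed above. The names `ASSUMED_PREFIX`, `CASES`, and `build_rows` are illustrative only.

```python
# Assumed values for the first five fields:
# waterfront, renovated, view, condition, grade.
ASSUMED_PREFIX = [0, 0, 0, 3, 3]

# The seven cases, in the order [Bedrooms, Bathrooms, Sqft_living, Sqft_lot,
# Floors, Sqft_above, Sqft_basement, Age, Sqft_living15, Sqft_lot15].
CASES = [
    [3, 0.75, 2510, 20000, 2.0, 2510, 0, 59, 2130, 20000],
    [4, 2.25, 1500, 5393, 2.0, 1500, 0, 21, 1500, 5952],
    [4, 2.25, 2870, 5393, 2.0, 2870, 0, 21, 1500, 5952],
    [4, 3.50, 4083, 68377, 2.0, 4083, 0, 15, 2430, 41382],
    [4, 3.50, 4500, 68377, 2.0, 4500, 0, 15, 2430, 41382],
    [4, 3.50, 2870, 68377, 2.0, 2870, 0, 15, 2430, 41382],
    [4, 3.50, 750, 68377, 2.0, 750, 0, 15, 2430, 41382],
]


def build_rows(prefix, cases):
    """Prepend the assumed categorical values to every numeric case."""
    return [list(prefix) + list(case) for case in cases]


# 15-column rows for the models that use all variables.
rows = build_rows(ASSUMED_PREFIX, CASES)

# Single-column rows (Sqft_living only) for simple linear regression.
sqft_living = [[case[2]] for case in CASES]
```

The resulting `rows` could be passed to a fitted model's `predict` method, provided the training columns were arranged in the same order.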

 

 

Part (B):  Use Simple Linear Regression to predict the house price using Sqft_living as the independent variable

a.       Print R-squared

b.      Plot the linear regression line for the Training Data Set

c.       Plot the linear regression line for the Test Data Set

d.      Predict the house prices for the test data set given above.
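To make the computation behind Part (B) concrete, here is a minimal pure-Python sketch of ordinary least squares with one predictor, plus the R-squared calculation. It is not the assignment's solution: the training pairs are made up, and the function names `fit_simple_linear` and `r_squared` are illustrative. In practice scikit-learn's `LinearRegression` would do the same job, and plotting is omitted here.

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - SS_res / SS_tot for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up (Sqft_living, Price) pairs, for illustration only.
train_x = [1000, 1500, 2000, 2500, 3000]
train_y = [200000, 260000, 330000, 380000, 450000]

slope, intercept = fit_simple_linear(train_x, train_y)
r2 = r_squared(train_x, train_y, slope, intercept)

# Sqft_living values of the seven cases given earlier.
case_sqft = [2510, 1500, 2870, 4083, 4500, 2870, 750]
predicted = [intercept + slope * x for x in case_sqft]
```

Because the model sees only Sqft_living, any two cases with the same living area receive the same predicted price.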

Part (C):  Use Multiple Linear Regression using all variables to predict the house price

 

a.       Print R-squared

b.      Predict the house prices for the test data set given above.
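A minimal scikit-learn sketch of Part (C) follows. The data are synthetic and noise-free, with three made-up features rather than the housing columns, so that the recovered coefficients can be checked against known values. Only the variable name `regressor` and the `coef_` attribute echo the assignment; everything else is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 50 rows, 3 features, exact linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0  # intercept of 4, no noise

regressor = LinearRegression().fit(X, y)

# score() returns R-squared on the data passed in (training data here).
r2 = regressor.score(X, y)
print("R-squared:", r2)
print(regressor.coef_)

# Predict for one new row: 2*1 - 1*2 + 0.5*3 + 4 = 5.5
pred = regressor.predict([[1.0, 2.0, 3.0]])
```

With the housing data, `X` would hold all 15 input columns, and `score` would normally be called on the held-out test set.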

Part (D):  Use Decision Tree Regression model to predict the house price

a.       Print R-squared

b.      Predict the house prices for the test data set given above.
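The sketch below illustrates two properties of decision tree regression that matter when reading its results. A fully grown tree reproduces its training data exactly, and it returns the same value for every input that lands in the same leaf. The four training points are made up, and a single feature is used to keep the example small.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up (Sqft_living, Price) training points, for illustration only.
X_train = [[750], [1500], [2870], [4083]]
y_train = [190000, 215000, 300000, 360000]

# Default settings grow the tree until every leaf is pure.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Each training point ends up in its own leaf, so training R-squared is 1.
train_r2 = tree.score(X_train, y_train)

# All three inputs fall in the leaf of the largest training house,
# so they receive the same predicted price.
preds = tree.predict([[4083], [4500], [5000]])
```

An R-squared measured on the training set can therefore look nearly perfect for a tree even when its predictions on new inputs are coarse.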

Part (E):  Use Random Forest Regression model (use 10 Random Trees) to predict house price

a.       Print R-squared

b.      Predict the house prices for the test data set given above.
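A minimal sketch of Part (E) with scikit-learn, using `n_estimators=10` to match the 10 trees the assignment asks for. The data are synthetic (one feature with a noisy linear price); the `random_state` values and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: price grows linearly with living area, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(200, 1))
y = 100 * X[:, 0] + 50000 + rng.normal(0, 10000, size=200)

# 10 trees, each fitted on a bootstrap sample of the training data.
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# R-squared on the training data.
r2 = forest.score(X, y)

# The forest prediction is the average of the individual tree predictions.
query = [[2870.0]]
pred = forest.predict(query)[0]
tree_mean = np.mean([t.predict(query)[0] for t in forest.estimators_])
```

Averaging over trees smooths the step-like output of a single tree, which is one reason a forest's predictions vary more gradually between nearby inputs.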

 

 

Summarize your observations:

1.      Tabulate the result as follows:

 

 

Test Data Point   Simple Linear Regression   Multiple Linear Regression   Decision Tree Regression   Random Forest Regression
(i)               356363.12752274            322853.75707537              363000                     405900
(ii)              232887.30257624            223043.61780165              215000                     218050
(iii)             400374.31265219            402892.93768066              299000                     317590
(iv)              548667.55588001            557862.19128196              359000                     474178.8
(v)               599647.17865495            588308.23996124              359000                     474178.8
(vi)              400374.31265219            469298.50531561              359000                     422128.8
(vii)             141197.33355656            314512.83816915              194820                     294180
R-Square          0.6682006794899293         0.8072554741507528           0.9952504116289396         0.9503025303839485
 
 

2.      Which predictive model performed the best and why do you think so?

a.       Although the R-squared value for the decision tree is the highest, its predictions repeat in the table whenever the 7th parameter is the same (cases (iv), (v), and (vi) all give 359000), which suggests that it is not the best estimator. Random forest scores somewhat lower than the decision tree on R-squared, but its predictions seem to line up better with the actual data.

 

3.      Which variables are most important for prediction? Use Multiple Linear Regression Model to justify your answer. Hint: use print(regressor.coef_) to print out the coefficients for the independent variables and focus on the last 10 coefficients.

a.       We get the following coefficients; the last 10 are the coefficients of focus:

 8.90689220e+04
 4.81272955e+04
 1.59078621e+04
 7.76704263e+03
 3.35765190e+04
-6.92876340e+03
 4.32242173e+03
 4.66709071e+01
 6.49099103e-01
-4.18199657e+03
 2.63412000e+01
 2.03297072e+01
-6.92693042e+02
 1.71650799e+01
 2.25297017e+00

 

b.      Looking at the score of each coefficient, we can conclude that the most important features are the number of bedrooms, bathrooms, and floors. The scores of the last 10 features are listed below:

Feature: 5, Score: -6928.76340
Feature: 6, Score: 4322.42173
Feature: 7, Score: 46.67091
Feature: 8, Score: 0.64910
Feature: 9, Score: -4181.99657
Feature: 10, Score: 26.34120
Feature: 11, Score: 20.32971
Feature: 12, Score: -692.69304
Feature: 13, Score: 17.16508
Feature: 14, Score: 2.25297
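The listing and ranking above can be reproduced from the printed coefficients with a few lines of Python. This is a sketch under one assumption, labeled in the code: the feature order is taken to be the five categorical fields followed by the ten numeric fields in the order used for the test cases. The names `COEFS`, `NAMES`, `ranked`, and `top3` are illustrative.

```python
# Coefficients as printed by regressor.coef_ in the answer above.
COEFS = [
    8.90689220e+04, 4.81272955e+04, 1.59078621e+04, 7.76704263e+03,
    3.35765190e+04, -6.92876340e+03, 4.32242173e+03, 4.66709071e+01,
    6.49099103e-01, -4.18199657e+03, 2.63412000e+01, 2.03297072e+01,
    -6.92693042e+02, 1.71650799e+01, 2.25297017e+00,
]

# Assumption: column order is the five categorical fields, then the ten
# numeric fields in the same order as the test-case vectors.
NAMES = [
    "Waterfront", "Renovated", "View", "Condition", "Grade",
    "Bedrooms", "Bathrooms", "Sqft_living", "Sqft_lot", "Floors",
    "Sqft_above", "Sqft_basement", "Age", "Sqft_living15", "Sqft_lot15",
]

# Print the last 10 coefficients in the "Feature / Score" format.
for i in range(5, len(COEFS)):
    print(f"Feature: {i}, Score: {COEFS[i]:.5f}")

# Rank the last 10 features by absolute coefficient size.
ranked = sorted(range(5, len(COEFS)), key=lambda i: abs(COEFS[i]), reverse=True)
top3 = [NAMES[i] for i in ranked[:3]]
print(top3)
```

Raw coefficients depend on each variable's units (a bedroom versus a square foot), so this ranking compares per-unit effects rather than standardized importance.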

 

 

 
