The objective of this homework assignment is to predict house prices using predictive models whose inputs are the variables that significantly influence the price. We will use 4 different models and compare their predictive accuracy. Here are the models we will use:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Decision Tree Regression
4. Random Forest Regression
The dataset for this project contains house sale prices. There are 16 column headers:
1. Waterfront: Dummy variable indicating whether the house overlooks a waterfront
2. Renovated: Whether the house was renovated
3. View: An index from 0 to 4 indicating how good the view was. Higher is better
4. Condition: An index from 1 to 5 on the condition of the house. Higher is better
5. Grade: An index from 1 to 4. Higher is better
6. Bedrooms: Number of bedrooms
7. Bathrooms: Number of bathrooms (may include 0.5 to indicate a half bathroom)
8. Sqft_living: Square footage of the interior living space
9. Sqft_lot: Square footage of the land space
10. Floors: Number of floors
11. Sqft_above: Square footage of the interior living space that is above ground level
12. Sqft_basement: Square footage of the interior living space that is below ground level
13. Yr_built: The year the house was initially built
14. Sqft_living15: Square footage of the living area of the nearest 15 neighbors
15. Sqft_lot15: Square footage of the land lots of the nearest 15 neighbors
16. Price: Price of sale
Part (A): Data Import, Data Pre-processing
a. Read the file Housing-Data-one-zip-3.csv
b. Convert categorical data: Waterfront, Renovated, View, Condition, Grade
c. Transform some data. For example, you may transform the column Yr_built to reflect the age of the building by subtracting Yr_built from 2020.
d. Divide the data set into Training set and Test set be
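The steps in Part (A) can be sketched as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the column names are taken from the list above and may differ in case from the actual CSV headers, the categorical fields are kept as integer codes (one-hot encoding with `pd.get_dummies` is an alternative), and the 80/20 split and the seed are assumptions because the split instruction above is cut off. The helper names `preprocess` and `split` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names as listed in the assignment (assumed to match the CSV headers).
CATEGORICAL = ["Waterfront", "Renovated", "View", "Condition", "Grade"]


def preprocess(df, reference_year=2020):
    """Return a copy with integer-coded categoricals and Yr_built replaced by Age."""
    out = df.copy()
    # Assumption: keep the categorical fields as integer codes.
    for col in CATEGORICAL:
        out[col] = out[col].astype(int)
    # Age of the building, as suggested in step (c).
    out["Age"] = reference_year - out["Yr_built"]
    return out.drop(columns=["Yr_built"])


def split(df, test_size=0.2, seed=0):
    """Split into training and test sets; test_size and seed are assumptions."""
    X = df.drop(columns=["Price"])
    y = df["Price"]
    return train_test_split(X, y, test_size=test_size, random_state=seed)


# Usage with the assignment file (not executed here):
# df = preprocess(pd.read_csv("Housing-Data-one-zip-3.csv"))
# X_train, X_test, y_train, y_test = split(df)
```

Keeping the five categorical fields as single integer columns gives 15 input columns in total.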
We will use the same data set for all 4 prediction algorithms in this assignment. The first 5 fields of the data set take the assumed values below; the remaining inputs to your program are listed for each case. Predict the house price for the following cases (Note: Age = 2020 - Yr_built).
Assume: waterfront = 0, renovated = 0, view = 0, condition = 3, grade = 3
[Bedrooms, Bathrooms, Sqft_living, Sqft_lot, Floors, Sqft_above, Sqft_basement, Age, Sqft_living15, Sqft_lot15]
i. [3, 0.75, 2510, 20000, 2.0, 2510, 0, 59, 2130, 20000]
ii. [4, 2.25, 1500, 5393, 2.0, 1500, 0, 21, 1500, 5952]
iii. [4, 2.25, 2870, 5393, 2.0, 2870, 0, 21, 1500, 5952]
iv. [4, 3.50, 4083, 68377, 2.0, 4083, 0, 15, 2430, 41382]
v. [4, 3.50, 4500, 68377, 2.0, 4500, 0, 15, 2430, 41382]
vi. [4, 3.50, 2870, 68377, 2.0, 2870, 0, 15, 2430, 41382]
vii. [4, 3.50, 750, 68377, 2.0, 750, 0, 15, 2430, 41382]
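One way to turn the seven cases above into model inputs is to prepend the five assumed values to each case. The sketch below assumes the training matrix uses the column order [waterfront, renovated, view, condition, grade] followed by the ten listed fields; if the training columns are ordered or encoded differently, the rows must be rearranged to match. The names `ASSUMED`, `CASES`, and `build_inputs` are hypothetical.

```python
import numpy as np

# Assumed values for [waterfront, renovated, view, condition, grade].
ASSUMED = [0, 0, 0, 3, 3]

# Cases (i) to (vii), in the order [Bedrooms, Bathrooms, Sqft_living, Sqft_lot,
# Floors, Sqft_above, Sqft_basement, Age, Sqft_living15, Sqft_lot15].
CASES = [
    [3, 0.75, 2510, 20000, 2.0, 2510, 0, 59, 2130, 20000],
    [4, 2.25, 1500, 5393, 2.0, 1500, 0, 21, 1500, 5952],
    [4, 2.25, 2870, 5393, 2.0, 2870, 0, 21, 1500, 5952],
    [4, 3.50, 4083, 68377, 2.0, 4083, 0, 15, 2430, 41382],
    [4, 3.50, 4500, 68377, 2.0, 4500, 0, 15, 2430, 41382],
    [4, 3.50, 2870, 68377, 2.0, 2870, 0, 15, 2430, 41382],
    [4, 3.50, 750, 68377, 2.0, 750, 0, 15, 2430, 41382],
]


def build_inputs(assumed, cases):
    """Prepend the assumed fields to every case, giving one 15-value row per case."""
    cases = np.asarray(cases, dtype=float)
    prefix = np.tile(np.asarray(assumed, dtype=float), (len(cases), 1))
    return np.hstack([prefix, cases])


X_new = build_inputs(ASSUMED, CASES)

# Simple linear regression only needs Sqft_living, which sits at index 7
# once the five assumed fields are placed in front.
SQFT_LIVING_COL = 7
X_new_simple = X_new[:, [SQFT_LIVING_COL]]
```

`X_new` can be passed to the models that use all variables, and `X_new_simple` to the model that uses Sqft_living alone.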
Part (B): Use Simple Linear Regression to predict the house price using Sqft_living as the independent variable
a. Print Rsquare
b. Plot the linear regression line for the Training Data Set
c. Plot the linear regression line for the Test Data Set
d. Predict the house prices for the test data set given above.
Part (C): Use Multiple Linear Regression using all variables to predict the house price
a. Print Rsquare
b. Predict the house prices for the test data set given above.
Part (D): Use Decision Tree Regression model to predict the house price
a. Print Rsquare
b. Predict the house prices for the test data set given above.
Part (E): Use Random Forest Regression model (use 10 Random Trees) to predict house price
a. Print Rsquare
b. Predict the house prices for the test data set given above.
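Parts (C) to (E) follow the same fit, score, and predict pattern, so they can be sketched together. The helper names `make_models`, `fit_and_score`, and `predict_all` are hypothetical, and the seed is an assumption; only the 10 trees for the random forest come from the assignment. The sketch reports R-squared on both the training and the test set, because a fully grown decision tree can score close to 1 on the data it was trained on.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def make_models(seed=0):
    """Create the three models that use all variables (seed is an assumption)."""
    return {
        "Multiple Linear Regression": LinearRegression(),
        "Decision Tree Regression": DecisionTreeRegressor(random_state=seed),
        # Part (E) asks for 10 trees.
        "Random Forest Regression": RandomForestRegressor(n_estimators=10, random_state=seed),
    }


def fit_and_score(models, X_train, y_train, X_test, y_test):
    """Fit each model and return {name: (train R-squared, test R-squared)}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    return scores


def predict_all(models, X_new):
    """Return {name: predictions} for already-fitted models."""
    return {name: model.predict(X_new) for name, model in models.items()}


# Usage (not executed here), given the splits and the rows to predict:
# models = make_models()
# for name, (train_r2, test_r2) in fit_and_score(models, X_train, y_train, X_test, y_test).items():
#     print(name, train_r2, test_r2)
# print(predict_all(models, X_new))
```

Comparing the two R-squared values per model shows whether a high score reflects fit to the training data or performance on unseen data.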
Summarize your observations:
1. Tabulate the result as follows:
Test Data Point | Simple Linear Regression | Multiple Linear Regression | Decision Tree Regression | Random Forest Regression
(i)             | 356363.12752274          | 322853.75707537            | 363000                   | 405900
(ii)            | 232887.30257624          | 223043.61780165            | 215000                   | 218050
(iii)           | 400374.31265219          | 402892.93768066            | 299000                   | 317590
(iv)            | 548667.55588001          | 557862.19128196            | 359000                   | 474178.8
(v)             | 599647.17865495          | 588308.23996124            | 359000                   | 474178.8
(vi)            | 400374.31265219          | 469298.50531561            | 359000                   | 422128.8
(vii)           | 141197.33355656          | 314512.83816915            | 194820                   | 294180
R-Square        | 0.6682006794899293       | 0.8072554741507528         | 0.9952504116289396       | 0.9503025303839485
2. Which predictive model performed the best and why do you think so?
a. Although the R-squared value for the decision tree is the highest, its predictions repeat in the table: test points (iv), (v), and (vi) all get 359000 even though their Sqft_living values differ, which suggests it is not the best estimator. The random forest has a somewhat lower R-squared than the decision tree, but its predictions seem to line up better with the actual data.
3. Which variables are most important for prediction? Use Multiple Linear Regression Model to justify your answer. Hint: use print(regressor.coef_) to print out the coefficients for the independent variables and focus on the last 10 coefficients.
a. We get the following coefficients; the last 10 (from -6.92876340e+03 onward) are the coefficients of focus:
8.90689220e+04
4.81272955e+04
1.59078621e+04
7.76704263e+03
3.35765190e+04
-6.92876340e+03
4.32242173e+03
4.66709071e+01
6.49099103e-01
-4.18199657e+03
2.63412000e+01
2.03297072e+01
-6.92693042e+02
1.71650799e+01
2.25297017e+00
b. The scores of the last 10 coefficients are listed below. From these we conclude that the most important features are the number of bedrooms, floors, and bathrooms:
Feature: 5, Score: -6928.76340
Feature: 6, Score: 4322.42173
Feature: 7, Score: 46.67091
Feature: 8, Score: 0.64910
Feature: 9, Score: -4181.99657
Feature: 10, Score: 26.34120
Feature: 11, Score: 20.32971
Feature: 12, Score: -692.69304
Feature: 13, Score: 17.16508
Feature: 14, Score: 2.25297
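The listing and the ranking behind this answer can be reproduced from the printed coefficients. The sketch below copies the 15 values reported above and ranks the last 10 by absolute value; the helper names are hypothetical, and the mapping from index to feature name assumes the column order [waterfront, renovated, view, condition, grade] followed by the ten fields listed for the test cases.

```python
import numpy as np

# Coefficients as printed by print(regressor.coef_) in the answer above.
coef = np.array([
    8.90689220e+04, 4.81272955e+04, 1.59078621e+04, 7.76704263e+03,
    3.35765190e+04, -6.92876340e+03, 4.32242173e+03, 4.66709071e+01,
    6.49099103e-01, -4.18199657e+03, 2.63412000e+01, 2.03297072e+01,
    -6.92693042e+02, 1.71650799e+01, 2.25297017e+00,
])

# Assumed column order for indices 5 to 14.
NAMES = {
    5: "Bedrooms", 6: "Bathrooms", 7: "Sqft_living", 8: "Sqft_lot", 9: "Floors",
    10: "Sqft_above", 11: "Sqft_basement", 12: "Age", 13: "Sqft_living15", 14: "Sqft_lot15",
}


def last_n_scores(coef, n=10):
    """Return (index, value) pairs for the last n coefficients."""
    start = len(coef) - n
    return [(i, float(coef[i])) for i in range(start, len(coef))]


def rank_by_magnitude(pairs):
    """Sort (index, value) pairs by absolute value, largest first."""
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


for i, v in last_n_scores(coef):
    print("Feature: %d, Score: %.5f" % (i, v))

top3 = [i for i, _ in rank_by_magnitude(last_n_scores(coef))[:3]]
print([NAMES[i] for i in top3])
```

Under the assumed column order, the three largest magnitudes belong to indices 5, 6, and 9. Raw coefficients depend on each feature's units, so this ranking compares magnitudes only and does not account for differences in scale between features.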