DATA201 - Assignment 2 Solved

The dataset for this assignment is file sales_data.csv which is provided with this notebook.

Please choose menu items Kernel > Restart & Run All, then File > Save and Checkpoint in Jupyter before submission.

Problem Statement
A retail company wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.

You need to build a model to predict the purchase amount of customers against various products, which will help the company create personalized offers for customers against different products.

Data
Variable                     Description
User_ID                      User ID
Product_ID                   Product ID
Gender                       Sex of User
Age                          Age in bins
Occupation                   Occupation (Masked)
City_Category                Category of the City (A, B, C)
Stay_In_Current_City_Years   Number of years stayed in the current city
Marital_Status               Marital Status
Product_Category_1           Product Category (Masked)
Product_Category_2           Product may belong to other categories as well (Masked)
Product_Category_3           Product may belong to other categories as well (Masked)
Purchase                     Purchase Amount (Target Variable)
Evaluation

The root mean squared error (RMSE) will be used for model evaluation.
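For n predictions ŷ against true values y, RMSE = sqrt( (1/n) · Σ (y_i − ŷ_i)² ). The cells below compute it with np.sqrt(metrics.mean_squared_error(...)); an equivalent minimal sketch in plain NumPy:

def rmse(y_true, y_pred):
    # square root of the mean squared residual
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sqrt(np.mean(residuals ** 2))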

Questions and Code
In [1]:

import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler

np.random.seed(42)

Load the given dataset.

In [2]:

data = pd.read_csv('sales_data.csv')
data.dtypes
Out[2]:

Age                            object
City_Category                  object
Gender                         object
Marital_Status                  int64
Occupation                      int64
Product_Category_1              int64
Product_Category_2              int64
Product_Category_3              int64
Product_ID                      int64
Purchase                      float64
Stay_In_Current_City_Years     object
User_ID                         int64
dtype: object

1. Are there any missing values? [1 point]

In [3]:

data.isnull().sum()
Out[3]:

Age                           0
City_Category                 0
Gender                        0
Marital_Status                0
Occupation                    0
Product_Category_1            0
Product_Category_2            0
Product_Category_3            0
Product_ID                    0
Purchase                      0
Stay_In_Current_City_Years    0
User_ID                       0
dtype: int64

2. Drop attribute User_ID. [1 point]

In [4]:

data = data.drop('User_ID', axis=1)
3. Convert the following categorical attributes to numerical values using the rules below. [4 points]

Gender: F:0, M:1

Age: 0-17:0, 18-25:1, 26-35:2, 36-45:3, 46-50:4, 51-55:5, 55+:6

Stay_In_Current_City_Years: 0:0, 1:1, 2:2, 3:3, 4+:4

You may want to apply a lambda function to each element of a column in the dataframe (a minimal sketch follows). Some examples here may be helpful: https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
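For instance, a minimal sketch of the apply/lambda approach for the Gender column (the solution cell below uses the equivalent Series.map instead):

data['Gender'] = data['Gender'].apply(lambda g: 1 if g == 'M' else 0)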

In [5]:

data['Gender'] = data['Gender'].map({'F':0, 'M':1})
data['Age'] = data['Age'].map({'0-17':0, '18-25':1, '26-35':2, '36-45':3, '46-50':4, '51-55':5, '55+':6})
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].map({'0':0, '1':1, '2':2, '3':3, '4+':4})
data.head()

Out[5]:

   Age City_Category  Gender  Marital_Status  Occupation  Product_Category_1  Product_Cat
0    0             A       0               0          10                   1
1    4             B       1               1           7                   1
2    2             A       1               1          20                   1
3    5             A       0               0           9                   5
4    5             A       0               0           9                   2


4. Randomly split the current data frame into 2 subsets for training (80%) and test (20%). Use random_state = 42. [2 points]

In [6]:

data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)

5. Get the list of numerical predictors (all the attributes in the current data frame except the target, Purchase) and the list of categorical predictors. [1 point]

In [7]:

# Separate predictors from the target in each split
X_train = data_train.drop('Purchase', axis=1)
y_train = data_train['Purchase']
X_test = data_test.drop('Purchase', axis=1)
y_test = data_test['Purchase']

# Numerical predictors are the numeric columns; the remaining object
# column (City_Category) is the categorical predictor
num_cols = list(X_train.select_dtypes([np.number]).columns)
nom_cols = list(set(X_train.columns) - set(num_cols))
print(num_cols, nom_cols)
6.  Create a transformation pipeline including two pipelines handling the following [3 points]

Numerical predictors: apply Standard Scaling

Categorical predictor: apply One-hot-encoding

You will need to use ColumnTransformer. The example in Week 3 lectures may be helpful.

In [8]:

nom_onehot = [('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))]
nom_pl = Pipeline(nom_onehot)

num_impute = SimpleImputer(strategy='mean')
num_normalised = MinMaxScaler()
num_pl = Pipeline([('imp', num_impute), ('norm', num_normalised)])

num_cols = list(X_train.select_dtypes([np.number]).columns)
nom_cols = list(set(X_train.columns) - set(num_cols))

transformers = [('num', num_pl, num_cols),
                ('nom', nom_pl, nom_cols)]
col_transform = ColumnTransformer(transformers)
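Note that the question asks for Standard Scaling on the numerical predictors, while the cell above uses MinMaxScaler (the transformed arrays in Out[9] and Out[10] lie in [0, 1], consistent with min-max scaling). A minimal sketch of the numerical pipeline with StandardScaler, as the question literally asks, leaving the rest of the ColumnTransformer unchanged:

num_pl = Pipeline([('imp', SimpleImputer(strategy='mean')),
                   ('scale', StandardScaler())])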

7. Train and use that transformation pipeline to transform the training data (e.g. for a machine learning model). [2 points]

In [9]:

X_train_transformed = col_transform.fit_transform(X_train)
X_train_transformed
Out[9]:

array([[0.33333333, 0.        , 1.        , ..., 1.        , 0.        ,
        0.        ],
       [0.33333333, 1.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.33333333, 1.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.16666667, 1.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.5       , 1.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.66666667, 1.        , 1.        , ..., 0.        , 1.        ,
        0.        ]])

8. Use that transformation pipeline to transform the test data (e.g. for testing a machine learning model). [2 points]

In [10]:

X_test_transformed = col_transform.transform(X_test)
X_test_transformed

Out[10]:

array([[0.5       , 1.        , 1.        , ..., 1.        , 0.        ,
        0.        ],
       [0.16666667, 1.        , 1.        , ..., 0.        , 1.        ,
        0.        ],
       [0.5       , 0.        , 1.        , ..., 0.        , 0.        ,
        1.        ],
       ...,
       [0.33333333, 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.5       , 1.        , 0.        , ..., 0.        , 1.        ,
        0.        ]])

9. Build a Linear Regression model using the training data after transformation and test it on the test data. Report the RMSE values on the training and test data. [3 points]

Document: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [11]:

lr = LinearRegression()
lr_pipeline = Pipeline([('col_trans', col_transform), ('lr', lr)])
lr_pipeline.fit(X_train, y_train)

lr_train_pred = lr_pipeline.predict(X_train)
lr_test_pred = lr_pipeline.predict(X_test)

print("Linear Regression Training Set RMSE: %.4g" % np.sqrt(metrics.mean_squared_error(lr_train_pred, y_train)))
print("Linear Regression Test Set RMSE: %.4g" % np.sqrt(metrics.mean_squared_error(lr_test_pred, y_test)))

Linear Regression Training Set RMSE: 4600

Linear Regression Test Set RMSE: 4616

10. Repeat Question 9 using a KNeighborsRegressor. Comment on the processing time and performance of the model in this question. [1 point]

Document: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [12]:

knn = KNeighborsRegressor()
knn_pipeline = Pipeline([('col_trans', col_transform), ('knn', knn)])
knn_pipeline.fit(X_train, y_train)

knn_train_pred = knn_pipeline.predict(X_train)
knn_test_pred = knn_pipeline.predict(X_test)

print("K Neighbours Regressor Training Set RMSE: %.4g" % np.sqrt(metrics.mean_squared_error(knn_train_pred, y_train)))
print("K Neighbours Regressor Test Set RMSE: %.4g" % np.sqrt(metrics.mean_squared_error(knn_test_pred, y_test)))

K Neighbours Regressor Training Set RMSE: 3407

K Neighbours Regressor Test Set RMSE: 4230

The K-Nearest Neighbours regressor is significantly slower than Linear Regression because it has no compact fitted form: each prediction requires computing the distance from the query point to every training instance. Predicting over the whole training set therefore costs on the order of n × n distance computations, each proportional to the number of features, and our feature count is large after one-hot encoding. Linear Regression, by contrast, predicts with a single dot product per instance. A quick way to measure this is sketched below.

On accuracy, the two models trade off differently. Linear Regression's training and test RMSEs (4600 vs 4616) are nearly identical, indicating little variance in the residuals and no meaningful overfitting. KNN actually reaches a lower test RMSE (4230 vs 4616), but the large gap between its training RMSE (3407) and test RMSE (4230) signals overfitting: with many, mostly one-hot, features, distances become less informative (the curse of dimensionality). KNN tends to work best on small, low-dimensional datasets without too much noise.
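A minimal sketch of how the processing-time claim could be checked, using the fitted pipelines from the cells above and the standard-library time module:

import time

for name, model in [('Linear Regression', lr_pipeline), ('KNN', knn_pipeline)]:
    start = time.perf_counter()
    model.predict(X_test)  # KNN pays its cost here: one distance scan per query point
    print("%s prediction time on the test set: %.2f s" % (name, time.perf_counter() - start))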

In [ ]:
