These days the whole process of renting a bike is automated: anyone can rent a bike at one location and return it at another. Companies in the bike-rental business want to know how many bikes will be rented given the date, time of day, weather conditions, temperature, humidity, and so on. Your task is to find the best linear regression model that can accurately predict the bike rental count given these conditions.
The train file consists of 13865 rows with 13 columns. The test set consists of 3514 rows with 12 columns (the features); the last column (the target) is held out for evaluation.
The assignment has three parts.
(a) Implement the class LinearRegressor. There are three functions to implement: __init__, train, and predict. The primary loss function used for linear regression is mean squared error, so you are also required to implement mean_squared_loss and mean_squared_gradient. We have provided a basic function to read data from CSV; you need to implement preprocess_dataset, which processes the dataset so that it can be used for training (i.e. converting strings to floats or ints, mapping categorical variables to suitable values, or fancier things such as feature scaling).
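A minimal sketch of what part (a) could look like, assuming a NumPy-based implementation where a bias column is appended to the feature matrix during preprocessing (the class and function names come from the assignment; everything else, including the constructor signature, is an illustrative assumption):

```python
import numpy as np

def mean_squared_loss(xdata, ydata, weights):
    # MSE = (1/n) * ||X w - y||^2
    residuals = xdata @ weights - ydata
    return np.mean(residuals ** 2)

def mean_squared_gradient(xdata, ydata, weights):
    # gradient of MSE w.r.t. the weights: (2/n) * X^T (X w - y)
    n = xdata.shape[0]
    return (2.0 / n) * xdata.T @ (xdata @ weights - ydata)

class LinearRegressor:
    def __init__(self, dims):
        # one weight per feature; a bias column is assumed to be
        # appended to X in preprocess_dataset
        self.weights = np.zeros(dims)

    def train(self, xtrain, ytrain, loss_function, gradient_function,
              epochs=100, lr=0.01):
        # plain batch gradient descent; returns the final training loss
        for _ in range(epochs):
            self.weights -= lr * gradient_function(xtrain, ytrain, self.weights)
        return loss_function(xtrain, ytrain, self.weights)

    def predict(self, xtest):
        return xtest @ self.weights
```

Passing the loss and gradient as arguments to train keeps the class reusable for the alternative losses in part (b).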
(b) Experiment with different loss functions. In this part we ask you to implement the following three loss functions along with their gradient functions.
To see how these loss functions are defined, refer to https://en.wikipedia.org/wiki/Mean_squared_error, https://en.wikipedia.org/wiki/Mean_absolute_error, and https://memoex.github.io/note/tech/ml/loss/ (RMSE and log-cosh loss).
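Under the same conventions as part (a) (NumPy arrays, weights multiplied as X @ w), the three additional losses and their gradients could be sketched as follows. The function names are illustrative assumptions, not mandated by the assignment:

```python
import numpy as np

def mean_absolute_loss(xdata, ydata, weights):
    return np.mean(np.abs(xdata @ weights - ydata))

def mean_absolute_gradient(xdata, ydata, weights):
    # subgradient of MAE: (1/n) * X^T sign(X w - y)
    n = xdata.shape[0]
    return xdata.T @ np.sign(xdata @ weights - ydata) / n

def root_mean_squared_loss(xdata, ydata, weights):
    return np.sqrt(np.mean((xdata @ weights - ydata) ** 2))

def root_mean_squared_gradient(xdata, ydata, weights):
    # chain rule through the square root: grad(RMSE) = grad(MSE) / (2 * RMSE),
    # which simplifies to X^T r / (n * RMSE)
    n = xdata.shape[0]
    residuals = xdata @ weights - ydata
    rmse = np.sqrt(np.mean(residuals ** 2))
    return xdata.T @ residuals / (n * rmse)

def mean_log_cosh_loss(xdata, ydata, weights):
    return np.mean(np.log(np.cosh(xdata @ weights - ydata)))

def mean_log_cosh_gradient(xdata, ydata, weights):
    # d/dr log(cosh r) = tanh r
    n = xdata.shape[0]
    return xdata.T @ np.tanh(xdata @ weights - ydata) / n
```

A finite-difference check against each loss is a quick way to validate the hand-derived gradients before training with them.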
After implementing the above losses, you are required to plot a graph showing the different losses as a function of epoch. Do the following: train your model using the four different gradient functions to update the model weights. After each epoch, compute the mean_squared_loss of the model on the complete training dataset, and plot this mean_squared_loss as a function of epoch for each gradient function on the same graph (use legends in matplotlib). Keep the learning rate the same across all four gradient update functions; choose a learning rate small enough that none of the losses diverges, and a number of epochs large enough that the losses visibly converge. Note that even though you use different gradient functions, you compute the same loss (mean_squared_loss) each epoch; this is because the losses differ in absolute magnitude and hence are not directly comparable. Save this graph with the name comparison.jpg.
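The comparison experiment could be structured as a sketch like the one below, assuming the four gradient functions from part (b) (restated here inline so the example is self-contained; all names are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def mean_squared_loss(X, y, w):
    return np.mean((X @ w - y) ** 2)

# the four gradient update functions from part (b), restated for this sketch
def mse_grad(X, y, w):
    return 2 * X.T @ (X @ w - y) / len(y)

def mae_grad(X, y, w):
    return X.T @ np.sign(X @ w - y) / len(y)

def rmse_grad(X, y, w):
    r = X @ w - y
    return X.T @ r / (len(y) * np.sqrt(np.mean(r ** 2)))

def logcosh_grad(X, y, w):
    return X.T @ np.tanh(X @ w - y) / len(y)

def train_and_track(X, y, grad_fn, lr=0.01, epochs=200):
    """Run gradient descent and record mean_squared_loss on the full
    training set after every epoch (the common yardstick for all four)."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(epochs):
        w -= lr * grad_fn(X, y, w)
        history.append(mean_squared_loss(X, y, w))
    return history

if __name__ == "__main__":
    # synthetic stand-in for the preprocessed training data
    rng = np.random.default_rng(0)
    X = np.column_stack([rng.normal(size=200), np.ones(200)])
    y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=200)
    for name, g in [("MSE", mse_grad), ("MAE", mae_grad),
                    ("RMSE", rmse_grad), ("log-cosh", logcosh_grad)]:
        plt.plot(train_and_track(X, y, g), label=name)
    plt.xlabel("epoch")
    plt.ylabel("mean squared loss on training set")
    plt.legend()
    plt.savefig("comparison.jpg")
```

The single shared learning rate is the default of train_and_track; with real data it would need to be tuned so that none of the four curves diverges.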
(c) For the final part, tune the hyperparameters (learning rate, number of epochs, etc.) to build the best regressor and perform well on the Kaggle leaderboard. You can play around with preprocess_dataset to scale, modify, or normalize features as you wish. For example, if you have a categorical variable that takes the three values {"Red", "Green", "Blue"}, you can either treat it as an ordinal variable, giving values {Red = 0, Green = 1, Blue = 2}, or add more columns and one-hot encode it, e.g. Red = (1,0,0), Green = (0,1,0), Blue = (0,0,1). You are also allowed to drop features or add your own. Kaggle will calculate the mean squared error between your predicted y and the true y on the test dataset. Submit main.py with the appropriate hyperparameters so that on running python3 main.py --train_data train.csv --test_data test.csv your model prints the predicted output on the test data. Also make sure this runs within 10 minutes.
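The one-hot encoding step inside preprocess_dataset could be sketched as follows (a helper written for illustration, not part of the provided starter code). Fixing the category order is important so that train and test data produce columns in the same order:

```python
import numpy as np

def one_hot_encode(column, categories=None):
    """Expand a categorical column into one indicator column per category.
    Pass the `categories` learned on the training set when encoding the
    test set so both splits share the same column order."""
    if categories is None:
        categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(column), len(categories)))
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded, categories

# e.g. a hypothetical "color" feature taking values Red/Green/Blue
colors = ["Red", "Green", "Blue", "Red"]
encoded, cats = one_hot_encode(colors)
# cats == ['Blue', 'Green', 'Red']; each row has exactly one 1
```

One-hot encoding avoids imposing the artificial ordering (Red < Green < Blue) that ordinal encoding introduces, at the cost of extra columns.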