Projects
Machine Learning Course
EPFL
Introduction
In this project, you will learn to use the concepts we have seen in the lectures and practiced in the labs on a real-world dataset, start to finish. You will do exploratory data analysis to understand your dataset and your features, perform feature processing and engineering to clean your dataset and extract more meaningful information, implement and apply machine learning methods on real data, analyze your model, generate predictions using those methods, and report your findings.
Step 1 - Getting Started
Create an account using your epfl.ch email and head over to the competition arena:
https://www.aicrowd.com/challenges/epfl-machine-learning-higgs-2019
Then, download the training dataset, available in .csv format. To load the data, use the same code we used during the labs. You can find an example of a .csv loading function in our provided template code from labs 1 and 2.
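The template code from the labs already provides a CSV loading helper, so you do not need to write one from scratch. Purely as an illustration, a minimal NumPy-based loader might look like the sketch below; the file name, the column layout (event id first, then the 's'/'b' label, then the features), and the label encoding are assumptions, so check them against the actual file header.

```python
import numpy as np

def load_csv_data(path):
    """Minimal CSV loading sketch.
    Assumes column 0 is the event id, column 1 the label ('s'/'b'),
    and the remaining columns are the features."""
    # Load the numeric columns, skipping the header row.
    data = np.genfromtxt(path, delimiter=",", skip_header=1)
    ids = data[:, 0].astype(int)
    tx = data[:, 2:]

    # Load the label column separately as strings and map 's' -> 1, 'b' -> -1.
    labels = np.genfromtxt(path, delimiter=",", skip_header=1,
                           usecols=[1], dtype=str)
    y = np.where(labels == "s", 1, -1)
    return y, tx, ids

# Example usage (the path is an assumption):
# y, tx, ids = load_csv_data("data/train.csv")
```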
Step 2 - Implement ML Methods
We want you to implement and use the methods we have seen in class and in the labs. You will need to provide working implementations of the functions in Table 1. If you have not finished them during the labs, you should start by implementing the first ones to have a working toolbox before diving into the dataset.
least_squares_GD(y, tx, initial_w, max_iters, gamma): Linear regression using gradient descent
least_squares_SGD(y, tx, initial_w, max_iters, gamma): Linear regression using stochastic gradient descent
least_squares(y, tx): Least squares regression using normal equations
ridge_regression(y, tx, lambda_): Ridge regression using normal equations
logistic_regression(y, tx, initial_w, max_iters, gamma): Logistic regression using gradient descent or SGD
reg_logistic_regression(y, tx, lambda_, initial_w, max_iters, gamma): Regularized logistic regression using gradient descent or SGD
Table 1: List of functions to implement. In the above method signatures, for iterative methods, initial_w is the initial weight vector, gamma is the step-size, and max_iters is the number of steps to run. lambda_ is always the regularization parameter. (Note that here we have used the trailing underscore because lambda is a reserved word in Python with a different meaning). For SGD, you must use the standard mini-batch-size 1 (sample just one datapoint).
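As an illustration of the expected signatures, here is a minimal sketch of the two normal-equation methods. It assumes the MSE loss with the 1/(2N) convention and the 2N scaling of lambda_ used in the labs; double-check both conventions against your lab code before reusing anything.

```python
import numpy as np

def compute_mse(y, tx, w):
    """MSE loss, here defined as (1 / 2N) * sum(e^2) (convention assumed)."""
    e = y - tx.dot(w)
    return e.dot(e) / (2 * len(y))

def least_squares(y, tx):
    """Least squares regression using the normal equations: solve (X^T X) w = X^T y."""
    a = tx.T.dot(tx)
    b = tx.T.dot(y)
    w = np.linalg.solve(a, b)
    return w, compute_mse(y, tx, w)

def ridge_regression(y, tx, lambda_):
    """Ridge regression using the normal equations: solve (X^T X + 2N*lambda_*I) w = X^T y.
    The 2N scaling of lambda_ is an assumption; match the convention from the labs."""
    n, d = tx.shape
    a = tx.T.dot(tx) + 2 * n * lambda_ * np.eye(d)
    b = tx.T.dot(y)
    w = np.linalg.solve(a, b)
    return w, compute_mse(y, tx, w)
```

Using np.linalg.solve instead of explicitly inverting the matrix is both faster and numerically more stable.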
You should take care of the following:
• Return type: Note that all functions should return (w, loss), i.e. the last weight vector of the method and the corresponding loss value (cost function). Note that while in previous labs you might have kept track of all encountered w for iterative methods, here we only want the last one (a minimal sketch illustrating this is shown after this list).
• File names: Please provide all function implementations in a single Python file, called implementations.py.
• All code should be easily readable and commented.
• Note that we might automatically call your provided methods and evaluate them for correctness.
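To make the return convention concrete, here is a minimal sketch of least_squares_GD that returns only the final weight vector and its loss; the MSE loss with the 1/(2N) convention is an assumption, so match it to the convention used in your labs.

```python
import numpy as np

def least_squares_GD(y, tx, initial_w, max_iters, gamma):
    """Linear regression using gradient descent.
    Returns only the last weight vector and its MSE loss (1/2N convention assumed)."""
    w = initial_w
    for _ in range(max_iters):
        e = y - tx.dot(w)             # residuals for the current w
        grad = -tx.T.dot(e) / len(y)  # gradient of the MSE loss
        w = w - gamma * grad          # gradient descent step
    e = y - tx.dot(w)
    loss = e.dot(e) / (2 * len(y))    # loss of the last w only
    return w, loss
```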
For reference, here are some good practices for scientific computing: http://arxiv.org/pdf/1609.00037, or an older article, http://arxiv.org/pdf/1210.0530.