Option A - Machine Learning for Science
Step 1 - Getting started
To access the challenges, please first create an account at AIcrowd.com using your @epfl.ch email address. Pick your preferred competition among the following two. To read the description and download the dataset, please follow the corresponding links:
• https://www.aicrowd.com/challenges/epfl-ml-text-classification
• https://www.aicrowd.com/challenges/epfl-ml-road-segmentation
For the two possible tasks, we provide some additional description and sample code on the course GitHub.
Step 2 - Implement ML Methods
You are allowed to use any external library and ML techniques, as long as you properly cite any external code you use (e.g., code found on GitHub).
Do not use the AIcrowd score as the only estimate of your error. Instead, always estimate your test error using a local validation set or local cross-validation! This is important to avoid overfitting the online test set. It also lets you run experiments faster and saves uploading bandwidth :).
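For example, a local validation split can be as simple as the following sketch; the array names x and y are placeholders for your own features and labels:

```python
# A minimal sketch of a local validation split; `x` and `y` are
# placeholder NumPy arrays holding your features and labels.
import numpy as np

def split_data(x, y, ratio=0.8, seed=1):
    """Randomly split (x, y) into a training and a validation part."""
    indices = np.random.default_rng(seed).permutation(len(y))
    cut = int(ratio * len(y))
    return x[indices[:cut]], y[indices[:cut]], x[indices[cut:]], y[indices[cut:]]

# x_train, y_train, x_val, y_val = split_data(x, y)
# Train on the first part, report your metric on the second part,
# and only then upload predictions to AIcrowd.
```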
• Code: The complete executable and documented Python code, as a GitHub repository link for your group of students (here is the GitHub Classroom invite link). Rules for the code part:
– Documentation: Your ML system must be clearly described in your PDF report and also well-documented in the code itself. A clear ReadMe file must be provided. The documentation must also cover all data preparation, feature generation, and cross-validation steps that you used.
– External ML libraries are allowed, as long as accurately cited and documented.
– External datasets are allowed, as long as accurately cited and documented.
Option C - ML Reproducibility Challenge
Your team participates in the ML Reproducibility Challenge 2021. Here the goal is to select a recently submitted research paper and try to reproduce (parts of) its experiments, for example from NeurIPS, ICML, ICLR, ACL-IJCNLP, EMNLP, CVPR, ICCV, AAAI, or IJCAI:
- Solid comparison baselines supporting your claims
Quantify the benefits of your method by providing clear quality measurements of the most important aspects and additions you chose for your model. Start with a very basic baseline, and demonstrate what improvements your contributions yield.
- Scientific novelty and creativity
You will likely be using more than the standard methods we saw in the first half of the course. To communicate that your methods work and that you understand them, you should make sure that your report makes clear the following points.
– What specific problem your method is intended to solve.
By specific, we do not mean a broad task like “image classification”, but rather the concrete issue with your current model that you are trying to improve with this method.
– Why is this an important problem? Why are you solving this one instead of something else?
– How is your method helping?
– What are the results of your method? Compare the error before and after.
- Writeup quality
Some advice:
– Try to convey a clear story giving the most relevant aspects of your approach, in a reproducible way. Learning what has not worked can additionally help the reader (and help them better understand why you made the choices you did), but focus on what is most relevant for your final solution.
– Before the submission, have an external person proofread your report. It is easy to write a sentence that makes perfect sense to you since you wrote it but is actually hard to parse. Use a spell-checker.
– Plots are a great way to share information that might be hard to convey in writing. Make sure that your plots are understandable: label the axes, add a title, choose sensible axis limits, and add a description of what the plot shows and what can be learned from it; a short matplotlib sketch of this checklist follows below.
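As an illustration, here is a minimal matplotlib sketch of this checklist; the error curves are dummy data standing in for your own measurements:

```python
# A sketch of a well-labeled plot; the curves are dummy data.
import matplotlib.pyplot as plt
import numpy as np

lambdas = np.logspace(-4, 0, 20)
train_err = 0.30 + 0.02 * np.log10(lambdas) ** 2          # dummy curve
val_err = 0.35 + 0.03 * (np.log10(lambdas) + 2) ** 2      # dummy curve

plt.semilogx(lambdas, train_err, label="train error")
plt.semilogx(lambdas, val_err, label="validation error")
plt.xlabel(r"regularization strength $\lambda$")          # axis labels
plt.ylabel("RMSE")
plt.title("Ridge regression: error vs. regularization")   # title
plt.legend()
plt.show()
```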
• Competitive Part (only for option B). The final rank of your team on the (private) leaderboard will be translated linearly to a scale from 4 to 6.
As usual, your code and report will be automatically checked for plagiarism.
Guidelines for Machine Learning Projects
Now that you have implemented a few basic methods, you should use this toolbox on the dataset. Here are a few things that you might want to try.
Exploratory data analysis You should learn about your dataset - figure out which features are continuous, which ones are categorical, check if there are obvious relationships between the features, take a look at the distribution of each feature, and so on. Check https://en.wikipedia.org/wiki/Exploratory_data_analysis.
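As a starting point, a few pandas calls already reveal a lot; this is a minimal sketch, with "train.csv" as a placeholder for your dataset file:

```python
# A minimal exploratory-data-analysis sketch; "train.csv" is a placeholder.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")
print(df.shape)                            # number of samples and features
print(df.dtypes)                           # hints at continuous vs. categorical
print(df.describe())                       # per-feature summary statistics
print(df.isna().sum())                     # missing values per feature
print(df.select_dtypes("number").corr())   # pairwise feature correlations
df.hist(bins=30, figsize=(12, 8))          # distribution of each numeric feature
plt.show()
```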
Feature processing Cleaning your dataset by removing useless features and values, combining others, finding better representations of the features to feed your model, scaling the features, and so on. Check this article on feature engineering: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/.
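As one concrete example, standardizing continuous features is a common first step; a sketch, assuming the statistics are computed on the training set only so that no information leaks into your validation data:

```python
# A sketch of feature standardization using training-set statistics only,
# so that no information leaks from the validation/test data.
import numpy as np

def standardize(x_train, x_val):
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0                    # guard against constant features
    return (x_train - mean) / std, (x_val - mean) / std
```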
Determining whether a method overfits or underfits You should be able to diagnose whether your model is over- or underfitting the data and take actions to fix the problem. Recommended reading: Advice on applying machine learning methods by Andrew Ng: http://cs229.stanford.edu/materials/ML-advice.pdf.
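One simple diagnostic is to compare training and validation error while varying a capacity knob, such as the degree of a polynomial feature expansion. The sketch below uses a toy 1-D dataset and least squares purely for illustration:

```python
# Diagnosing over-/underfitting on a toy 1-D dataset (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)
x_tr, y_tr, x_va, y_va = x[:150], y[:150], x[150:], y[150:]

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

for degree in range(1, 11):
    phi_tr = np.vander(x_tr, degree + 1)   # polynomial feature expansion
    phi_va = np.vander(x_va, degree + 1)
    w, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    print(degree, rmse(phi_tr @ w, y_tr), rmse(phi_va @ w, y_va))
# Both errors high -> underfitting; low training error but much higher
# validation error -> overfitting.
```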
Applying methods and visualizing Beyond simply applying the models we have seen, it helps to try to understand what the ML model is doing. Try to find out which datapoints are wrongly classified and, if possible, why this is the case. Then use this information to improve your model. Check Pedro Domingos's “A few useful things to know about machine learning”: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
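For instance, a quick look at the misclassified validation points often suggests the next improvement. In the sketch below the predictions are simulated, so substitute your own model's output:

```python
# Inspecting misclassified points; `y_pred` is simulated here and should
# be replaced by your model's validation predictions.
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 100)                              # toy labels
y_pred = np.where(rng.random(100) < 0.9, y_val, 1 - y_val)   # ~10% errors
x_val = rng.normal(size=(100, 5))                            # toy features

wrong = np.flatnonzero(y_pred != y_val)
print(f"{len(wrong)} / {len(y_val)} validation points misclassified")
for i in wrong[:5]:                        # inspect a few failures by hand
    print("true:", y_val[i], "predicted:", y_pred[i], "features:", x_val[i])
```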
Accurately estimating how well your method is doing Apply cross-validation to estimate the test error.
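A k-fold cross-validation loop can be written in a few lines. The sketch below plugs in a least-squares model as an example; you would substitute your own training routine and metric:

```python
# A minimal k-fold cross-validation sketch around a least-squares model.
import numpy as np

def cross_validation_rmse(x, y, k=5, seed=1):
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        tr_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, *_ = np.linalg.lstsq(x[tr_idx], y[tr_idx], rcond=None)
        errors.append(np.sqrt(np.mean((x[val_idx] @ w - y[val_idx]) ** 2)))
    return np.mean(errors), np.std(errors)

# Toy usage with random data; substitute your own features and labels:
x = np.random.default_rng(0).normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5])
print(cross_validation_rmse(x, y))
```

Reporting the standard deviation across folds alongside the mean makes clear how stable your estimate is.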
Report Guidelines
In addition to finding a good model for the data, you will need to explain your methodology in a report.
Clearly describe the methods you used, state your conclusions, and argue why the results you obtained make (or do not make) sense. Keep the report short and to the point, with a strict limit of 4 pages. References may be put on an extra page.
To get started more easily with writing the report, we provide a LaTeX template here:
github.com/epfml/ML_course/tree/master/projects/project1/latex-example-paper
The template also contains some more helpful information on how to write a scientific report or paper. We are also happy to help you learn this during the exercise sessions and office hours if you ask us.
For more guidelines on what makes a good report, see the grading criteria above. In particular, don't forget to take care of the following:
- Reproducibility: Not only in the code but also in the report, include complete details about each algorithm you tried: e.g., which lambda values did you use for ridge regression? How exactly did you do that feature transformation? How many folds did you use for cross-validation?
- Baselines: Give clear experimental evidence: When you added this new combined feature, or changed the regularization, by how much did that increase or decrease the test error? It is crucial to always report such obtained differences in the evaluation metrics, and to include several properly implemented baseline algorithms as a comparison to your approach.
Some additional resources on LaTeX:
• https://github.com/VoLuong/Begin-Latex-in-minutes - getting started with LaTeX
• http://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/ - tutorial on LaTeX
• http://www.stdout.org/~winston/latex/latexsheet-a4.pdf - cheat sheet collecting the most useful LaTeX commands
• http://mirror.switch.ch/ftp/mirror/tex/info/first-latex-doc/first-latex-doc.pdf - example of how to create a document with LaTeX
• http://en.wikibooks.org/wiki/LaTeX - detailed tutorial on LaTeX
Producing figures for LaTeX in Python
There are some good visualization tools in Python. “matplotlib” is probably the single most used Python package for 2D graphics. The relevant tutorials are as follows:
• Matplotlib tutorial: http://www.labri.fr/perso/nrougier/teaching/matplotlib/
• Matplotlib tutorial: https://sites.google.com/site/scigraphs/tutorial
• Matplotlib Tutorial: http://jakevdp.github.io/mpl_tutorial/
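For a figure that blends in well with a LaTeX report, save a vector format (PDF) at roughly the final size, with font sizes close to the body text; a minimal sketch:

```python
# A sketch for producing a LaTeX-ready matplotlib figure.
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams.update({"font.size": 10})     # roughly match the report's text
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(4, 3))     # small enough for one column
ax.plot(x, np.sin(x), label=r"$\sin(x)$")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
fig.savefig("figure.pdf")                  # include via \includegraphics
```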
Regarding other useful Python data visualization libraries, please refer to this blog post for an overview.