• Select 2 large data sets (minimum: 30K rows). Good sources are listed at the end of this document. One should be suitable for regression and the other for classification.
For each data set, your project will be evaluated as follows:
• You will get more points for larger/messier data sets:
o 0-5 pts <30K rows
o 6-10 pts ≥30K rows
• Data cleaning:
o provide a link to where you found the data
o describe the steps you took to clean the data (more points for messier data that needed cleaning)
• Data exploration:
o use at least 5 R functions for data exploration
o create at least 2 informative R graphs for data exploration
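As a rough illustration of the data exploration requirement, the sketch below uses five base R functions and two base R graphs. The built-in airquality data frame is only a stand-in (an assumption for illustration, far smaller than the 30K-row minimum); the choice of functions and plots is one possibility among many.

```r
# Stand-in data: the built-in airquality data frame, used only because it
# ships with R. A project data set would be much larger.
df <- airquality

# Five exploration functions
dims <- dim(df)                   # number of rows and columns
str(df)                           # column names and types
print(summary(df))                # per-column summary statistics
print(head(df))                   # first six rows
na_counts <- colSums(is.na(df))   # missing values per column
print(na_counts)

# Two exploration graphs
hist(df$Temp, main = "Distribution of Temp", xlab = "Temp (degrees F)")
plot(df$Temp, df$Ozone, main = "Ozone vs. Temp",
     xlab = "Temp (degrees F)", ylab = "Ozone (ppb)")
```

The missing-value counts from colSums(is.na(df)) also feed directly into the data cleaning description, since they show which columns need attention.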
• Run at least 3 ML algorithms on each data set, using at least 5 algorithms in all.
o this portion of your R script should include:
§ code to run the algorithms
§ commentary on feature selection you performed and why
§ code to compute your metrics for evaluation as well as commentary discussing the results
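A minimal sketch of what "code to run the algorithms" and "code to compute your metrics" might look like, assuming a simple 80/20 train/test split. The built-in mtcars data, the predictors wt and hp, the seed, and the 0.5 probability cutoff are all illustrative assumptions, not requirements of the project.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Stand-in data: built-in mtcars, far smaller than the project requires.
df <- mtcars
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Regression: linear model predicting mpg from weight and horsepower
reg_fit  <- lm(mpg ~ wt + hp, data = train)
reg_pred <- predict(reg_fit, newdata = test)
rmse     <- sqrt(mean((test$mpg - reg_pred)^2))  # root mean squared error
reg_cor  <- cor(reg_pred, test$mpg)              # predicted vs. actual

# Classification: logistic regression predicting transmission type (am is 0/1).
# On a sample this small, glm may warn about fitted probabilities near 0 or 1.
clf_fit  <- glm(am ~ wt + hp, data = train, family = binomial)
clf_prob <- predict(clf_fit, newdata = test, type = "response")
clf_pred <- ifelse(clf_prob > 0.5, 1, 0)         # 0.5 cutoff is an assumption
accuracy <- mean(clf_pred == test$am)
conf_mat <- table(predicted = clf_pred, actual = test$am)

print(c(rmse = rmse, correlation = reg_cor, accuracy = accuracy))
print(conf_mat)
```

Evaluating on held-out test rows rather than on the training rows keeps the metrics comparable across algorithms, which matters for the ranking in the results analysis.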
• Run at least one ensemble method such as Random Forest or XGBoost
o this portion of your R script should include:
§ code to run the algorithms
§ commentary on feature selection you performed and why
§ code to compute your metrics for evaluation as well as commentary discussing the results
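To illustrate the idea behind an ensemble, the sketch below hand-rolls bagging (bootstrap-aggregated decision trees with majority voting) using the rpart package. This is a simpler relative of Random Forest, not Random Forest or XGBoost themselves; in a real project the randomForest or xgboost packages would be the usual choice. The built-in iris data, the 120-row training split, and the 25 trees are illustrative assumptions.

```r
library(rpart)  # recommended package included with standard R installations

set.seed(42)  # arbitrary seed for reproducibility

# Stand-in data: built-in iris, far smaller than the project requires.
df <- iris
train_idx <- sample(nrow(df), size = 120)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

n_trees <- 25
# Fit each tree on a bootstrap resample of the training rows and collect
# its class predictions for the test rows.
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  tree <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(tree, newdata = test, type = "class"))
})

# votes has one row per test observation and one column per tree;
# the ensemble prediction is the most frequent class in each row.
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bag_pred == as.character(test$Species))
print(accuracy)
```

Comparing this accuracy against a single rpart tree fit on the same training rows is one way to generate commentary on whether the ensemble actually helped.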
• Results analysis
o rank the algorithms from best to worst performing on your data
o add commentary on the performance of the algorithms
o your analysis concerning why the best performing algorithm worked best on that data
o commentary on what your script was able to learn from the data (big picture) and if this is likely to be useful
• Project depth
o 0-3 project minimally meets requirements
o 4-6 project exceeds minimum requirements
o 7-10 project went well above the requirements