For this homework you will implement two machine learning algorithms in C++ and compare their results and performance to the equivalent functions in R.
You may work alone or with one other person.
Steps:
1. Perform logistic regression on the given data set in an R script (not Rmd) using R library functions. Evaluate with the metrics indicated in details below. Your R script should also include at least 2 graphs and 4 R functions for data exploration.
2. Write a C++ program to implement logistic regression from scratch, and evaluate with the metrics indicated in details below.
3. Perform naive Bayes on the given data set in an R script (not Rmd) using R library functions. Evaluate with the metrics indicated in details below. Your R script should also include at least 2 graphs and 4 R functions for data exploration.
4. Write a C++ program to implement naive Bayes from scratch, and evaluate with the metrics indicated in details below.
5. Report. Write a summary of the accuracy and performance (run time) of the two approaches. Include screen shots of the R runs and the C++ runs for each algorithm. Cite references (any format) you used for the algorithm, including coding examples. Include screen shots of your R graphs. No particular format is required for either the report or references.
Notes:
• Indicate in your summary how you computed run times. Here are some suggestions:
o For the R scripts you can call proc.time() at the start and end of the machine learning part of the script and subtract the start time from the end time.
o For the C++ programs, your IDE may report the run time; otherwise, measure it from the terminal.
o Windows: https://stackoverflow.com/questions/673523/how-do-i-measure-execution-time-of-a-command-on-the-windows-command-line
o Mac: https://stackoverflow.com/questions/26466572/mac-os-x-shell-script-measure-time-elapsed
Note: The timing for the R code should be only that portion running the algorithm, not parts that run data exploration functions or create graphs.
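As an alternative to IDE or terminal timing, the C++ program can time just the algorithm itself, which mirrors the proc.time() approach suggested for R. The following is a minimal sketch using std::chrono from the standard library; the helper name time_ms is hypothetical and not part of any library.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// `time_ms` is an illustrative helper name, not a library function.
template <typename F>
double time_ms(F&& work) {
    // steady_clock is monotonic, so it is not affected by system clock changes.
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To use it, wrap only the training (and, if you wish, prediction) step in a lambda and pass it to time_ms, keeping file reading and printing outside the timed region so the figure is comparable to the R timing.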
Details: Logistic Regression
• Data: the plasma data set in the HSAUR package. You will need to export it with write.csv() so your C++ program can read it.
Use all the data (32 observations) to build the model.
• R script:
o train a logistic regression model on all the data, ESR~fibrinogen, using glm()
o print the coefficients of the model
o build the model “from scratch” in R as shown in the book
o make sure you get the same coefficients in each approach
o note that we are not doing test set evaluation on this data
• C++ program:
o implement in C++ the same steps for logistic regression from scratch
o feel free to use whatever data structures you like: arrays, vectors, etc.
o if you have a Linux system, you may want to check out the Armadillo library for matrix multiplication: http://arma.sourceforge.net/
o feel free to use whatever programming paradigm you like, but make your C++ code fast
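The from-scratch steps can be sketched in C++ as follows. This is a minimal sketch, not the book's code: it assumes batch gradient ascent on the log-likelihood with a fixed learning rate and a fixed number of iterations, a single predictor (fibrinogen), and a 0/1 label for ESR. The names fit_logistic, Coefficients, learning_rate, and iterations are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic (sigmoid) function: maps log-odds to a probability.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Coefficients of the model  log-odds = w0 + w1 * x.
struct Coefficients {
    double w0;  // intercept
    double w1;  // slope for the single predictor
};

// Fits a one-predictor logistic regression by batch gradient ascent
// on the log-likelihood. x and y must have the same length; y holds 0/1.
Coefficients fit_logistic(const std::vector<double>& x,
                          const std::vector<int>& y,
                          double learning_rate,
                          std::size_t iterations) {
    Coefficients w{0.0, 0.0};
    for (std::size_t it = 0; it < iterations; ++it) {
        double grad0 = 0.0;
        double grad1 = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            // error = observed label minus predicted probability
            const double error = y[i] - sigmoid(w.w0 + w.w1 * x[i]);
            grad0 += error;         // gradient with respect to the intercept
            grad1 += error * x[i];  // gradient with respect to the slope
        }
        w.w0 += learning_rate * grad0;
        w.w1 += learning_rate * grad1;
    }
    return w;
}
```

glm() fits the model with iteratively reweighted least squares, so the coefficients from a gradient-based loop like this one should only be expected to match once the loop has run long enough to converge. If they differ noticeably, try more iterations or a different learning rate before looking for a bug.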
Details: Naïve Bayes
• Data: Titanic data set “titanic_project.csv” on Piazza. Use the first 900 observations for train, the rest for test.
• R script:
o train a naïve Bayes model on the train data, survived~pclass+sex+age
o print the model, which will show all the probabilities learned from the data
o test on the test data
o print metrics for accuracy, sensitivity, specificity
• C++ program:
o implement naïve Bayes in C++; the code in the book should help
o train/test on the same data as in the R script; output the same metrics
o feel free to use whatever data structures you like: arrays, vectors, etc.
o Here is a great video that gives a conceptual picture of naïve Bayes with Gaussian predictors: https://www.youtube.com/watch?v=r1in0YNetG8
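o The following formula shows how to calculate the likelihood of a continuous predictor x given class c under the Gaussian assumption, where \mu_c and \sigma_c^2 are the mean and variance of x within class c in the training data. The book gives hints as well.
$$P(x \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x - \mu_c)^2}{2\sigma_c^2}\right)$$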
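The likelihood of a continuous predictor such as age can be sketched in C++ as below, assuming the Gaussian form described in the video: the density of a normal distribution with the class's mean and variance. The function names are illustrative, and the choice of the n - 1 denominator for the variance is an assumption made so that the value agrees with R's var(); check it against the book's code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of the values (assumes a non-empty vector).
double mean_of(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum / static_cast<double>(values.size());
}

// Sample variance with an n - 1 denominator (assumes at least two values).
double sample_variance(const std::vector<double>& values) {
    const double m = mean_of(values);
    double sum_sq = 0.0;
    for (double v : values) sum_sq += (v - m) * (v - m);
    return sum_sq / static_cast<double>(values.size() - 1);
}

// Gaussian density of x for a class with the given mean and variance.
double gaussian_likelihood(double x, double mean, double variance) {
    const double pi = std::acos(-1.0);
    const double diff = x - mean;
    return std::exp(-(diff * diff) / (2.0 * variance)) /
           std::sqrt(2.0 * pi * variance);
}
```

In training, compute the mean and variance of age separately for each class. At prediction time, multiply the resulting likelihood for the passenger's age by the class prior and the discrete likelihoods for pclass and sex, then normalize across the classes.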
• Report:
o Write a summary of the two implementations, R and C++. Did you get the same results? How do the run times compare? How did you measure execution time?
o Include screen shots of the output of each program
o Include screen shots of the run times of each program
o Write out the algorithm you used for training the classifier
o Cite all references used
o No required format for the report