GR5243 Midterm Project

The project will focus on an image recognition problem. You will construct a variety of machine learning models with the goal of generating predictive classifications.

Details
·         Group Project: This project will be completed in groups. You may use any publicly available written resources to help, but please cite your sources.

·         Materials to Submit: You must write a report in RMarkdown format. Please provide the code file (in Rmd format) and the output file (HTML preferred). We are supplying a template file to help you get started. The final report should be of moderate length (roughly 2000-3000 words, not strictly enforced). Include all of the code you used, and display any relevant tables, plots, or other supplementary materials.

In your report, give a detailed explanation of each step you take to arrive at your solution, along with a justification or interpretation of the results you obtain. The report should also include the lessons you learned from doing the project.

The Data
The MNIST Fashion database (https://github.com/zalandoresearch/fashion-mnist) collected a large number of images for different types of apparel. Each image is divided into small squares of equal area, called pixels. Within each pixel, a brightness measurement was recorded in grayscale; the brightness values range from 0 (white) to 255 (black). The original data set divided each image into 784 (28 by 28) pixels. As an example, one such image would be divided as follows:

[Figure: A 28-by-28 Pixel Image]

To facilitate more tractable computations, we have condensed these data into 49 pixels (7 by 7) per image. An example of dividing an image is shown below:

[Figure: A 7-by-7 Pixel Image]

For each picture, the first 7 pixels represent the top row, the next 7 pixels form the second row, etc.
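To make the layout concrete, a single row of 49 pixel values can be reshaped into a 7-by-7 matrix in R. This is a minimal sketch; the vector `row_pixels` is a hypothetical stand-in for one image's pixel columns:

```r
# A minimal sketch: reshape one image's 49 pixel values into a 7-by-7 grid.
# row_pixels is a hypothetical vector standing in for one row of pixel data.
row_pixels <- runif(49, min = 0, max = 255)

# byrow = TRUE because the first 7 values form the top row of the image.
img <- matrix(row_pixels, nrow = 7, ncol = 7, byrow = TRUE)

# Display the grid; reversing the rows puts row 1 at the top of the plot,
# and the palette runs from white (0) to black (255) as in the data.
image(t(img[7:1, ]), col = gray.colors(256, start = 1, end = 0))
```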

The assignment provides the following files:

·         Training Set: MNIST-fashion training set-49.csv. This file contains 60,000 rows of data.

·         Testing Set: MNIST-fashion testing set-49.csv. This file contains 10,000 rows of data.

Each file includes the following columns:

·         label: This provides the type of fashionable product shown in the image.

·         pixel1, pixel2, …, pixel49: These columns provide the grayscale measurement for the 49 pixels of the image.
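A minimal sketch of loading the two files in R, assuming they sit in the working directory:

```r
# A minimal sketch: load the training and testing sets and check their shape.
train <- read.csv("MNIST-fashion training set-49.csv")
test <- read.csv("MNIST-fashion testing set-49.csv")

dim(train)          # expected: 60000 rows, 50 columns (label + 49 pixels)
dim(test)           # expected: 10000 rows, 50 columns
table(train$label)  # counts of each type of apparel
```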

A Practical Machine Learning Challenge
What are the best machine learning models for classifying the labels of the testing set based on the data of the training set? How small a sample size do you need to generate the “best” predictions? How long does the computer need to run to obtain good results? To balance these competing goals, we will introduce an overall scoring function for the quality of a classification.

\(\text{Points} = 0.25\,A + 0.25\,B + 0.5\,C\)

where

A is the proportion of the training rows that are utilized in the model. For instance, if you use 30,000 of the 60,000 rows, then A = 30,000 / 60,000 = 0.5;

B = \(\min\left(1, \frac{X}{60}\right)\), where X is the running time of the selected algorithm in seconds. Algorithms that take at least 1 minute to run will have the value B = 1, which incurs the full run-time penalty.

C is the proportion of the predictions on the testing set that are incorrectly classified. For instance, if 1,000 of the 10,000 rows are incorrectly classified, then C = 1,000 / 10,000 = 0.1.

You will create and evaluate different machine learning models on different sample sizes. The quality of different combinations of the models and the sample sizes can be compared based on their Points. The overall goal is to build a classification method that minimizes the value of Points. In this setting, the ideal algorithm would use as little data as possible, implement the computation as quickly as possible, and accurately classify as many items in the testing set as possible. In practice, there are likely to be trade-offs. Part of the challenge will be adapting to the rules of the game to improve your score!
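As a concrete illustration, the scoring function could be written in R as follows; the function name `score_points` is illustrative:

```r
# A minimal sketch of the scoring function defined above.
score_points <- function(n_used, runtime_sec, error_rate, n_train = 60000) {
  A <- n_used / n_train          # proportion of training rows used
  B <- min(1, runtime_sec / 60)  # run-time penalty, capped at 1
  C <- error_rate                # misclassification rate on the testing set
  0.25 * A + 0.25 * B + 0.5 * C
}

# Example: 30,000 rows, a 15-second run time, and a 10% error rate.
score_points(n_used = 30000, runtime_sec = 15, error_rate = 0.1)
# 0.25 * 0.5 + 0.25 * 0.25 + 0.5 * 0.1 = 0.2375
```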

The Rules
·         You may select 3 different sample sizes to work with.

·         For each selected sample size, you will generate 3 separate model development sets by sampling from the rows of the overall training data randomly without replacement. (You may use the sample function in R.) If the full sample size of the training data is selected, then please select your three model development sets by drawing randomly with replacement from the full training data set. A sketch of this sampling scheme, with example names for the 9 model development data sets, is shown below:
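This is a minimal sketch, assuming the training data has been loaded as `train`; the sample sizes and the set names (e.g. `dat_500_1`) are illustrative:

```r
# A minimal sketch: build 3 model development sets for each sample size.
set.seed(72)
sample_sizes <- c(500, 1000, 2000)

dev_sets <- list()
for (n in sample_sizes) {
  for (i in 1:3) {
    # Sample without replacement, unless the full training set is used.
    rows <- sample(nrow(train), size = n, replace = (n == nrow(train)))
    dev_sets[[sprintf("dat_%d_%d", n, i)]] <- train[rows, ]
  }
}
names(dev_sets)  # "dat_500_1", "dat_500_2", ..., "dat_2000_3"
```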

Then, on the 9 model development data sets, you will conduct the following work:

You will build 10 different classification models using machine learning techniques on each of the 9 model development data sets. The selected methods must include at least 5 of the following techniques:
Multinomial logistic regression (Package: nnet; Function: multinom)
K-Nearest Neighbors (Package: class; Function: knn)
Classification Tree (Package: rpart; Function: rpart)
Random Forest (Package: randomForest; Function: randomForest)
Ridge Regression (Package: glmnet; Function: glmnet with alpha = 0)
Lasso Regression (Package: glmnet; Function: glmnet with alpha = 1)
Support Vector Machines (Package: e1071; Function: svm)
Generalized Boosted Regression Models (Package: gbm; Function: gbm or Package: xgboost; Function: xgboost)
Neural Networks (Package: nnet; Function: nnet)
Many of these techniques may also be implemented using other libraries such as caret. The requirement of 5 separate modeling techniques applies to the choice of methods; it does not constrain your choice of which packages are used to implement the methods.
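As one illustration, K-Nearest Neighbors from the class package could be fit to a development set as follows. This sketch reuses the hypothetical `dev_sets` and `test` objects from the earlier sketches:

```r
library(class)

# A minimal sketch: fit K-Nearest Neighbors (k = 5) on one development
# set and classify the testing set.
pixel_cols <- paste0("pixel", 1:49)
dev <- dev_sets[["dat_500_1"]]

pred <- knn(train = dev[, pixel_cols],
            test = test[, pixel_cols],
            cl = factor(dev$label),
            k = 5)

mean(pred != test$label)  # misclassification rate C
```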

·         Multiple variants of the same technique (e.g., K-Nearest Neighbors with k = 5 and k = 10) would each count separately toward your 10 models. The variants must have different parameter settings to be considered distinct models.

·         Any other machine learning technique is also allowed.

·         At least one model must be an ensemble model that combines the results of some of the other models you have built. This model may not be constructed from the original data; instead, the predictions from the other selected models will serve as its inputs. This can be as simple as averaging the predictions from two or more other algorithms. You are also welcome to use these inputs to build a more complex model, as in the sketch below.
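A minimal sketch of a majority-vote ensemble; the prediction vectors `pred_knn`, `pred_tree`, and `pred_rf` are hypothetical outputs of three previously fitted models:

```r
# A minimal sketch: majority vote over the predictions of three models.
# pred_knn, pred_tree, and pred_rf are hypothetical vectors of predicted
# labels for the testing set, one element per test image.
votes <- cbind(as.character(pred_knn),
               as.character(pred_tree),
               as.character(pred_rf))

# For each test image, take the most frequent predicted label.
ensemble_pred <- apply(votes, 1, function(x) names(which.max(table(x))))

mean(ensemble_pred != test$label)  # misclassification rate of the ensemble
```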

·         For each of the 9 model development sets, you will fit all 10 of the selected classification models. This means that a total of 90 separate models will be considered.

·         For each of the 90 fitted models, the sample size proportion A, the running time score B and the misclassification rate C will be computed and recorded.
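For example, the run time X can be captured with system.time, as in this sketch (continuing the hypothetical objects from the earlier sketches):

```r
# A minimal sketch: record A, B, and C for a single fitted model.
A <- nrow(dev) / nrow(train)  # proportion of training rows used

timing <- system.time({
  pred <- knn(train = dev[, pixel_cols],
              test = test[, pixel_cols],
              cl = factor(dev$label),
              k = 5)
})
B <- min(1, as.numeric(timing["elapsed"]) / 60)  # X = elapsed seconds

C <- mean(pred != test$label)
points <- 0.25 * A + 0.25 * B + 0.5 * C
```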

As an example, with selected sample sizes of 500, 1000, and 2000, the results of the 90 models may be recorded as follows:

[Table: one row per fitted model, recording the Model, Sample Size, sampled Data set, A, B, C, and Points]
We will evaluate the results of a model at a selected sample size by averaging the values of A, B, C, and Points across the 3 randomly sampled data sets. To do this, compute the mean of these quantities grouped by the Model and Sample Size.
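Assuming the 90 results have been collected into a data frame `results` with columns Model, Sample_Size, A, B, C, and Points (illustrative names), the averaging could be done with aggregate:

```r
# A minimal sketch: average A, B, C, and Points over the 3 replicate
# data sets for each combination of Model and Sample_Size.
scoreboard <- aggregate(cbind(A, B, C, Points) ~ Model + Sample_Size,
                        data = results, FUN = mean)
```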

·         Then you will report an overall scoreboard of your average results for the 30 combinations of Model and Sample Size:

[Table: overall scoreboard of average A, B, C, and Points for the 30 Model and Sample Size combinations]
The values of A, B, C, and Points should all be rounded to 4 decimal places.

·         The values on the scoreboard should be sorted in increasing order of the Points scored.
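Continuing the scoreboard sketch, the rounding and sorting might look like this:

```r
# A minimal sketch: round to 4 decimal places, then sort by Points.
num_cols <- c("A", "B", "C", "Points")
scoreboard[num_cols] <- round(scoreboard[num_cols], 4)
scoreboard <- scoreboard[order(scoreboard$Points), ]
```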
