This lab revisits important concepts covered so far and aims to familiarize you with implementing specific algorithms.
Pattern Recognition The goal of this lab is to implement a K-Nearest Neighbours (KNN) classifier, a Stochastic Gradient Descent (SGD) classifier, and a Decision Tree (DT) classifier. Background information on these classifiers is provided at the end of this document.
The experiments in this lab will be based on scikit-learn's digits data set, which was designed for testing classification algorithms. This data set contains 1797 low-resolution images (8 × 8 pixels) of the digits 0 to 9, and the true digit value (also called the label) of each image is given as well (see the sample below).
We will predominantly be using scikit-learn for this lab, so make sure you have installed it. The following scikit-learn modules will need to be imported:
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Sample of the first 10 training images and their corresponding labels.
Task: Perform image classification on the digits data set.
Develop a program to perform digit recognition. Classify the images from the digits data set using the three classifiers mentioned above and compare the classification results. The program should contain the following steps:
Set Up Step 1. Import relevant packages (most listed above).
Step 2. Load the images using sklearn’s load_digits().
Optional: Familiarize yourself with the data set. For example, find out how many images and labels there are and the size of each image, and display some of the images with their labels. The following sketch (assuming matplotlib for plotting) shows one way to plot the first entry (digit 0):
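import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# The data set holds 1797 images of 8 x 8 pixels each.
print(digits.images.shape)   # (1797, 8, 8)
print(digits.target.shape)   # (1797,)

# Display the first image together with its label.
plt.imshow(digits.images[0], cmap="gray")
plt.title("Label: %d" % digits.target[0])
plt.show()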
Step 3. Split the images using sklearn's train_test_split() with a test size anywhere from 20% to 30% (inclusive).
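For example (a sketch; test_size=0.25 lies within the allowed range, and random_state is optional, used here only for reproducibility):

# digits.data contains the images flattened to 64-element vectors.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)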
Classification For each of the classifiers (KNeighborsClassifier, SGDClassifier, and DecisionTreeClassifier) perform the following steps:
Step 4. Initialize the classifier model.
Step 5. Fit the model to the training data.
Step 6. Use the trained/fitted model to predict the labels of the test data.
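A minimal sketch of Steps 4-6, shown for KNN (the other two classifiers follow the same fit/predict pattern):

# Step 4: initialize the classifier (parameters can be tuned later).
knn = KNeighborsClassifier(n_neighbors=5)

# Step 5: fit the model to the training data.
knn.fit(X_train, y_train)

# Step 6: predict the labels of the test images.
y_pred = knn.predict(X_test)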
Evaluation Step 7. For each of the three classifiers, evaluate the digit classification performance by calculating the accuracy, the recall, and the confusion matrix.
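A possible sketch of Step 7 using sklearn.metrics; note that recall for a multi-class problem needs an averaging mode (macro is used here as one reasonable choice):

accuracy = metrics.accuracy_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred, average="macro")
conf_matrix = metrics.confusion_matrix(y_test, y_pred)

print("Accuracy: %.3f" % accuracy)
print("Recall (macro): %.3f" % recall)
print(conf_matrix)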
Experiment with the number of neighbours used in the KNN classifier to find the best value for this data set. You can adjust the number of neighbours with the n_neighbors parameter (the default value is 5).
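One simple search strategy, sketched below (the range of values tried is only illustrative):

# Try several neighbour counts and keep the one with the best test accuracy.
best_k, best_acc = 0, 0.0
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print("Best k: %d (accuracy %.3f)" % (best_k, best_acc))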
Print the accuracy and recall of all three classifiers and the confusion matrix of the best-performing classifier. Submit a screenshot for marking (see the example below for the case of just a 6-class model). Also submit your code and include a brief justification for your chosen KNN parameter settings.
Background Information
K-Nearest Neighbours (KNN) The KNN algorithm is very simple and very effective. The model representation for KNN is the entire training data set. Predictions for a new data point are made by searching the entire training set for the K most similar instances (the neighbours) and summarizing the output variable of those K instances. For regression problems this might be the mean of the output variable; for classification problems it is typically the mode (most common) class value. The trick lies in how the similarity between data instances is determined.
A 2-class KNN example with 3 and 6 neighbours (from Towards Data Science).
Similarity: To make predictions we need to calculate the similarity between any two data instances. This way we can locate the K most similar data instances in the training data set for a given member of the test data set and in turn make a prediction. For a numeric data set, we can directly use the Euclidean distance measure. This is defined as the square root of the sum of the squared differences between the two arrays of numbers.
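The following sketch illustrates both ideas (the Euclidean distance measure and the mode-based prediction) in plain NumPy; the function name is only illustrative:

import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    # Euclidean distance: square root of the sum of squared differences.
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k nearest training instances.
    nearest = np.argsort(dists)[:k]
    # Return the most common class value among the neighbours.
    return np.bincount(y_train[nearest]).argmax()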
Parameters: Refer to the scikit-learn documentation for available parameters.
Decision Tree (DT) See https://en.wikipedia.org/wiki/Decision_tree_learning for more information.
The algorithm for constructing a decision tree is as follows:
1. Select a feature to place at the node (the first one is the root).
2. Make one branch for each possible value.
3. For each branch node, repeat Steps 1 and 2.
4. If all instances at a node have the same classification, stop developing that part of the tree.
How do we determine which feature to split on in Step 1? One way is to use measures from information theory, such as entropy or information gain, as explained in the lecture.
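In scikit-learn the splitting measure is selected with the criterion parameter; a brief sketch:

# Use entropy (information gain) instead of the default Gini impurity.
dt = DecisionTreeClassifier(criterion="entropy")
dt.fit(X_train, y_train)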
Stochastic Gradient Descent (SGD)
See https://scikit-learn.org/stable/modules/sgd.html for more information.
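As a brief sketch: with its default hinge loss, SGDClassifier fits a linear SVM by stochastic gradient descent (the parameter values shown are illustrative):

sgd = SGDClassifier(loss="hinge", max_iter=1000)
sgd.fit(X_train, y_train)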
Experiment with Different Classifiers See https://scikit-learn.org/stable/modules/multiclass.html for more information. There are many more models to experiment with. Here is an example of a clustering model:
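One such model is KMeans from sklearn.cluster; a minimal sketch that groups the digit images into ten clusters (unsupervised, so the labels are not used):

from sklearn.cluster import KMeans

# Group the flattened images into 10 clusters (ideally one per digit).
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(digits.data)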