$30
# Digits
-------------------
In this assignment, your goal is to write a program that
can recognize handwritten digits. This is a classic problem
in pattern recognition and machine learning. The state of
the art techniques can now achieve a recognition rate of
over 99%. We, however, will implement a simple algorithm
called k-nearest neighbor algorithm, which has a lower
accuracy but is simpler.
## Supervised Machine Learning
The k-nearest neighbor algorithm belongs to a category of
algorithm called _supervised machine learning_ algorithms.
These algorithms teach or train the machine to learn to do
something, by showing the machine many samples. These are
called _training samples_. After showing the machine a
bunch of these training samples, we can then _test_ the
machine how well it has learn, by showing the machine some
_testing samples_.
In our case, we want to train the machine to recognize
handwritten digits. During the training phase, we will
present a set of handwritten digits in the form of images
(as 2D arrays) to the machine, together with a label
indicating which digits it is (i.e., "this picture shows
the digit 1, this picture shows the digit 4, etc.").
During the testing phase, we will then present a set of
handwritten digits to the machine, and ask, "ok, which
digit to you think this is?"
We will then compare the machine's answer with the label
of the handwritten digit (also called the _ground truth_)
to see whether the machine recognize correctly or not.
## Representation of a handwritten digit
A handwritten digit is an image, represented as a 28x28
array of characters, consisting of . and #. An example is:
```
............................
............................
............................
..........#######...........
.........########...........
........##########..........
.......####....###..........
.......###....####..........
..............####..........
.............####...........
............#####...........
............####............
...........####.............
...........###..............
..........####..............
.........####...............
.........####...............
........####................
........###.................
........###...........####..
........###################.
........###################.
.........###########........
............................
............................
............................
............................
............................
```
The image above contains the digit 2.
## Training Data
For this assignment, we will use the data provided by the
MNIST Handwritten Digit Database. The original data has
been post-procesed into CSV file by Joseph Redmon, then
post-processed again by yours truly into the format above.
The dataset from above contains 60,000 training samples. I
have created two smaller subsets for you to test your code.
The first subset contains 10 samples for each digit --
this is given in the file `train100.in`.
The second subset contains 6 samples for digits 0 and 1 --
this is given in the file `train6.in`.
The full training set from MNIST is too big to be included
on GitHub. You can read it from `~cs1010/as09/train60000.in`
if you want to play with it.
## Testing Samples
We provide two sets of testing samples. The first contains
three testing samples per digit. The second contains only
a single digit corresponding to the example below.
The full training dataset from MNIST contains
10000 digits. It takes a long time to test every
handwritten digit in this file, so you should do
this only after you have made sure that your code is
fast and is correct. Again, the file is too big to
be posted on GitHub. You can read it directly from
`~cs1010/as09/test10000.in`.
## The Algorithm
We will use the k-nearest neighbor algorithm for this
assignment. The idea of this algorithm is that, given
a testing sample, the machine will compare it to all
the training samples, and find out which digit this
testing sample is most similar to.
Let's define the distance between the two handwritten
digits d(x1,x2) as the number of pixels that are different
between them, i.e., how many pixels are `#` in one image
but is `.` in the other image.
Given a test sample, q, we find the distance between
q and all the available training samples and find the
k training samples with the smallest distance (i.e.,
k nearest neighbors).
k is usually small -- we use k=5 in this assignment.
The intuition is that q must be "close" to the training
data that has the same labels (i.e., the same digits). So
we look at the these k nearest neighbors and find the
most common digit d among them. We then recognize q as
containing the digit d. If there are more than one most
common digits or more than k nearest neighbors , then
we break ties by returning the nearest (i.e., shorter
distance) digit. In the case where even the distance is
the same, we return the smaller digit.
## Efficiency
Suppose we have n training samples, recognizing a digit
should take no more than O(kn) time (or O(n) since k is
a constant).
There is also an opportunity to stop early the calculation
of the distance between two handwritten digits if the
distance is too large, pruning away redundant work.
## Example
Consider the following simple example with six training
samples and a test sample.
```
6
0
............................
............................
............................
............................
...............#####........
...............#####........
.............########.......
............##########......
...........###########......
..........########.###......
..........####.##...###.....
.........#####......###.....
........####........###.....
.......####.........###.....
.......###..........###.....
......####..........###.....
......###..........###......
......###.........####......
......###........###........
......###......####.........
......####..#######.........
......###########...........
.......########.............
........#####...............
............................
............................
............................
............................
0
............................
............................
............................
............................
.................####.......
................######......
...............#######......
.............#########......
...........###########......
..........############......
.........#############......
........#####.####.###......
.......#####..###...##......
.......####.........##......
.......##...........##......
......###...........##......
.....####..........###......
.....####..........###......
.....####.........###.......
.....####.......####........
.....##############.........
......############..........
........########............
.........######.............
............................
............................
............................
............................
0
............................
............................
............................
............................
............................
..............######........
..............########......
.............#########......
.............#########......
............##########......
..........######..####......
..........#####...####......
..........#####...####......
..........#####..#####......
..........####...#####......
.........####....####.......
........#####...#####.......
........#####...####........
........#####..#####........
........####..#####.........
........##########..........
........#########...........
........#########...........
.........######.............
..........#####.............
............................
............................
............................
1
............................
............................
............................
............................
............................
..................####......
.................#####......
.................#####......
................####........
...............#####........
...............####.........
...............####.........
..............####..........
.............####...........
............####............
............####............
...........####.............
...........####.............
..........####..............
..........####..............
.........####...............
.........####...............
........####................
........####................
.........###................
............................
............................
............................
1
............................
............................
............................
............................
............###.............
............####............
............####............
............####............
............####............
............####............
............#####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............#####..........
.............#####..........
..............####..........
..............####..........
..............####..........
...............###..........
............................
............................
............................
............................
1
............................
............................
............................
............................
............................
.............##.............
.............##.............
.............##.............
.............##.............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
..............###...........
..............##............
..............##............
..............###...........
..............##............
..............##............
............................
............................
............................
```
The test sample is
```
1
0
............................
............................
............................
............................
............................
..............#####.........
.............#######........
...........##########.......
..........####.##.####......
.........####......####.....
.........###........###.....
........####.........##.....
........###..........##.....
.......###...........##.....
.......###...........##.....
.......###...........##.....
......####...........##.....
......####..........###.....
.......###..........###.....
.......###.........###......
........###.......###.......
........####....#####.......
.........##########.........
..........#########.........
............#####...........
............................
............................
............................
```
The distances between the test sample and each of the
training sampes are 101, 120, 162, 174, 173, 162. The
k nearest neighbors belong to digits 0, 0, 0, 1, and
1. The most common digit among these neighbors is 0,
so we conclude (correctly) that the test sample is
digit 0.
## Using struct
One of the objectives of this assignment is to see if you
know how to use `struct`.
For this assignment, you must define and use two `struct`:
- a `struct digit` that encapsulates the 2D array that
represents the handwritten digit and its label,
- a `struct neighbor` that encapsulates a neighboring
digit and the distance to it.
## Input/Output
Write a program digits that reads, from the standard input,
the following:
- A positive integer n, corresponding to the number of
training samples, then repeatedly read n handwritten
digits, containing:
- a label corresponding to the digit in the next image
(a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only,
representing a handwritten digit
- followed by another positive integer m, corresponding to
the number testing samples, then repeatedly read m
handwritten digits, containing:
- a label corresponding to the digit in the next image
(a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only,
representing a handwritten digit
Then prints, to the standard output, the following:
For each testing sample,
- print the digit it is labeled as, followed by a space,
followed by the digit it is recognized as.
Finally, print a double value that corresponds to the
accuracy, i.e, the percentage of training testing samples
correctly recognized.
We separated the training samples and testing samples into
two files, so the usual way of redirecting a file into your
program does not work anymore (since you need two files).
The way to run your program is to do the following:
```
cat <training samples> <testing samples> | ./digits
```
We use `cat` which concatenate two files into one to pass
both the training samples and testing samples into the
program using the pipe `|`.
The name convention for the output files is different for
this assignment. The name is formatted as X-Y.out where X
refers to the training samples trainX.in and Y refers to
the testing samples testY.in.
## Sample Runs
```
ooiwt@pe121:~/cs1010/as09$ cat inputs/train6.in inputs/test1.in | ./digits
0 0
100.0000
ooiwt@pe121:~/cs1010/as09$ cat inputs/train100.in inputs/test30.in | ./digits
0 0
0 0
0 6
1 1
1 1
1 1
2 2
2 5
2 2
3 3
3 3
3 3
4 9
4 4
4 9
5 5
5 3
5 3
6 2
6 6
6 6
7 7
7 7
7 7
8 8
8 1
8 8
9 9
9 9
9 9
73.3333
```