$25
Analytical Part
Suppose x = (x1,x2,··,xd) and z = (z1,z2,···,zd) be any two points in a high-dimensional space (i.e., d is very large).Try to prove the following, where the right-hand side quantity represent the standardEuclidean distance.
(1)
Hint: Use Jensen’s inequality – If X is a random variable and f is a convex function, then f(E[X]) ≤ E[f(X)].
We know that the computation of nearest neighbors is very expensive in the highdimensional space. Discuss how we can make use of the above property to make the nearest neighbors computation efficient?
We briefly discussed in the class about Locality Sensitive Hashing (LSH) algorithm to make the nearest neighbor classifier efficient. Please read the following paper and briefly summarize the key ideas as you understood:
Alexandr Andoni, Piotr Indyk: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of ACM 51(1): 117-122 (2008) http:// people.csail.mit.edu/indyk/p117-andoni.pdf
We know that we can convert any decision tree into a set of if-then rules, where there is one rule per leaf node. Suppose you are given a set of rules R = {r1,r2,··,rk}, where ri corresponds to the ith rule. Is it possible to convert the rule set R into an equivalent decision tree? Explain your construction or give a counterexample.
Please read the following paper and briefly summarize the key ideas as you understood (You can skip the proofs, but it is important to understand the main results):
Andrew Y. Ng, Michael I. Jordan: On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. NIPS 2001: 841-848 http://ai.stanford.edu/~ang/ papers/nips01-discriminativegenerative.pdf
Naive Bayes vs. Logistic RegressionLet us assume that the training data satisfies the Naive Bayes assumption (i.e., features areindependent given the class label). As the training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.
Let us assume that the training data does NOT satisfy the Naive Bayes assumption. As the training data approaches infinity, which classifier will produce better results, Naive Bayes or Logistic Regression? Please explain your reasoning.
Can we compute P(X) from the learned parameters of a Naive Bayes classifier? Please explain your reasoning.
Can we compute P(X) from the learned parameters of a Logistic Regression classifier? Please explain your reasoning.
Please read the following paper and briefly summarize the key ideas as you understood:
Thomas G. Dietterich (1995) Overfitting and under-computing in machine learning. Computing Surveys, 27(3), 326-327. http://www.cs.orst.edu/~tgd/publications/cs95.ps.gz
We need to perform statistical tests to compare the performance of two learning algorithms on a given learning task. Please read the following paper and briefly summarize the key ideas as you understood:
Thomas G. Dietterich: Approximate Statistical Test For Comparing Supervised Classification Learning Algorithms. Neural Computation 10(7): 1895-1923 (1998) http://sci2s.ugr.es/ keel/pdf/algorithm/articulo/dietterich1998.pdf
Programming and Empirical Analysis
Fortune Cookie Classifier[1]
You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example,
“Never go in against a Sicilian when death is on the line” would be a message in class 0. “You will get an A in Machine learning class” would be a message in class 1.
Files Provided There are three sets of files. All words in these files are lower case and punctuation has been removed.
The training data:
traindata.txt: This is the training data consisting of fortune cookie messages. trainlabels.txt: This file contains the class labels for the training data.
The testing data:
testdata.txt: This is the testing data consisting of fortune cookie messages. testlabels.txt: This file contains the class labels for the testing data.
A list of stopwords: stoplist.txt
There are two steps: the pre-processing step and the classification step. In the pre-processing step, you will convert fortune cookie messages into features to be used by your classifier. You will be using a bag of words representation. The following steps outline the process involved:
Form the vocabulary. The vocabulary consists of the set of all the words that are in the training data with stop words removed (stop words are common, uninformative words such as “a” and “the” that are listed in the file stoplist.txt). The vocabulary will now be the features of your training data. Keep the vocabulary in alphabetical order to help you with debugging.
Now, convert the training data into a set of features. Let M be the size of your vocabulary. For each fortune cookie message, you will convert it into a feature vector of size M. Each slot in that feature vector takes the value of 0 or 1. For these M slots, if the ith slot is 1, it means that the ith word in the vocabulary is present in the fortune cookie message; otherwise, if it is 0, then the ith word is not present in the message. Most of these feature vector slots will be 0. Since you are keeping the vocabulary in alphabetical order, the first feature will be the first word alphabetically in the vocabulary.
Implement the Naive Bayes Classifier (with Laplace Smoothing) and run it on the training data. Compute the training and testing accuracy.
To debug and test your implementation, you can employ Weka (weka.classifiers.bayes.NaiveBayes):
http://www.cs.waikato.ac.nz/ml/weka/downloading.html or scikit-learn (http://scikit-learn. org/stable/modules/naive_bayes.html)
Convolutional Neural Networks (CNNs) for solving image classification task. You will train a CNN on Fashion MNIST data. The network architecture contains 4 CNN layers followed by one pooling layer and a final fully connected layer. The basic architecture (in sequential order) will be as follows:
First CNN layer: input channels - 1, output channels - 8, kernel size = 5, padding = 2, stride = 2 followed by ReLU operation
Second CNN layer: input channels - 8, output channels - 16, kernel size = 3, padding = 1, stride = 2 followed by ReLU operation
Third CNN layer: input channels - 16, output channels - 32, kernel size = 3, padding = 1, stride = 2 followed by ReLU operation
Fourth CNN layer: input channels - 32, output channels - 32, kernel size = 3, padding = 1, stride = 2 followed by ReLU operation
one “Average” pooling layer (nn.AdaptiveAvgPool2d(1) would work in PyTorch)
Fully connected layer (nn.Linear in PyTorch) - determine the number of input features from previous CNN layers. This can be done easily by hand. The number of output features will be equal to number of classes, i.e., 10. If you want help, you can use the direct formula given on this page:
This will be a straightforward extension from the code discussed in the demo session. Plot the training and testing accuracy as a function of atleast 10 epochs. You could use a smaller sized dataset if compute power is a hurdle. A good choice would be 50 percent of the training set and 10 percent of the testing set. Please make sure you have equal ratio of all classes in the dataset. You can try all tips mentioned in the demo session for solving this task. Optionally, it will be a good idea to try adding other training techniques to see the maximum accuracy possible. Some of them include batch normalization, data augmentation, using other optimizers like ADAM etc.