$25
Problem 1 - Linear Separability
Consider a dataset with two features x1 and x2 in which the points (−1,−1),(1,1),(−3,−3),(4,4) belong to one class and (−1,1),(1,−1),(−5,2),(4,−8) belong to the other.
1. Is this dataset linearly separable ? Can a linear classifier be trained using features x1 and x2 to classify this data set ? You can plot the dataset points and argue. (2)
2. Can you define a new 1-dimensional representation z in terms of x1 and x2 such that the dataset is linearly separable in terms of 1-dimensional representation corresponding to z ? (4)
3. What does the separating hyperplane looks like ? (2)
4. Explain the importance of nonlinear transformations in classification problems. (2)
Problem 2 - Bias Variance Tradeoff, Regularization
1. Derive the bias-variance decomposition for a regression problem, i.e., prove that the expected mean squared error of a regression problem can be written as
E[MSE] = Bias2 + V ariance + Noise
Hint: Let be the true (unknown) relationship and ˆy = g(x) be the model predicted value of y. Then MSE over test instance xi, i = 1,...,t, is given by:
2. Consider the case when y(x) = x + sin(1.5x) + N(0,0.3), here f(x) = x + sin(1.5x) and
Create a dataset of size 20 points by randomly generating samples from y. Display the dataset and f(x). Use scatter plot for y and smooth line plot for f(x). (5)
3. Use weighted sum of polynomials as an estimator function for f(x), in particular, let the form of estimator function be:
gn(x) = β0 + β1x + β2x2 + ..... + βnxn
Consider three candidate estimators, g1,g3, and g10. Estimate the coefficients of each of the three estimators using the sampled dataset and plot f(x),g1(x),g3(x),g10(x). Which estimator is underfitting ? Which one is overfitting ? (10)
4. Generate 100 datasets (each of size 50) by randomly sampling from y. Partition each dataset into training and test set (80/20 split). Next fit the estimators of varying complexity, i.e., g1,g2,....g15 using the training set for each dataset. Then calculate and display the squared bias, variance, and error on testing set for each of the estimators showing the tradeoff between bias and variance with model complexity. Can you identify the best model ? (10)
5. One way to increase model bias is by using regularization. Lets take the order 10 polynomial and apply L2 regularization. Compare the bias, variance, and MSE of the regularized model with the unregularized order 10 polynomial model ? Does the regularized model have a higher or lower bias ? What about MSE ? Explain. (10)
1
Problem 3 - OpenML, Algorithmic Performance Scaling
OpenML (https://www.openml.org) has thousands of datasets for classification tasks. Select any 2 datasets from OpenML with different number of output classes.
1. Summarize the attributes of each dataset: number of features, number of instances, number of classes, number of numerical features, number of categorical features. (5)
2. For each dataset, select 80% of data as training set and remaining 20% as test set. Generate 10 different subsets of the training set by randomly subsampling 10%,20%,...,100% of the training set. Use each of these subsets to train two different classifiers: Random forest and Gradient boosting. When training a classifier also measure the wall clock time to train. After each training, evaluate the accuracy of trained models on the test set. Report model accuracy and training time for each of the 10 subsets of the training set. Generate learning curve for each classifier. A learning curve shows how the accuracy changes with increasing size of training data. Also create a curve showing the training time of classifiers with increasing size of training data. So, for each dataset you will have two figures: First figure showing learning curves (x-axis being training data size and y-axis accuracy) for the two classifiers and second Figure showing training time for the two classifiers as a function of training data size. (15)
3. Study the scaling of training time and accuracy of classifiers with training data size using the two figures generated in part 2 of the question. Compare the performance of classifiers in terms of training time and accuracy and write 3 main observations. Which gives better accuracy ? Which has shorter training time ? (5)
Problem 4 - Precision, Recall, ROC
This question is based on two papers, one from ICML 2006 and other from NIPS 2015 (details below). ICML paper talks about the relationship between ROC and Precision-Recall (PR) curves and shows a one-to-one correspondence between them. NIPS paper introduces Precision-Recall-Gain (PRG) curves. You need to refer to the two papers to answer the following questions.
1. Does true negative matter for both ROC and PR curve ? Argue why each point on ROC curve corresponds to a unique point on PR curve ? (5)
2. Select one OpenML dataset with 2 output classes. Use two binary classifiers (Adaboost and Logistic regression) and create ROC and PR curves for each of them. You will have two figures: one containing two ROC and other containing two PR curves. Show the point where an all positive classifier lies in the ROC and PR curves. (10)
3. NIPS paper defined PR Gain curve. Calculate AUROC (Area under ROC), AUPR (Area under PR), and AUPRG (Area under PRG) for two classifiers and compare. Do you agree with the conclusion of NIPS paper that practitioners should use PR gain curves rather than PR curves. (10)