CS6375 Assignment 4

1. **Support Vector Machines with Synthetic Data**, 50 points (two parts, 25 points each).

For this problem, we will generate synthetic data for a nonlinear binary classification problem and partition it into training, validation and test sets. Our goal is to understand the behavior of SVMs with Radial-Basis Function (RBF) kernels with different values of C and γ.


```python
#
# DO NOT EDIT THIS FUNCTION; IF YOU WANT TO PLAY AROUND WITH DATA GENERATION,
# MAKE A COPY OF THIS FUNCTION AND THEN EDIT
#
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def generate_data(n_samples, tst_frac=0.2, val_frac=0.2):
    # Generate a non-linear data set
    X, y = make_moons(n_samples=n_samples, noise=0.25, random_state=42)

    # Take a small subset of the data and make it VERY noisy; that is, generate outliers
    m = 30
    np.random.seed(30)  # Deliberately use a different seed
    ind = np.random.permutation(n_samples)[:m]
    X[ind, :] += np.random.multivariate_normal([0, 0], np.eye(2), (m, ))
    y[ind] = 1 - y[ind]

    # Plot this data
    cmap = ListedColormap(['#b30065', '#178000'])
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k')

    # First, we use train_test_split to partition (X, y) into training and test sets
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=tst_frac,
                                                  random_state=42)

    # Next, we use train_test_split to further partition (X_trn, y_trn)
    # into training and validation sets
    X_trn, X_val, y_trn, y_val = train_test_split(X_trn, y_trn, test_size=val_frac,
                                                  random_state=42)

    return (X_trn, y_trn), (X_val, y_val), (X_tst, y_tst)
```
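For reference, the splits can be generated with a single call. The sample size of 300 below is an illustrative assumption, not a value specified by the assignment:

```python
# Generate the data and split it into training (60%), validation (20%),
# and test (20%) sets. n_samples=300 is an assumed, illustrative value.
(X_trn, y_trn), (X_val, y_val), (X_tst, y_tst) = generate_data(300)
```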

```python
#
# DO NOT EDIT THIS FUNCTION; IF YOU WANT TO PLAY AROUND WITH VISUALIZATION,
# MAKE A COPY OF THIS FUNCTION AND THEN EDIT
#

def visualize(models, param, X, y):
    # Initialize plotting
    if len(models) % 3 == 0:
        nrows = len(models) // 3
    else:
        nrows = len(models) // 3 + 1

    fig, axes = plt.subplots(nrows=nrows, ncols=3, figsize=(15, 5.0 * nrows))
    cmap = ListedColormap(['#b30065', '#178000'])

    # Create a mesh
    xMin, xMax = X[:, 0].min() - 1, X[:, 0].max() + 1
    yMin, yMax = X[:, 1].min() - 1, X[:, 1].max() + 1
    xMesh, yMesh = np.meshgrid(np.arange(xMin, xMax, 0.01),
                               np.arange(yMin, yMax, 0.01))

    for i, (p, clf) in enumerate(models.items()):
        # if i > 0:
        #     break
        r, c = np.divmod(i, 3)
        ax = axes[r, c]

        # Plot contours
        zMesh = clf.decision_function(np.c_[xMesh.ravel(), yMesh.ravel()])
        zMesh = zMesh.reshape(xMesh.shape)
        ax.contourf(xMesh, yMesh, zMesh, cmap=plt.cm.PiYG, alpha=0.6)

        if (param == 'C' and p > 0.0) or (param == 'gamma'):
            ax.contour(xMesh, yMesh, zMesh, colors='k', levels=[-1, 0, 1],
                       alpha=0.5, linestyles=['--', '-', '--'])

        # Plot data
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k')
        ax.set_title('{0} = {1}'.format(param, p))
```

 

 

a. (25 points) The effect of the regularization parameter, C
Complete the Python code snippet below that takes the generated synthetic 2-d data as input and learns nonlinear SVMs. Use scikit-learn's SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) function to learn SVM models with radial-basis kernels for fixed γ and various choices of $C \in \{10^{-3}, 10^{-2}, \ldots, 1, \ldots, 10^5\}$. The value of γ is fixed to $\gamma = \frac{1}{d \cdot \sigma_X^2}$, where d is the data dimension and $\sigma_X^2$ is the variance of the data set X. SVC uses this setting automatically if you pass the argument gamma = 'scale' (see the documentation for more details).

Plot: For each classifier, compute both the training error and the validation error. Plot them together, making sure to label the axes and each curve clearly.

Discussion: How do the training error and the validation error change with C? Based on the visualization of the models and their resulting classifiers, how does changing C change the models? Explain in terms of minimizing the SVM's objective function, $\min_{w} \; \frac{1}{2} w'w + C \sum_{i=1}^{n} \ell(w; x_i, y_i)$, where ℓ is the hinge loss for each training example $(x_i, y_i)$.

Final Model Selection: Use the validation set to select the classifier corresponding to the best value, $C_{best}$. Report the accuracy on the test set for this selected best SVM model. Note: You should report a single number, your final test-set accuracy for the model corresponding to $C_{best}$.
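One possible completion of this part is sketched below. The use of `SVC(gamma='scale')` and the call to `visualize` follow the scaffolding above; the plotting and model-selection details are illustrative assumptions, not the official solution.

```python
# Sketch: RBF-kernel SVMs for C in {10^-3, ..., 10^5} with gamma='scale'.
from sklearn.svm import SVC

C_values = np.power(10.0, np.arange(-3.0, 6.0))  # 10^-3, 10^-2, ..., 10^5

models, trn_errors, val_errors = dict(), dict(), dict()
for C in C_values:
    clf = SVC(C=C, kernel='rbf', gamma='scale')
    clf.fit(X_trn, y_trn)
    models[C] = clf
    trn_errors[C] = 1 - clf.score(X_trn, y_trn)  # error = 1 - accuracy
    val_errors[C] = 1 - clf.score(X_val, y_val)

visualize(models, 'C', X_trn, y_trn)

# Plot training and validation error against log10(C)
plt.figure()
plt.plot(np.log10(C_values), list(trn_errors.values()), marker='o', label='Training error')
plt.plot(np.log10(C_values), list(val_errors.values()), marker='s', label='Validation error')
plt.xlabel('log10(C)')
plt.ylabel('Error')
plt.legend()
plt.show()

# Select C_best on the validation set and report test accuracy
C_best = min(val_errors, key=val_errors.get)
print('C_best = {0}, test accuracy = {1:.4f}'.format(
    C_best, models[C_best].score(X_tst, y_tst)))
```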

 

  File "<ipython-input-4-8875a1448a41>", line 17     visualize(models, 'C', X_trn, y_trn) 

            ^ 

IndentationError: expected an indented block 

 

b. (25 points) The effect of the RBF kernel parameter, γ
Complete the Python code snippet below that takes the generated synthetic 2-d data as input and learns various nonlinear SVMs. Use scikit-learn's SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) function to learn SVM models with radial-basis kernels for fixed C and various choices of $\gamma \in \{10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3\}$. The value of C is fixed to C = 10.

Plot: For each classifier, compute both the training error and the validation error. Plot them together, making sure to label the axes and each curve clearly.

Discussion: How do the training error and the validation error change with γ? Based on the visualization of the models and their resulting classifiers, how does changing γ change the models? Explain in terms of the functional form of the RBF kernel, $\kappa(x, z) = \exp(-\gamma \cdot \|x - z\|^2)$.

Final Model Selection: Use the validation set to select the classifier corresponding to the best value, $\gamma_{best}$. Report the accuracy on the test set for this selected best SVM model. Note: You should report a single number, your final test-set accuracy for the model corresponding to $\gamma_{best}$.
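A sketch analogous to part a, with C held at 10 and γ swept over the six values above; the details are again illustrative assumptions, not the official solution:

```python
# Sketch: RBF-kernel SVMs for gamma in {10^-2, ..., 10^3} with C = 10.
gamma_values = np.power(10.0, np.arange(-2.0, 4.0))  # 10^-2, ..., 10^3

models, trn_errors, val_errors = dict(), dict(), dict()
for g in gamma_values:
    clf = SVC(C=10.0, kernel='rbf', gamma=g)
    clf.fit(X_trn, y_trn)
    models[g] = clf
    trn_errors[g] = 1 - clf.score(X_trn, y_trn)
    val_errors[g] = 1 - clf.score(X_val, y_val)

visualize(models, 'gamma', X_trn, y_trn)
# Plot the error curves against log10(gamma) exactly as in part a.

g_best = min(val_errors, key=val_errors.get)
print('gamma_best = {0}, test accuracy = {1:.4f}'.format(
    g_best, models[g_best].score(X_tst, y_tst)))
```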

 

 

2. **Breast Cancer Diagnosis with Support Vector Machines**, 25 points.

For this problem, we will use the Wisconsin Breast Cancer (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) data set, which has already been pre-processed and partitioned into training, validation and test sets. Numpy's loadtxt (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html) command can be used to load CSV files.

 

Use scikit-learn's SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) function to learn SVM models with radial-basis kernels for each combination of $C \in \{10^{-2}, 10^{-1}, 1, 10^1, \ldots, 10^4\}$ and $\gamma \in \{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2\}$. Print the tables corresponding to the training and validation errors.

Final Model Selection: Use the validation set to select the classifier corresponding to the best parameter values, $C_{best}$ and $\gamma_{best}$. Report the accuracy on the test set for this selected best SVM model. Note: You should report a single number, your final test-set accuracy for the model corresponding to $C_{best}$ and $\gamma_{best}$.
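A sketch of one way to run this grid search is below. The CSV file names and the column layout (label first, features after) are assumptions; substitute whatever the files distributed with the assignment actually use.

```python
# Sketch: grid search over (C, gamma) on the pre-partitioned breast-cancer data.
import numpy as np
from sklearn.svm import SVC

# ASSUMED file names and layout: first column = label, remaining columns = features.
trn = np.loadtxt('wdbc_trn.csv', delimiter=',')
val = np.loadtxt('wdbc_val.csv', delimiter=',')
tst = np.loadtxt('wdbc_tst.csv', delimiter=',')
y_trn, X_trn = trn[:, 0], trn[:, 1:]
y_val, X_val = val[:, 0], val[:, 1:]
y_tst, X_tst = tst[:, 0], tst[:, 1:]

C_values = np.power(10.0, np.arange(-2.0, 5.0))      # 10^-2, ..., 10^4
gamma_values = np.power(10.0, np.arange(-3.0, 3.0))  # 10^-3, ..., 10^2

models = dict()
trn_err = np.zeros((len(C_values), len(gamma_values)))
val_err = np.zeros((len(C_values), len(gamma_values)))

for i, C in enumerate(C_values):
    for j, g in enumerate(gamma_values):
        clf = SVC(C=C, kernel='rbf', gamma=g)
        clf.fit(X_trn, y_trn)
        models[(C, g)] = clf
        trn_err[i, j] = 1 - clf.score(X_trn, y_trn)
        val_err[i, j] = 1 - clf.score(X_val, y_val)

print('Training errors (rows: C, columns: gamma):\n', trn_err)
print('Validation errors (rows: C, columns: gamma):\n', val_err)

# Pick the (C, gamma) pair with the lowest validation error
i_best, j_best = np.unravel_index(np.argmin(val_err), val_err.shape)
C_best, g_best = C_values[i_best], gamma_values[j_best]
print('C_best = {0}, gamma_best = {1}, test accuracy = {2:.4f}'.format(
    C_best, g_best, models[(C_best, g_best)].score(X_tst, y_tst)))
```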

 

 

3. **Breast Cancer Diagnosis with k-Nearest Neighbors**, 25 points.

Use scikit-learn's k-nearest neighbor (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier to learn models for Breast Cancer Diagnosis with k ∈ {1, 5, 11, 15, 21}, using the kd-tree algorithm.

Plot: For each classifier, compute both the training error and the validation error. Plot them together, making sure to label the axes and each curve clearly.

Final Model Selection: Use the validation set to select the classifier corresponding to the best parameter value, $k_{best}$. Report the accuracy on the test set for this selected best kNN model. Note: You should report a single number, your final test-set accuracy for the model corresponding to $k_{best}$.
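A minimal sketch, assuming the data are loaded into X_trn, y_trn, X_val, y_val, X_tst, and y_tst as in Problem 2; the plotting details are illustrative:

```python
# Sketch: k-NN classifiers with a kd-tree, one per value of k.
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = [1, 5, 11, 15, 21]
models, trn_errors, val_errors = dict(), dict(), dict()

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm='kd_tree')
    knn.fit(X_trn, y_trn)
    models[k] = knn
    trn_errors[k] = 1 - knn.score(X_trn, y_trn)
    val_errors[k] = 1 - knn.score(X_val, y_val)

# Plot training and validation error against k
plt.figure()
plt.plot(k_values, [trn_errors[k] for k in k_values], marker='o', label='Training error')
plt.plot(k_values, [val_errors[k] for k in k_values], marker='s', label='Validation error')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Error')
plt.legend()
plt.show()

k_best = min(val_errors, key=val_errors.get)
print('k_best = {0}, test accuracy = {1:.4f}'.format(
    k_best, models[k_best].score(X_tst, y_tst)))
```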

 

Discussion: Which of these two approaches, SVMs or kNN, would you prefer for this classification task? Explain.
