10.009 The Digital World
Problem Set 10 (for week 10)
Please note:
Attempt this problem set using the Jupyter notebook; it is much more convenient for these exercises.
Objectives
1. Use the matplotlib library to visualize data using a scatter plot, bar chart, box plot and histogram
2. Explain the terms feature, target and record
3. Obtain and interpret the confusion matrix for binary classification
4. Explain the k-Nearest Neighbours (kNN) classification model
5. Use the scikit-learn library and the breast cancer dataset to:
a. implement a kNN classification model
b. implement a linear regression model
Cohort Session
Attention.
● All numpy arrays that are to be returned by functions are to be two-dimensional arrays.
● Linear Regression is discussed in Questions 5 and 6.
● k-Nearest Neighbours is discussed in Questions 1-4 and 7.
1. Confusion Matrix. Before you do any machine learning, you should understand what a confusion matrix is.
For this exercise, we will limit ourselves to categorical target variables containing only two categories.
Suppose that you had four images of birds and four images of cats. You can think of each image as a record, and the target variable is either bird or cat.
Images are taken from various free stock photo sites.
Suppose you also had a computer program that takes in data from an image and is able to predict what object is within that image. This is essentially what a machine learning model does. A machine learning model takes in the features in each record and makes a prediction on the target variable.
This prediction is then compared against the actual target variable. The results for all the records in the dataset are summarized in the confusion matrix. From this matrix you can calculate measures such as the accuracy and the sensitivity. These measures tell you how well your model performs in its prediction task.
To keep things simple, let us assume that for our machine learning model
● only images of actual birds or cats are given to it as input
● the model can only tell you whether the image is a bird or cat
a. Let’s begin with a pen-and-paper exercise. The actual target variables are shown in the variable named actual. The predictions made by a machine learning model are shown in predicted. From the data below, complete the confusion matrix.
actual    = ['cat', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird']
predicted = ['cat', 'cat', 'bird', 'bird', 'cat', 'bird', 'bird', 'bird']
              Predicted bird   Predicted cat
Actual bird   _____            _____
Actual cat    _____            _____
How many records were predicted correctly?
How many records were predicted wrongly?
How many ‘bird’ records were wrongly classified? How many ‘cat’ records were wrongly classified?
b. A sample function get_metrics() is given below that takes in three inputs: a list of actual targets, a list of predicted targets and the labels in the order you want.
In the data above, there are two categories, ‘bird’ and ‘cat’. One of them would have to be designated as the positive case. The positive case is what you really want to predict, or which category is more important.
Suppose you have a dataset of fraudulent and non-fraudulent credit card transactions. It would certainly be more important for you to identify fraudulent transactions. Hence we would treat ‘fraudulent’ as the positive case.
Let us treat ‘cat’ as the positive case. The labels parameter should then specify the negative case, followed by the positive case.

labels = ['bird', 'cat']

Conversely, suppose you are more interested in birds. Then you would treat ‘bird’ as the positive case and the labels parameter should be specified as follows.

labels = ['cat', 'bird']

Run the following script to check your pen-and-paper exercise.
from sklearn.metrics import confusion_matrix

def get_metrics(actual_targets, predicted_targets, labels):
    c_matrix = confusion_matrix(actual_targets, predicted_targets, labels=labels)
    return c_matrix

actual = ['cat', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird']
predicted = ['cat', 'cat', 'bird', 'bird', 'cat', 'bird', 'bird', 'bird']
labels = ['bird', 'cat']
print(get_metrics(actual, predicted, labels))
From the confusion matrix, the following metrics can be calculated.
○ Accuracy = total correct predictions / total records
○ Sensitivity = total correct positive cases / total positive cases. (This metric is also known as recall.)
○ False positive rate = total false positives / total negative cases. The false positives are the actual negative cases that have been predicted to be positive.
Modify get_metrics() to return a dictionary containing the confusion matrix, as well as the results of the metrics above. Round the metrics to three decimal places.
Submit your code to Vocareum.
With the bird/cat data as above, the expected output is as follows.
{'confusion matrix': array([[3, 1],
       [2, 2]]), 'total records': 8, 'accuracy': 0.625, 'sensitivity': 0.5, 'false positive rate': 0.25}
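If you are stuck, here is a minimal sketch of one possible modification. It assumes the labels list places the negative case first, so the positive case occupies row and column index 1 of the matrix.

from sklearn.metrics import confusion_matrix

def get_metrics(actual_targets, predicted_targets, labels):
    # rows = actual, columns = predicted; labels = [negative, positive]
    c_matrix = confusion_matrix(actual_targets, predicted_targets, labels=labels)
    tn, fp = c_matrix[0, 0], c_matrix[0, 1]
    fn, tp = c_matrix[1, 0], c_matrix[1, 1]
    total = tn + fp + fn + tp
    return {
        'confusion matrix': c_matrix,
        'total records': int(total),
        'accuracy': round((tp + tn) / total, 3),
        'sensitivity': round(tp / (tp + fn), 3),
        'false positive rate': round(fp / (fp + tn), 3),
    }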
2. The Five-Number Summary. A simple summary of each numerical feature in the dataset is the five-number summary. It is the numerical version of the box plot. Write a function five_number_summary() that takes in a numpy array and returns a list of dictionaries. Each dictionary contains the five-number summary for the corresponding column.
The function definition is given below and some suggested numpy functions are also given. x should be a 2D numpy array. If x is otherwise, return None.
def five_number_summary(x):
    np.max(x)
    np.min(x)
    np.percentile(x, 25)
    # and so on
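For reference, a minimal sketch of one possible implementation is shown below; the key names follow the expected output shown later, and the column-wise loop is an assumption about how multiple columns should be handled.

import numpy as np

def five_number_summary(x):
    # accept only 2D numpy arrays
    if not isinstance(x, np.ndarray) or x.ndim != 2:
        return None
    summaries = []
    for col in range(x.shape[1]):
        column = x[:, col]
        summaries.append({
            'minimum': np.min(column),
            'first quartile': np.percentile(column, 25),
            'median': np.percentile(column, 50),
            'third quartile': np.percentile(column, 75),
            'maximum': np.max(column),
        })
    return summaries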
Test your function using the following script.
first_column = bunchobject.data[:, [1]]
print(five_number_summary(first_column))
The expected output is as follows. (Your own output may have floating point errors, which is ok. You are not required to do any rounding in the rest of this problem set.)

[{'minimum': 9.71, 'first quartile': 16.17, 'median': 18.84, 'third quartile': 21.80, 'maximum': 39.28}]
As an exercise, produce the actual box plot and compare it with the output above.
This function should also be able to take in more than one column. Hence, the following test script should work too.

col_no = [0, 1, 2]
some_columns = bunchobject.data[:, col_no]
print(five_number_summary(some_columns))
3. Min/Max Normalization. Write a function normalize_minmax() that takes in a 2D numpy array, normalizes it using the min/max normalization and returns the normalized array.
The function header is given to you. data should be a 2D numpy array. If data is otherwise, return None.
def normalize_minmax(data):
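Min/max normalization rescales each column so that its minimum maps to 0 and its maximum to 1, i.e. each value v becomes (v - min) / (max - min). A minimal sketch (the column-wise broadcasting is one possible way to do this):

import numpy as np

def normalize_minmax(data):
    # accept only 2D numpy arrays
    if not isinstance(data, np.ndarray) or data.ndim != 2:
        return None
    # column-wise minima and maxima, broadcast across the rows
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)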
The following is a test case. We check the five-number summary to see that the values have been normalized.

first_column = bunchobject.data[:, [1]]
first_column_norm = normalize_minmax(first_column)
print(five_number_summary(first_column_norm))

[{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0}]
Your function should also work if more than one column is input. Hence, for the following test script:

cols = [1, 7]
some_columns = bunchobject.data[:, cols]
snorm = normalize_minmax(some_columns)
print('normalized', five_number_summary(snorm))
The expected output is as follows.

normalized [{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0}, {'minimum': 0.0, 'first quartile': 0.10094433399602387, 'median': 0.1665009940357853, 'third quartile': 0.36779324055666002, 'maximum': 1.0}]
4. k-Nearest Neighbours model. Having understood what a confusion matrix says, you are ready to build your first classifier using the k-Nearest Neighbours model.
Plot a Bar Chart for the target variable
Before doing so, we should plot a bar chart showing the distribution of categories in the target variable of the breast cancer dataset. Recall that it is useful to understand the balance of classes in the target variable. A bar chart is a helpful visualization of this.
Write a function display_bar_chart() that takes in four inputs. For the first two inputs, see the test script below. The third input is the name of each category. The fourth is an optional title.
The function header is given below.

def display_bar_chart(positions, counts, names, title_name='default'):
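A minimal sketch of one possible implementation (the axis labels are assumptions, since only the inputs are specified):

import matplotlib.pyplot as plt

def display_bar_chart(positions, counts, names, title_name='default'):
    # one bar per category, labelled with the category names
    plt.bar(positions, counts)
    plt.xticks(positions, names)
    plt.ylabel('count')
    plt.title(title_name)
    plt.show()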
Test your function using this script.

unique, counts = np.unique(bunchobject.target, return_counts=True)
display_bar_chart(unique, counts, bunchobject.target_names)
You do not need to submit this function to Vocareum.
Building your k-Nearest Neighbours classifier
Having seen the bar chart, we are ready to build a k-nearest neighbours classifier. The steps are as follows.
Step 1. Obtain the dataset. You have already seen how to do this.
Step 2. Select the features that are to be included in the dataset. The dataset has 30 features, and for our first analysis, let us select the first 20.

feature_list = range(20)  # features from column 0 to 19
data = bunchobject.data[:, feature_list]
Step 3. Each numerical feature selected is normalized using the min/max normalization.
Step 4. The dataset (which includes the target variable) is divided into two sets, the training set and the test set. The analyst typically decides the percentage, and a typical value is to choose the test set from 40% of the records. The performance of the model is checked using the data from the test set.
This is done using the train_test_split() method, which conducts a random sampling from the records to give you the two sets. Read the documentation for details.
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.40, random_state=42)
Step 5. Select a value of k to build the classifier. The classifier is built using the data from the training set.
Step 6. The classifier is then used to make predictions on the target variable in the test set.
A partial set of code for these two steps is given below. Read the documentation to find out how to complete it.
clf = neighbors.KNeighborsClassifier(pass)
clf.fit(pass)
target_predicted = clf.predict(pass)
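For reference, one possible completion of this fragment (a sketch, assuming the variable names from Step 4 and a chosen value k):

from sklearn import neighbors

# build the classifier on the training set (step 5)
clf = neighbors.KNeighborsClassifier(n_neighbors=k)
clf.fit(data_train, target_train)

# make predictions on the test set (step 6)
target_predicted = clf.predict(data_test)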
Step 7. The results of this classification are reported in the confusion matrix and the various metrics. You have already written a function for this.
These steps can be completed in a single function. Write a function knn_classifier() that takes in the following inputs:
● The bunchobject that is obtained after loading the dataset
● A list containing the column numbers of the features to be selected
● The size of the test set as a fraction of the total number of records
● A random number seed to ensure that the results can be repeated
● The value of k that is selected
def knn_classifier(bunchobject, feature_list, size, seed, k):
    # step 2
    # step 3
    # step 4
    # step 5
    # step 6
    results = get_metrics(pass)  # step 7
    return results
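Putting the earlier fragments together, a minimal sketch of one possible implementation is shown below. It reuses normalize_minmax() and get_metrics() from the earlier questions, and the labels order [0, 1] passed to get_metrics() is an assumption.

from sklearn import neighbors
from sklearn.model_selection import train_test_split

def knn_classifier(bunchobject, feature_list, size, seed, k):
    data = normalize_minmax(bunchobject.data[:, feature_list])  # steps 2 and 3
    target = bunchobject.target
    # step 4: split into training and test sets
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=size, random_state=seed)
    # step 5: build the classifier on the training set
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(data_train, target_train)
    # step 6: predict on the test set
    target_predicted = clf.predict(data_test)
    # step 7: the labels order here is an assumption
    results = get_metrics(target_test, target_predicted, [0, 1])
    return results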
The following is a test case, where the first 20 features are selected.

features = range(20)
results = knn_classifier(bunchobject, features, 0.40, 2752, 3)
print(results)
The output is

{'confusion matrix': array([[141,   5],
       [  9,  73]]), 'total records': 228, 'accuracy': 0.939, 'sensitivity': 0.89, 'false positive rate': 0.034}
Notes:
(1) The choice of features in Questions 4 and 7 is arbitrary. Methods exist to select features systematically, but we are not discussing this in this problem set.
(2) The choice of k in this question is arbitrary. In Question 7, we will see how to choose the value of k systematically.
5. Linear Regression.
Create a scatter plot
To determine whether two features have a linear relationship, the first step is to create a scatter plot and examine it.
Write a function display_scatter() that takes in two numpy vectors, together with optional arguments for the x-axis label, y-axis label and title. The scatter plot is then displayed. The function header is given below.

def display_scatter(x, y, xlabel='x', ylabel='y', title_name='default'):
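A minimal sketch of one possible implementation:

import matplotlib.pyplot as plt

def display_scatter(x, y, xlabel='x', ylabel='y', title_name='default'):
    plt.scatter(x, y)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title_name)
    plt.show()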
x_index = 0
y_index = 3
x = bunchobject.data[:, [x_index]]
y = bunchobject.data[:, [y_index]]
x_label = bunchobject.feature_names[x_index]
y_label = bunchobject.feature_names[y_index]
display_scatter(x, y, x_label, y_label)
Obtaining the linear regression
Your scatter plot likely suggests that two features seem to have a linear relationship. Using linear regression, we are able to determine the extent to which this is true. We are also able to make predictions of the value of one feature from another. The steps are as follows.
Step 1. Obtain the dataset. You have already seen how to do this.
Step 3. The dataset is divided into two sets at random, the training set and the test set. The analyst typically decides the percentage and a typical value is to choose the test set from 40% of the records.
This is done using the train_test_split() method. Read the documentation for details.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.40, random_state=42)
Step 4. The linear regression model is built using the data from the training set.
Step 5. The model is then used to make predictions on the target variable in the test set.
A partial set of code for these two steps is given below. Read the documentation to find out how to complete it.
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(pass)
y_pred = regr.predict(pass)
Step 6. Obtain the coefficients, intercept, mean-squared error and the r2 score. You will need to import the following functions. Read the documentation on how to use them.

from sklearn.metrics import mean_squared_error, r2_score
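As a hint, steps 4 to 6 might be completed along the following lines (a sketch; the dictionary keys mirror the expected output shown later):

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# step 4: fit the model on the training set
regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)

# step 5: predict on the test set
y_pred = regr.predict(x_test)

# step 6: collect the coefficients and quality metrics
results = {
    'coefficients': regr.coef_,
    'intercept': regr.intercept_,
    'mean squared error': mean_squared_error(y_test, y_pred),
    'r2 score': r2_score(y_test, y_pred),
}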
These steps can be completed in a single function. Write a function linear_regression() that takes in the following inputs:
● The bunchobject that is obtained after loading the dataset
● An integer that represents the column number of the x-variable
● An integer that represents the column number of the target variable
● The size of the test set as a fraction of the total number of records
● A random number seed to ensure that the results can be repeated
The function returns the data in the training set, the predictions made on the test set and a dictionary showing the results of the model (see output below). Submit this function linear_regression() to Vocareum.

def linear_regression(bunchobject, x_index, y_index, size, seed):
    # step 2
    # step 3
    # step 4
    # step 5
    # step 6
    return x_train, y_train, x_test, y_pred, results
Also, complete the following function to plot the data that you obtained from your linear regression. You need not submit this function.

def plot_linear_regression(x1, y1, x2, y2, x_label='', y_label=''):
    plt.scatter(x1, y1, color='black')
    pass
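One possible completion (a sketch; showing the test-set predictions in a second colour is an assumption about the intended plot):

import matplotlib.pyplot as plt

def plot_linear_regression(x1, y1, x2, y2, x_label='', y_label=''):
    plt.scatter(x1, y1, color='black')   # training data
    plt.scatter(x2, y2, color='blue')    # predictions on the test set
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.show()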
The following is a test case, where column 3 is the target variable and column 0 is the x-variable.

x_train, y_train, x_test, y_pred, results = linear_regression(bunchobject, 0, 3, 0.4, 2752)
print(results)
plot_linear_regression(x_train, y_train, x_test, y_pred, bunchobject.feature_names[0], bunchobject.feature_names[3])
The output is as follows. Remember to also produce the plot.

{'coefficients': array([[ 100.16755386]]), 'intercept': array([-760.52027342]), 'mean squared error': 2631.2988797244757, 'r2 score': 0.97772539335215169}
Interpreting the results
The last step is to think about the results that you have obtained.
● From the r2 score, what can you say about the extent to which both variables are correlated?
● Can you rely on the r2 score alone to make this judgement? (Read about Anscombe's quartet.)
● Hence, what did you see from the scatter plot?
6. Multiple Linear Regression. In the previous question, you noticed that a certain trend could not be reproduced by a pure linear regression, i.e. a y = a1 x + a2 model is not sufficient.
We can try to improve the fit by including higher-order variables, e.g. a second-order model will look like this: y = a0 x² + a1 x + a2. Including higher orders of the same independent variable is called Polynomial Regression and is a special case of multiple linear regression.
Modify the function you wrote in Question 5. It will now take in an additional parameter called order. order = 2 means you want to try fitting the data to a second-order model like the one above.
The function header is given below.

def multiple_linear_regression(bunchobject, x_index, y_index, order, size, seed):
Previously, your x-values were contained in a numpy array with one column. With this function, your model will now have multiple inputs. You need to ensure that you have a numpy array with the same number of columns as the order specified. Each column will then correspond to data for x, x², x³ and so on.
Part of the code needed to achieve this is as follows. Complete the missing parts by reading the documentation. Please see the Appendix at the end as well.
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(pass, include_bias=False)
c_data = poly.fit_transform(pass)
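One possible completion (a sketch, assuming order holds the desired polynomial degree and x is the single-column array of x-values):

from sklearn.preprocessing import PolynomialFeatures

# degree=order generates one column each for x, x^2, ..., x^order;
# include_bias=False omits the column of ones, so the intercept is fitted separately
poly = PolynomialFeatures(order, include_bias=False)
c_data = poly.fit_transform(x)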
The function returns the data in the training set, the predictions made on the test set and a dictionary showing the results of the model (see output below). Please note that for the x-values in each set, you are only required to return the column for x (the first power), as we need just this data to produce the scatter plot. The higher orders of x are not needed. A reminder that you should return this column as a 2D numpy array.
Submit the function multiple_linear_regression() to Vocareum.
The following is a test script.

x_train, y_train, x_test, y_pred, results = multiple_linear_regression(bunchobject, 0, 3, 4, 0.4, 2752)
print(results)
plot_linear_regression(x_train, y_train, x_test, y_pred, bunchobject.feature_names[0], bunchobject.feature_names[3])

With this test script, the output is as follows.

{'coefficients': array([[ -1.28141031e+02, 1.57502508e+01, -5.29186793e-01, 7.97220165e-03]]), 'intercept': array([ 459.72265999]), 'mean squared error': 145.64415629863078, 'r2 score': 0.99876708559521399}
Run your function for orders from 1 to 4 and tabulate the r2 values and mean-squared error.
Is a higher order always better? What is the best order to choose?
7. k-Nearest Neighbours (full). In Question 4, you had your first exposure to a k-Nearest Neighbours classifier, using one value of k to make predictions on the test set.
Before we actually deploy a model to make predictions, we will have to go through a validation process. In Question 4, we arbitrarily selected the value of k. How do we know the value of k we used in Question 4 is the value that gives the best accuracy?
The solution is to divide the dataset into three sets: the (1) Training set, the (2) Validation set, and (3) the Test set.
The idea is that we make predictions with the classifier on the validation set with different values of k. For each value of k, we record the metric that is important to us (e.g. accuracy). We select the smallest value of k that gives us the best performance on this metric.
You now have the best value of k. With this best value of k, the classifier is then used to make predictions on the test set. This then gives you an idea of how the model will perform on new data.
Steps 1 to 3 remain the same as in Question 4.
Step 4. Divide the dataset into the three sets, i.e. training, validation and test set. There is no hard and fast rule on the proportions, but a typical split is 60% : 20% : 20%, which we will use in this question. You will have to apply the train_test_split() function twice.
data_train, data_part2, target_train, target_part2 = train_test_split(data, target, test_size=0.40, random_state=42)

# now data_part2 and target_part2 contain 40% of your records.
# next, train_test_split() is called again
# to split data_part2 and target_part2 into two sets of 20% each
# fill in the inputs marked pass
data_validation, data_test, target_validation, target_test = train_test_split(pass, pass, test_size=pass, random_state=42)
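As a hint: data_part2 already holds 40% of the records, so splitting it in half yields two sets of 20% each. One possible completion (a sketch):

# half of the 40% portion goes to validation, half to test (20% each overall)
data_validation, data_test, target_validation, target_test = train_test_split(
    data_part2, target_part2, test_size=0.5, random_state=42)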
Step 5. The classifier is built using the data from the training set.
Step 6. The classifier is then used to make predictions on the target variable in the validation set. This is repeated for values of k from 1 to a certain number, say 20. This helps us to decide which is the best value of k to choose. The pseudocode below shows you how it is done.
for k in range(1, 21):
    # get an instance of the classifier for a particular value of k
    # fit the model to the training set (step 5)
    # make a prediction on the validation set (step 6)
    # get the accuracy of the prediction
    # store the accuracy in a list
After this, you would have information on the accuracy for each value of k stored in a list. You then choose the smallest value of k that gives you the best accuracy.
Explore the following example code for some hints on how to do this with a list.

acc = [1, 3, 5, 5, 5, 4, 2, 2, 3]
max_acc = max(acc)
value = acc.index(max_acc)
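Note that list.index() returns the position of the first occurrence of its argument, which here is exactly the smallest k with the best accuracy; just mind the offset between list index and k. A sketch of the loop under these assumptions (accuracy could equally be taken from the get_metrics() dictionary; sklearn's accuracy_score is used here as one option):

from sklearn import neighbors
from sklearn.metrics import accuracy_score

acc = []
for k in range(1, 21):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)  # classifier for this k
    clf.fit(data_train, target_train)                    # step 5
    predicted = clf.predict(data_validation)             # step 6
    acc.append(accuracy_score(target_validation, predicted))

# first (i.e. smallest-k) occurrence of the best accuracy;
# k runs from 1 while list indices run from 0, hence the +1
best_k = acc.index(max(acc)) + 1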
Step 7. With the best value of k, the classifier is used to make predictions on the test set.
Thus far, the test set has not been involved in building the model, nor has it been involved in the validation process.
Hence, the test set is a good proxy for data that has not been ‘seen’ by the model. Using the model on the test set thus gives you an idea of the model’s performance on new data.
Complete the following function to carry out this process. Submit this function to Vocareum.
def knn_classifier_full(bunchobject, feature_list, size, seed):
    # step 2
    # step 3
    # step 4
    # step 5
    # step 6
    # step 7
    return out_results
out_results is a dictionary with three key-value pairs:
● ‘best k’ stores the best value of k found
● ‘validation set’ stores the dictionary returned by get_metrics() for the predictions made on the validation set for this best value of k
● ‘test set’ stores the dictionary returned by get_metrics() for the predictions made on the test set for this best value of k
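Assembling the earlier sketches, one possible implementation looks like this (the labels order [0, 1] passed to get_metrics() and the k range of 1 to 20 are assumptions):

from sklearn import neighbors
from sklearn.model_selection import train_test_split

def knn_classifier_full(bunchobject, feature_list, size, seed):
    data = normalize_minmax(bunchobject.data[:, feature_list])  # steps 2 and 3
    target = bunchobject.target
    # step 4: 60% / 20% / 20% split via two calls to train_test_split()
    data_train, data_part2, target_train, target_part2 = train_test_split(
        data, target, test_size=size, random_state=seed)
    data_validation, data_test, target_validation, target_test = train_test_split(
        data_part2, target_part2, test_size=0.5, random_state=seed)
    # steps 5 and 6: smallest k with the best validation accuracy
    acc = []
    for k in range(1, 21):
        clf = neighbors.KNeighborsClassifier(n_neighbors=k)
        clf.fit(data_train, target_train)
        predicted = clf.predict(data_validation)
        acc.append(get_metrics(target_validation, predicted, [0, 1])['accuracy'])
    best_k = acc.index(max(acc)) + 1
    # step 7: report the metrics for best_k on the validation and test sets
    clf = neighbors.KNeighborsClassifier(n_neighbors=best_k)
    clf.fit(data_train, target_train)
    out_results = {
        'best k': best_k,
        'validation set': get_metrics(target_validation, clf.predict(data_validation), [0, 1]),
        'test set': get_metrics(target_test, clf.predict(data_test), [0, 1]),
    }
    return out_results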
The following is a test script.

features = range(20)  # select features in cols 0 to 19
results = knn_classifier_full(bunchobject, features, 0.40, 2752)
print(results)
The output of this test script is shown below.

{'best k': 4, 'validation set': {'confusion matrix': array([[71,  2],
       [ 2, 39]]), 'total records': 114, 'accuracy': 0.965, 'sensitivity': 0.951, 'false positive rate': 0.027}, 'test set': {'confusion matrix': array([[69,  4],
       [ 1, 40]]), 'total records': 114, 'accuracy': 0.956, 'sensitivity': 0.976, 'false positive rate': 0.055}}
End of Problem Set
Appendix For Question 6
Suppose you had extracted column 0 from the breast cancer dataset as the independent variable (and column 3 as the dependent variable).

b = datasets.load_breast_cancer()
x = b.data[:, [0]]
y = b.data[:, [3]]
You wish to fit it to the polynomial regression y = a0 + a1 x + a2 x². There are two ways that you can do it.
Option 1. Generate a 2D array containing values for x and x².
Thus, when fitting to the linear equation, ask for the constant term to be calculated. To do this,
● Set the parameter include_bias to False
● Set the parameter fit_intercept to True
## import statements not shown
poly = PolynomialFeatures(2, include_bias=False)
c_data = poly.fit_transform(x)
print(c_data)  ## c_data has two columns, one for x, one for x-squared

## code for train_test_split not shown
regr = linear_model.LinearRegression(fit_intercept=True)
regr.fit(c_train, y_train)
## and so on
Option 2. Generate a 2D array containing values of 1, x and x².
Because a column of ones is included in the array for x, when fitting to the linear equation, you need not ask for the constant term to be calculated. To do this,
● Set the parameter include_bias to True
● Set the parameter fit_intercept to False
## import statements not shown
poly = PolynomialFeatures(2, include_bias=True)
c_data = poly.fit_transform(x)
print(c_data)
## c_data has three columns, one of 1, one for x, one for x-squared

## code for train_test_split not shown
regr = linear_model.LinearRegression(fit_intercept=False)
regr.fit(c_train, y_train)
## and so on