A. Mathematics
Consider a classification problem where we wish to determine if a human subject is likely to get a concussion in the next year. We use five features – x1 (Age), x2 (concussHistory), x3 (FavoriteSport), x4 (LastEducation), x5 (Gender). Each feature takes on one of a discrete number of values, shown below:
Age: Child, Teen, Adult
concussHistory: Never, Recent, DecadesAgo
FavoriteSport: Boxing, Golf, Rugby, Baseball
LastEducation: Masters, HighSchool, College
Gender: Female, Male
We wish to classify each user as either yi=LikelyConcuss or yi=NotLikelyConcuss.
1. How can the features above be transformed to use a logistic classifier? Explain in 1-4 sentences.
2. How many parameters are required to learn a separating hyper-plane (w and any other necessary elements) for logistic classification with the features converted in question 1? (Work from your answer to question 1. If you could not figure out question 1, assume we have a new space of 6 continuous numeric features x1, x2, …, x6 – this may or may not be a valid result from question 1.)
3. Which of the following vectors has (approximately) unit magnitude? (At most 1.1, at least 0.9.)
(a) [0.3, −0.6, 0.7, −0.6]ᵀ  (b) [0, 0.6, −0.4, 0.7]ᵀ  (c) [1, −1, 0, −1]ᵀ  (d) [2, 0, −1, 0.5]ᵀ
4. What is the projection (dot product) of the vectors below onto the unit magnitude vector from question 3? (If you believe none of the vectors in question 3 have unit magnitude, just use vector (d) from question 3.)
(a) [2, 1, 0, 0]ᵀ  (b) [−2, 1, 0, 2]ᵀ  (c) [0.5, 1, −0.15, 1.5]ᵀ
5. Let us assume we have a classifier with 𝒘 = [1, −4, 6]ᵀ and 𝑏 = 2. Provide two distinct inputs xi that would be on the boundary between class 0 and 1; specifically, find two inputs xi where g(xi;w)=0.5.
6. In class, we define the logistic/sigmoid function as g(h) = 1/(1 + e^(−h)), producing the curve in solid blue shown to the right. Note that while g(h)≈0 when h is a very low negative number (h<−4) and g(h)≈1 when h is a very high positive number (h>4), when h is close to zero (−2<h<2), g(h) transitions roughly linearly between 0.1 and 0.9. Consider in contrast galt(h), the dashed red curve, which stays close to 0 for input h = −2 and stays close to 1 for input h = 2.
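For reference, evaluating this sigmoid at a few points illustrates the transition just described: g(−4) ≈ 0.018, g(−2) ≈ 0.12, g(0) = 0.5, g(2) ≈ 0.88, and g(4) ≈ 0.98.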
(a) How can we adjust the definition of g(h) to create the dashed red curve galt(h)? (You do not need to give the exact equation for galt(h), but explain what alterations if any are made to the numerator, denominator, exponent, etc.)
(b) We now wish to have a modified sigmoid galt2 where galt2(h)< .05 only when h<-10 and galt2(h)> .95 only when h>10. Provide an exact algebraic formula for galt2(h) to fit this specification.
In class, we considered the logistic function output as a pseudo-probability and used it to define a likelihood for data given the parameters in w (for the following three questions we will assume b is included in w).
L(y|x; w) = ∏i g(xi; w)^yi · (1 − g(xi; w))^(1−yi)
Note for binary classification, y=0 or y=1. The goal is to find w to maximize L for all data (xi,yi). Let us say we skip the logistic function and learn the two-way classifier w to maximize a sum:
S(y|x; w) = ∑i d · wTxi
7. Replace d with a valid expression such that S(y|x; w) will be largest when w correctly classifies data from class 0 (yi=0) as wTx<0 and data from class 1 (yi=1) as wTx>0. d should be an algebraic expression involving some variables w, xi, and/or yi. For example: d=|xi| or d=10+yi .
8. What is the derivative of S(y|x; w) with respect to a single element of w, wj? (Consider the definition of wTx.)
9. Recall L2 regularization affects the gradient ascent learning rule by adding + wj/λ. Let us now create a new regularization with the prior probability P(wj) = e^(−|wj−μj|/α). What is the resulting change to the gradient ascent learning rule?
B. SVMs
1.
(a) Let us use SVM to define a linear classifier with the following support vectors and 𝛼’s:
𝒙1 = [−2, 3, 0, 0]ᵀ, 𝒙2 = [−2, 0, 0, 3]ᵀ, 𝒙3 = [2, −2, 4, 0]ᵀ, 𝒙4 = [1, 0, −4, 5]ᵀ
𝑦1 = +1, 𝑦2 = +1, 𝑦3 = −1, 𝑦4 = −1
𝛼1 = 1.2, 𝛼2 = 0.9, 𝛼3 = 1.2, 𝛼4 = 0.9
What is the resulting w?
(b) Consider the same support vectors as before but a different set of 𝛼’s:
𝛼1 = 1.5, 𝛼2 = 1, 𝛼3 = 2, 𝛼4 =? ? ?
Presuming x1, x2, x3, and x4 are the only support vectors available, what value must 𝛼4 have?
What is the resulting w?
2. The respective update rules for logistic classification and SVM learning are
𝑤 ← 𝑤 + 𝜀𝒙𝑖(𝑦𝑖 − 𝑔)
and
𝑤 ← 𝑤 − 𝜀(2𝑤 − ∑𝑖 𝑦𝑖𝜆𝑖𝒙𝑖)
(a) Explain in 1-2 sentences how logistic classification learning minimizes the impact of correctly-labeled data points far from the hyperplane.
(b) Explain in 1-2 sentences how SVM learning minimizes the impact of correctly-labeled data points far from the hyperplane.
(c) Which of the two methods (or both) allows all data points to be used in the final hyperplane definition?
We did not yet define “Kernel functions”, but I recommend students try this question anyway … just treat it as a norm function that takes in two vectors. It will count for a small amount of credit.
3. (6 points) Consider the following kernel function:
K(𝒂, 𝒃) = 𝒂ᵀ𝒃 / (|𝒂||𝒃|)
And further consider the following support vectors:
𝒙4 = [1, −2, 0, −2]ᵀ   𝒙5 = [0, 0.5, 0, 1]ᵀ   𝒙6 = [−0.2, 1, 1, −0.5]ᵀ
What is the output of the kernel function for each pair of inputs below?
K(x4,x6) K(x6,x5)
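For intuition, here is the same kernel evaluated on a toy pair of vectors (chosen purely for illustration; these are not the support vectors above): for 𝒂 = [1, 0, 2, −1]ᵀ and 𝒃 = [0, 3, 1, 1]ᵀ, K(𝒂, 𝒃) = (1·0 + 0·3 + 2·1 + (−1)·1) / (√6 · √11) = 1/√66 ≈ 0.12.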
C. Programming
In this question you will implement algorithms for learning parameters and classifying with Logistic Classification.
To submit Part C, leave your code in the Python file hw2.py inside your private/CIS5800 directory.
Accessing our data
The file hw2data.mat is available on our website (and on erdos using
cp ~dleeds/MLpublic/hw2data.mat .) Load this file into Python to get access to the fullData numpy array. For this array, each row is one example student. Columns 0 through 9 represent ten features, corresponding to diverse information about the student – her/his GPA, SAT scores, age, level of financial aid, etc. The last column (column fullData[:,10]) represents the student major yi – 0 (History) or 1 (Chemistry).
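A minimal loading sketch (one possible approach, assuming scipy is installed; the key name fullData comes from the description above):
from scipy.io import loadmat

data = loadmat('hw2data.mat')   # loadmat returns a dict keyed by variable name
fullData = data['fullData']     # shape (n, 11): ten feature columns plus the label column
X = fullData[:, 0:10]           # features: columns 0 through 9
y = fullData[:, 10]             # major label: 0 (History) or 1 (Chemistry)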
1. Write a function called sigmoidLikelihood that takes in the m features for n students as a np-array of dimension (n,m), the corresponding class labels for n students as a np-array of dimension (n), and the separating hyper-plane w as a np-array of dimension (m+1). The b scalar offset is included as the final element of w. The function returns the sigmoid-based pseudo-likelihood of each student being in the corresponding class, as a np-array of dimension (n).
Use the following syntax:
LVector = sigmoidLikelihood(X, y, w)
The pseudo-likelihood is computed using the sigmoid function (which you must implement for this homework).
2. The function sigmoidLikelihood returns an np-array of pseudo-likelihoods for n data points. If we wish to calculate the full pseudo-likelihood across all input, we will multiply the function outputs: ℒ = ∏𝑖 𝑃(𝒙𝑖, 𝑦𝑖; 𝒘) or np.prod(LVector) . However, if too many small probabilities are multiplied together, the total pseudo-likelihood will be approximated to exactly zero, losing potentially valuable information. This estimation to zero can be avoided by taking the pseudo log-likelihood np.sum(np.log(LVector)) .
(a) If LVector contains all values of 0.05 ([0.05, 0.05, 0.05, …, 0.05]), how many data points (elements in LVector) are needed for np.prod to estimate the pseudo-likelihood as perfectly 0?
(b) What is the pseudo-log-likelihood equivalent given the number of data points from part (a)?
Include your answers as a comment in hw2.py.
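As a quick illustration of the underflow issue described in question 2 (the value 0.01 and the count 1000 below are arbitrary, not the values asked about in part (a)):
import numpy as np

LVector = np.full(1000, 0.01)      # 1000 identical small pseudo-likelihoods
print(np.prod(LVector))            # 0.0 -- the product underflows float64
print(np.sum(np.log(LVector)))     # about -4605.2, still a usable number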
3. Write a function called learnLogistic that takes in the initial hyper-plane vector w0 as a np-array of dimension (m+1) with b as the final element, training data x with n data points and m features as a np-array of dimension (n,m), the correct labels for each point y as a np-array of dimension (n), and the number of learning loops K. The function outputs the new weights w as a np-array of dimension (m+1) and the log-likelihood of the input data after each loop as a np-array of dimension (K).
Use the following syntax:
w, LHistory = learnLogistic(w0, X, y, K)
Note: For each “loop,” learnLogistic should loop through each data point in the training set and use gradient ascent to update w for each data point.
# this is pseudo-code!
for dataPt i :
    for feature j :
        update[j] += stepSize * updateRule(x[i,j], y[i])
    for feature j :
        w[j] += update[j]
Assume stepSize=0.01.
4. In class, we discussed that learnLogistic can run faster if you use fewer loops. For example, instead of looping on feature j, numpy allows you to write:
# this is pseudo-code!
update += stepSize * updateRule(x[i,:],y[i])
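For intuition, here is a small standalone sketch of the same idea, using a generic per-point scalar in place of the homework's updateRule:
import numpy as np

stepSize = 0.01
x_i = np.array([0.5, -1.0, 2.0, 0.0])     # one data point with m = 4 features
residual = 0.3                             # stand-in scalar, e.g. (y_i - g) for this point
updateLoop = np.zeros(4)
updateVec = np.zeros(4)

for j in range(4):                         # per-feature loop, as in learnLogistic
    updateLoop[j] += stepSize * residual * x_i[j]

updateVec += stepSize * residual * x_i     # one vectorized operation over all features
print(np.allclose(updateLoop, updateVec))  # True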
(a) Write learnLogisticFast to accept the same inputs and produce the same outputs as learnLogistic above, but using vector math to remove at least one loop (you may be able to remove more than one, but only have to remove one).
(b) Compare the speeds of learnLogistic and learnLogisticFast, by running them on the data in hw2data.mat with at least 10 loops. Provide your answer in a comment.
Specifically, use time.time() to determine the time it takes each function to run:
# this part is not pseudo-code
# comment this out/delete it before submitting hw2.py
# JUST SUBMIT THE TIME IT TAKES TO RUN learnLogistic AND learnLogisticFast
import time
timeStart = time.time()
w, LHistory = learnLogistic(w0, X, y, K)
timeEnd = time.time()
print(timeEnd - timeStart)
5. Write a function logisticClassify that takes in the m feature values for the n data points as a np-array with dimensions (n,m) and the weight vector w as a np-array with dimension (m+1) (with b as the final element of w). The function returns the 0/1 label for each data point as a np-array of dimension (n).
Use the following syntax:
classLabels=logisticClassify(x,w)