$30
The Learning Problem
1. Which of the following problem is suited for machine learning if there is assumed to be enough associated data? Choose the correct answer; explain how you can possibly use machine learning to solve it.
[a] predicting the winning number of the next invoice lottery
[b] calculating the average score of 500 students
[c] identifying the exact minimal spanning tree of a graph
[d] ranking mango images by the quality of the mangoes
[e] none of the other choices
2. Which of the following describes an machine learning approach to build a system for spam detection? Choose the correct answer; explain briefly why you think other choices are not machine learning.
[a] flip 3 fair coins; classify the email as a spam iff at least 2 of them are heads
[b] forward the email to 3 humans; classify the email as a spam iff at least 2 of them believe so
[c] produce a list of words for spams by 3 humans; classify the email as a spam iff the email contains more than 10 words from the list
[d] get a data set that contains spams and non-spams, for all words in the data set, let the machine calculate the ratio of spams per word; produce a list of words that appear more than 5 times and are of the highest 20% ratio; classify the email as a spam iff the email contains more than 10 words from the list
[e] get a data set that contains spams and non-spams, for all words in the data set, let the machine decide its “spam score”; sum the score up for each email; let the machine optimize a threshold that achieves the best precision of spam detection; classify the email as a spam iff the email is of score more than the threshold
Perceptron Learning Algorithm
Next, we will play with multiple variations to the Perceptron Learning Algorithm (PLA).
3. Dr. Short scales down all xn (including the x0 within) linearly by a factor of 4 before running PLA. How does the worst-case speed of PLA (in terms of the bound on page 16 of lecture 2) change after scaling? Choose the correct answer; explain your answer.
[a] 4 times smaller (i.e. faster)
[b] 2 times smaller
√
[c] 2 times smaller
[d] unchanged
√
[e] 2 times larger (i.e. slower)
4. The scaling in the previous problem is equivalent to inserting a “learning rate” η to the PLA update rule
wt+1 ← wt + η · yn(t)xn(t)
with . In fact, we do not need to use a fixed η. Let ηt denote the learning rate in the t-th iteration; that is, let PLA update wt by
wt+1 ← wt + ηt · yn(t)xn(t)
whenever (xn(t),yn(t)) is not correctly classified by wt. Dr. Adaptive decides to set so
“longer” xn(t) will not affect wt too much. Let
,
which can be viewed as a “normalized” version of the ρ on page 16 of lecture 2. The bound on the same page then becomes ˆρ−p after using this adaptive ηt. What is p? Choose the correct answer; explain your answer.
[a] 0
[b] 1
[c] 2
[d] 4
[e] 8
5. Another possibility of setting ηt is to consider how negative yn(t)wtTxn(t) is, and try to make
0; that is, let wt+1 correctly classify (xn(t),yn(t)). Which of the following update
rules make 0? Choose the correct answer; explain your answer.
[a] wt+1 ← wt + 2 · yn(t)xn(t)
6. Dr. Separate decides to use one of the update rules in the previous problem for PLA. When the data set is linear separable, how many choices in the previous problem ensures halting with a “perfect line”? Choose the correct answer; explain the reason behind each halting case.
[a] 1
[b] 2
[c] 3
[d] 4
[e] 5
Types of Learning
7. One shared technique between the famous AlphaGo, AlphaGo Zero, and AlphaStar is called selfpracticing: learning to play the game by practicing with itself and getting the feedback from the “judge” environment. What best describes the learning problem behind self-practicing? Choose the correct answer; explain your answer.
[a] human learning
[b] unsupervised learning
[c] semi-supervised learning
[d] supervised learning
[e] reinforcement learning
8. Consider formulating a learning problem for building a self-driving car. First, we gather a training data set that consists of 100 hours of video that contains the view in front of a car, and records about how the human behind the wheel acted with physically constrained choices like steering, braking, and signaling-before-turning. We also gather another 100 hours of videos from 1126 more cars without the human records. The learning algorithm is expected to learn from all the videos to obtain a hypothesis that imitates the human actions well. What learning problem best matches the description above? Choose the correct answer; explain your answer.
[a] regression, unsupervised learning, active learning, concrete features
[b] structured learning, semi-supervised learning, batch learning, raw features
[c] structured learning, supervised learning, batch learning, concrete features
[d] regression, reinforcement learning, batch learning, concrete features
[e] structured learning, supervised learning, online learning, concrete features
(We are definitely not hinting that you should build a self-driving car this way. :-D )
Off-Training-Set Error
As discussed on page 5 of lecture 4, what we really care about is whether g ≈ f outside D. For a set of “universe” examples U with D ⊂ U, the error outside D is typically called the Off-Training-Set (OTS) error
1 X
Eots(h) = h(x) 6= y .
|U \ D| (x,y)∈U\D J K
9. Consider U with 6 examples
Run the process of choosing any three examples from U as D, and learn a perceptron hypothesis (say, with PLA, or any of your “human learning” algorithm) to achieve Ein(g) = 0 on D. Then, evaluate g outside D. What is the smallest and largest Eots(g)? Choose the correct answer; explain your answer.
[e] (0,1)
Hoeffding Inequality
10. Suppose you are given a biased coin with one side coming up with probability . How many times do you need to toss the coin to find out the more probable side with probability at least 1−δ using the Hoeffding’s Inequality mentioned in page 10 of lecture 4? Choose the correct answer; explain your answer. (Hint: There are multiple versions of Hoeffding’s inequality. Please use the version in the lecture, albeit slightly loose, for answering this question. The log here is loge.)
Bad Data
11. Consider x = [x1,x2]T ∈ R2, a target function f(x) = sign(x1), a hypothesis h1(x) = sign(2x1−x2), and another hypothesis h2(x) = sign(x2). When drawing 5 examples independently and uniformly within [−1,+1] × [−1,+1] as D, what is the probability that we get 5 examples (xn,f(xn)) such that Ein(h2) = 0? Choose the correct answer; explain your answer. (Note: This is one of the BAD-data cases for h2 where Ein(h2) is far from Eout(h2).)
[a] 0
[b] [c]
[d]
[e] 1
12. Following the setting of the previous problem, what is the probability that we get 5 examples such that Ein(h2) = Ein(h1), including both the zero and non-zero Ein cases? Choose the correct answer; explain your answer. (Note: This is one of the BAD-data cases where we cannot distinguish the better-Eout hypothesis h1 from the worse hypothesis h2.)
[a]
[b] [c] [d]
[e]
13. According to page 22 of lecture 4, for a hypothesis set H,
BAD D for H ⇐⇒ ∃h ∈ H s.t.
Let x = [x1,x2,··· ,xd]T ∈ Rd with d 1. Consider a binary classification target with Y = {+1,−1} and a hypothesis set H with 2d hypotheses h1,··· ,h2d.
• For i = 1,··· ,d, hi(x) = sign(xi).
• For i = d + 1,··· ,2d, hi(x) = −sign(xi−d).
Extend the Hoeffding’s Inequality mentioned in page 10 of lecture 4 with a proper union bound. Then, for any given N and , what is the smallest C that makes this inequality true?
P[BAD .
Choose the correct answer; explain your answer.
[a] C = 1 [b] C = d
[c] C = 2d
[d] C = 4d [e] C = ∞
Multiple-Bin Sampling
14. We then illustrate what happens with multiple-bin sampling with an experiment that use a dice (instead of a marble) to bind the six faces together. Please note that the dice is not meant to be thrown for random experiments. The probability below only refers to drawing the dices from the bag. Try to view each number as a hypothesis, and each dice as an example in our multiple-bin scenario. You can see that no single number is always green—that is, Eout of each hypothesis is always non-zero. In the next two problems, we are essentially asking you to calculate the probability of getting Ein(h3) = 0, and the probability of the minimum Ein(hi) = 0.
Consider four kinds of dice in a bag, with the same (super large) quantity for each kind.
• A: all even numbers are colored green, all odd numbers are colored orange
• B: (2,3,4) are colored green, others are colored orange
• C: the number 6 is colored green, all other numbers are colored orange
• D: all primes are colored green, others are colored orange
If we draw 5 dices independently from the bag, which combination is of the same probability as getting five green 3’s? Choose the correct answer; explain your answer.
[a] five green 1’s
[b] five orange 2’s
[c] five green 2’s
[d] five green 4’s [e] five green 5’s
15. Following the previous problem, if we draw 5 dices independently from the bag, what is the probability that we get some number that is purely green? Choose the correct answer; explain your answer.
[a]
[b] [c] [d]
[e]
Experiments with Perceptron Learning Algorithm
Next, we use an artificial data set to study PLA. The data set with N = 100 examples is in
http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw1/hw1_train.dat
Each line of the data set contains one (xn,yn) with xn ∈ R10. The first 10 numbers of the line contains the components of xn orderly, the last number is yn. Please initialize your algorithm with w = 0 and take sign(0) as −1.
16. (*) Please first follow page 4 of lecture 2, and add x0 = 1 to every xn. Implement a version of PLA that randomly picks an example (xn,yn) in every iteration, and updates wt if and only if wt is incorrect on the example. Note that the random picking can be simply implemented with replacement—that is, the same example can be picked multiple times, even consecutively. Stop updating and return wt as wPLA if wt is correct consecutively after checking 5N randomly-picked examples.
Hint: (1) The update procedure described above is equivalent to the procedure of gathering all the incorrect examples first and then randomly picking an example among the incorrect ones. But the description above is usually much easier to implement. (2) The stopping criterion above is a randomized, more efficient implementation of checking whether wt makes no mistakes on the data set.
Repeat your experiment for 1000 times, each with a different random seed. What is the median number of updates before the algorithm returns wPLA? Choose the closest value.
[a] 8
[b] 11
[c] 14
[d] 17
[e] 20
17. (*) Among all the w0 (the zero-th component of wPLA) obtained from the 1000 experiments above, what is the median? Choose the closest value.
[a] -10
[b] -5 [c] 0
[d] 5
[e] 10
18. (*) Set x0 = 10 to every xn instead of x0 = 1, and repeat the 1000 experiments above. What is the median number of updates before the algorithm returns wPLA? Choose the closest value.
[a] 8
[b] 11
[c] 14
[d] 17
[e] 20
19. (*) Set x0 = 0 to every xn instead of x0 = 1. This equivalently means not adding any x0, and you will get a separating hyperplane that passes the origin. Repeat the 1000 experiments above. What is the median number of updates before the algorithm returns wPLA?
[a] 8
[b] 11
[c] 14
[d] 17
[e] 20
20. (*) Now, in addition to setting x0 = 0 to every xn, scale down each xn by 4. Repeat the 1000 experiments above. What is the median number of updates before the algorithm returns wPLA? Choose the closest value.
[a] 8
[b] 11
[c] 14
[d] 17
[e] 20