NTU CSIE Machine Learning: Homework 1

1.    Which of the following problems is suited for machine learning, assuming there is enough associated data? Choose the correct answer; explain how you could use machine learning to solve it.

[a]   identifying the shortest distance path from Taipei to Kao-Hsiung

[b]   predicting the winning number of the super lotto

[c]    calculating the total salary of all research assistants in a lab

[d]   sorting your e-commerce website users by their predicted chances of making a purchase in the next 7 days

[e]   none of the other choices

Modifications of PLA

2.    One possible modification of PLA is to insert a per-iteration "learning rate" η_t into the update rule

w_{t+1} ← w_t + y_{n(t)} x_{n(t)} · η_t

There are many possibilities for setting the learning rate η_t. One is to consider how negative y_{n(t)} w_t^T x_{n(t)} is, and to aggressively make y_{n(t)} w_{t+1}^T x_{n(t)} > 0; that is, to let w_{t+1} correctly classify (x_{n(t)}, y_{n(t)}). Another one is to conservatively decrease η_t so there is less oscillation during the final convergence stage. Which of the following update rules make y_{n(t)} w_{t+1}^T x_{n(t)} > 0? Choose the correct answer; explain your answer.

[a]   w_{t+1} ← w_t + y_{n(t)} x_{n(t)} · 2^{−t}
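For concreteness, here is a minimal sketch (not part of the original problem) of the modified update, together with one possible "aggressive" choice of η_t that forces y_{n(t)} w_{t+1}^T x_{n(t)} > 0; the function names and the margin constant are illustrative assumptions only:

```python
import numpy as np

def pla_step_with_eta(w, x, y, eta):
    """One modified PLA update, applied only when (x, y) is misclassified."""
    return w + eta * y * x

def aggressive_eta(w, x, y, margin=1e-6):
    """Pick eta so that y*(w + eta*y*x)^T x = y*w^T x + eta*||x||^2 exceeds 0."""
    return max(0.0, -y * np.dot(w, x)) / np.dot(x, x) + margin
```

With this choice of η_t, the freshly updated weight vector classifies (x_{n(t)}, y_{n(t)}) correctly, which is exactly the "aggressive" behavior described above.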

3.    Dr. Separate decides to use one of the update rules in the previous problem for PLA. When the data set is linearly separable, how many of the choices in the previous problem ensure halting with a "perfect line"? Choose the correct answer; explain the reason behind each perfect-halting case.

[a] 1
[b] 2
[c] 3
[d] 4
[e] 5

   
Extensions of PLA


4.    Consider online spam detection with machine learning. We will represent each email x by the distinct words that it contains. In particular, assume that there are at most m distinct words in each email, and each word belongs to a big dictionary of size d ≥ m. The i-th component is defined as x_i = ⟦word i is in email x⟧ for i = 1, 2, ..., d, and x_0 = 1 as always. We will assume that d_+ of the words in the dictionary are more spam-like, and d_− = d − d_+ of the words are less spam-like. A simple function that classifies whether an email is spam is to count z_+(x), the number of more spam-like words in the email (ignoring duplicates), and z_−(x), the number of less spam-like words in the email, and classify by

f(x) = sign(z_+(x) − z_−(x) − 0.5).

That is, an email x is classified as spam iff the integer z_+(x) is larger than the integer z_−(x).

Assume that f can perfectly classify any email into spam/non-spam, but is unknown to us. We now run an online version of PLA to try to approximate f. That is, we maintain a weight vector w_t in the online PLA, initialized with w_0 = 0. Then for every email x_t encountered at time t, the algorithm makes a prediction sign(w_t^T x_t), and receives a true label y_t. If the prediction is not the same as the true label (i.e. a mistake), the algorithm updates w_t by

w_{t+1} ← w_t + y_t x_t.

Otherwise the algorithm keeps w_t without updating:

w_{t+1} ← w_t.
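A minimal sketch of this mistake-driven online loop, assuming the emails arrive as a Python iterable of (x_t, y_t) pairs with x_t already encoded as a (d+1)-dimensional 0/1 vector including x_0 = 1; the encoding itself is not part of the problem statement:

```python
import numpy as np

def online_pla(stream, dim):
    """Online PLA: predict first, then update only when the prediction is a mistake."""
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                        # x: (d+1)-dim 0/1 vector, y: +1 or -1
        pred = 1 if np.dot(w, x) > 0 else -1   # sign(0) taken as -1 here
        if pred != y:
            w = w + y * x                      # mistake-driven update
            mistakes += 1
    return w, mistakes
```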

What is the maximum number of mistakes that the online PLA can make for this spam classification problem? Choose the tightest upper bound; explain your answer.

Note: For those who know the bag-of-words representation for documents, the representation we use is a simplification that ignores duplicates of the same word.

[a]   4(m + 1)^2

[b]   4(d + 1)(m + 1)

5. For multiclass classification with K classes, the multiclass PLA maintains K weight vectors w_1^{(t)}, ..., w_K^{(t)}, where (t) is used to indicate the vectors in the t-th iteration. The goal is to obtain some hypothesis g represented by the solutions w_k^* returned from the algorithm, where g makes a prediction by g(x) = argmax_{k ∈ {1,2,...,K}} (w_k^*)^T x.

All weight vectors are initialized to 0 in the beginning. In iteration t, the multiclass PLA first finds a "mistake example" with index n(t) such that

y_{n(t)} ≠ argmax_{k ∈ {1,2,...,K}} (w_k^{(t)})^T x_{n(t)}.

Abbreviate y_{n(t)} as y and let y' be the argmax result above. The multiclass PLA then updates the weight vectors w_y^{(t)} and w_{y'}^{(t)} by

w_y^{(t+1)} ← w_y^{(t)} + x_{n(t)},    w_{y'}^{(t+1)} ← w_{y'}^{(t)} − x_{n(t)}.

If there are no more mistakes, return all w_k^{(t)} as the multiclass PLA solution w_k^*.
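A minimal sketch of the multiclass PLA described above, assuming NumPy arrays, labels re-indexed to 0, ..., K−1 (subtract 1 from the problem's {1, ..., K} labels), and a simple in-order sweep over the examples until a full pass makes no mistake; the sweeping order is an assumption, not part of the problem statement:

```python
import numpy as np

def multiclass_pla(X, y, K):
    """Multiclass PLA: on a mistake, reward w_y and penalize the wrongly chosen w_{y'}."""
    N, d = X.shape
    W = np.zeros((K, d))                        # row k holds w_k
    while True:
        mistake = False
        for n in range(N):
            y_pred = int(np.argmax(W @ X[n]))   # note: np.argmax breaks ties by first index
            if y_pred != y[n]:
                W[y[n]] += X[n]                 # w_y    <- w_y    + x_{n(t)}
                W[y_pred] -= X[n]               # w_{y'} <- w_{y'} - x_{n(t)}
                mistake = True
        if not mistake:
            return W                            # rows are the solutions w_k^*
```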

Assume that K = 2. Given a linearly separable data set {(x_n, y_n)}_{n=1}^{N}, where y_n ∈ {1,2}.

Run the binary PLA (initialized with the zero weight vector) taught in class on the data set with transformed labels ỹ_n = 2y_n − 3 (so ỹ_n ∈ {−1,+1}). Then, run the multiclass PLA above with the same order of examples (i.e. n(t) across the two algorithms are the same). What is the relationship between wPLA produced by the binary PLA, and the w1∗ and w2∗ produced by the multiclass PLA? Choose the correct answer; explain your answer.

Note: If there is a tie in argmax, we will assume the “worst-case” scenario that argmax returns some k that is not equal to yn(t). Similarly, for the binary PLA, if sign(0) is encountered when checking the mistake on some (xn(t),yn(t)), we will assume the “worst-case” scenario that sign returns the sign that is not equal to yn(t).

[a] wPLA 
[b] wPLA 
[c] wPLA 
[d] wPLA 
[e] none of the other choices

The Learning Problems

6.    Consider the following process: “We set up five cameras and capture video sequences from five different angles. The video sequences are timed but not specifically labeled. After collecting the sequences, we train a learning model that outputs a real-valued vector for each image in the video. The goal is that images that are taken at similar time stamps should be mapped to similar vectors, and images that are at distant time stamps should be mapped to different vectors. The learned model can then be combined with a variety of tasks, such as being mixed with some labeled data for object recognition, or some interactive environment that trains a robot to grab an object from the table.” Which of the following best matches the process? Choose the best answer; explain your answer.

[a]   active learning

[b]   reinforcement learning

[c]    multi-class classification

[d]   self-supervised learning

[e]   online learning

7.    Consider formulating a learning problem for building a news tagger. First, we gather a training data set that consists of 1126 news articles, stored in utf8 encoding, and ask domain experts to tag each article by its categories. Each article can belong to several different categories. We also gather another 11265566 news articles from our database, without expert labeling. The learning algorithm is expected to learn from all the articles and tags (if any) to obtain a hypothesis that tags future news articles well. What learning problem best matches the description above? Choose the best answer; explain your answer.

[a]   regression, unsupervised learning, active learning, concrete features

[b]   text generation, semi-supervised learning, batch learning, concrete features

[c]    multilabel classification, semi-supervised learning, batch learning, raw features

[d]   regression, reinforcement learning, batch learning, concrete features

[e]   multilabel classification, supervised learning, online learning, concrete features

(We are definitely not hinting that you should build a news tagger this way. :-D )

Off-Training-Set Error

8.    As discussed in Lecture 3, what we really care about is whether g ≈ f outside D. For a set of "universe" examples U with D ⊂ U, the error outside D is typically called the Off-Training-Set (OTS) error

Eots(g) = (1 / |U \ D|) · Σ_{x ∈ U \ D} ⟦g(x) ≠ f(x)⟧.
Consider U with 6 examples (x, f(x)):

(0,0), +1
(0,3), +1
(1,1), −1
(2,1), −1
(3,0), −1
Run the process of choosing any three examples from U as D, and learn a perceptron hypothesis (say, with PLA, or any of your "human learning" algorithms) to achieve Ein(g) = 0 on D. Then, evaluate g outside D. What are the smallest and largest possible Eots(g)? Choose the correct answer; explain your answer.

 

Point Estimation
9.    In Lecture 3, we try to infer the unknown out-of-sample error Eout(h) from the in-sample error Ein(h) based on the observed data. This is an example of so-called point estimation. Formally, a point estimator is defined as a function of the data:

θ̂ = g(x_1, ..., x_N),

where {x_1, ..., x_N} is a set of N examples generated independently and identically (i.i.d.) from a distribution P. That is, θ̂ is also a random variable. Assume that the quantity we try to infer is θ (determined by the distribution P); we attempt to use θ̂ as a "guess" of θ. For instance, θ = Eout(h) and θ̂ = Ein(h) in our lecture. One criterion for judging the goodness of θ̂ is whether it is unbiased. We call θ̂ unbiased iff

E[θ̂] = θ.

Consider the following point estimators. Which of them is not unbiased? Choose the correct answer; explain your answer.

[a] θ̂ = Ein(h) = (1/N) Σ_{n=1}^{N} ⟦h(x_n) ≠ y_n⟧ and θ = Eout(h) = E_{x∼P} ⟦h(x) ≠ f(x)⟧, where h is a fixed hypothesis, f(x) = y is the target function, and P is some fixed probability distribution that generates the data.

[b] θ̂ = (1/N) Σ_{n=1}^{N} x_n. P is a Bernoulli distribution defined by P(x) = θ^x (1 − θ)^{1−x}, where x ∈ {0,1}.

[c] θ̂ = max{x_1, ..., x_N}. P is a discrete uniform distribution whose probability mass function is defined as P(x) = 1/M for x ∈ {1, ..., M} (M > N > 0). θ = M.

[d] θ̂ = (1/N) Σ_{n=1}^{N} x_n^2. P is a zero-mean Gaussian distribution and θ is its variance.

[e] none of the other choices (i.e. all of the other choices are unbiased)
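As a side note, unbiasedness can also be probed numerically: repeatedly draw data sets from a known P, compute θ̂ on each, and average. A minimal sketch (the trial count, N, and M below are arbitrary assumptions), illustrated with the sample-maximum estimator of choice [c]:

```python
import random

def average_estimate(estimator, sampler, n_trials=100_000, N=5):
    """Approximate E[theta_hat] by averaging the estimator over many simulated data sets."""
    total = 0.0
    for _ in range(n_trials):
        data = [sampler() for _ in range(N)]
        total += estimator(data)
    return total / n_trials

M = 10                                             # true theta for the discrete uniform case
approx = average_estimate(estimator=max, sampler=lambda: random.randint(1, M))
print(f"E[theta_hat] is roughly {approx:.3f}, compared with theta = {M}")
```

If the averaged value stays clearly away from θ as the number of trials grows, the estimator is biased.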

Bad Data

10.   Consider x = [x1, x2]^T ∈ R^2, a target function f(x) = sign(x1), a hypothesis h1(x) = sign(2x1 − x2), and another hypothesis h2(x) = sign(x2). What is (Eout(h1), Eout(h2)) under the uniform distribution on [−1,+1] × [−1,+1] that generates x? Choose the correct answer; explain your answer.
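One way to sanity-check a hand-derived answer here is to estimate both Eout values by Monte Carlo sampling from the same uniform distribution; a rough sketch (the sample size is an arbitrary assumption), where sign(0) is treated as −1, a probability-zero event under this distribution:

```python
import random

def sign(v):
    return 1 if v > 0 else -1          # sign(0) treated as -1 (probability-zero case here)

f  = lambda x1, x2: sign(x1)
h1 = lambda x1, x2: sign(2 * x1 - x2)
h2 = lambda x1, x2: sign(x2)

n = 1_000_000
err1 = err2 = 0
for _ in range(n):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    err1 += (h1(x1, x2) != f(x1, x2))
    err2 += (h2(x1, x2) != f(x1, x2))
print(err1 / n, err2 / n)              # Monte Carlo estimates of Eout(h1) and Eout(h2)
```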

11.   Following the previous problem, when drawing 4 examples independently and uniformly within [−1,+1]×[−1,+1] as D, what is the probability that we get 4 examples such that Ein(h2) = Ein(h1), including both the zero and non-zero Ein cases? Choose the correct answer; explain your answer.

Note: This is one of the BAD-data cases where we cannot distinguish the better-Eout hypothesis h1 from the worse hypothesis h2.

[a]
[b]
[c]
[d]
[e]
Multiple-Bin Sampling

12.   We illustrate what happens with multiple-bin sampling with an experiment that uses dice (instead of marbles) to bind the six faces together. Please note that the dice are not meant to be thrown for random experiments; the probability below only refers to drawing dice from the bag. Try to view each number as a hypothesis, and each die as an example in our multiple-bin scenario. You can see that no single number is always green; that is, Eout of each hypothesis is always non-zero. In the following problem, we are essentially asking you to calculate the probability of getting the minimum Ein(hi) = 0.

Consider four kinds of dice in a bag, with the same (super large) quantity for each kind.

•   A: all even numbers are colored green, all odd numbers are colored orange

•   B: (1,2,6) are colored green, others are colored orange

•   C: the number 6 is colored green, all other numbers are colored orange

•   D: all primes are colored green, others are colored orange

If we draw 5 dice independently from the bag, what is the probability that we get some number that is purely green? Choose the correct answer; explain your answer.

[a]
[b]
[c]
[d]
[e]
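The probability in question can be sanity-checked by brute-force enumeration over all 4^5 equally likely draws; a small sketch (the dictionary encoding of the four kinds of dice is just one way to write them down):

```python
from itertools import product
from fractions import Fraction

GREEN = {
    'A': {2, 4, 6},     # all even numbers green
    'B': {1, 2, 6},
    'C': {6},
    'D': {2, 3, 5},     # all primes green
}

draws = list(product('ABCD', repeat=5))          # every equally likely 5-dice draw
hits = sum(
    1 for dice in draws
    if any(all(n in GREEN[d] for d in dice) for n in range(1, 7))
)
print(Fraction(hits, len(draws)))                # P(some number is purely green)
```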

Experiments with Perceptron Learning Algorithm

Next, we use an artificial data set to study PLA. The data set with N = 100 examples is in

http://www.csie.ntu.edu.tw/~htlin/course/ml21fall/hw1/hw1_train.dat

Each line of the data set contains one (x_n, y_n) with x_n ∈ R^10. The first 10 numbers of the line are the components of x_n in order; the last number is y_n. Please initialize your algorithm with w = 0 and take sign(0) as −1.

13.   (*) Please first follow page 29 of Lecture 1, and add x_0 = 1 to every x_n. Implement a version of PLA that randomly picks an example (x_n, y_n) in every iteration, and updates w_t if and only if w_t is incorrect on the example. Note that the random picking can simply be implemented with replacement; that is, the same example can be picked multiple times, even consecutively. Stop updating and return w_t as w_PLA once w_t has been correct on 5N consecutively checked, randomly-picked examples.

Hint: (1) The update procedure described above is equivalent to gathering all the incorrect examples first and then randomly picking one among them, but the description above is usually much easier to implement. (2) The stopping criterion above is a randomized, more efficient implementation of checking whether w_t makes no mistakes on the data set.
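A minimal sketch of the experiment described above (the local file name, the NumPy usage, and the seeding scheme are assumptions, not part of the problem statement):

```python
import numpy as np

def pla_random(X, y, rng):
    """Random-pick PLA: update on a mistake; stop after 5N consecutive correct checks."""
    N, d = X.shape
    w = np.zeros(d)
    streak = 0
    while streak < 5 * N:
        n = rng.integers(N)                       # pick an example with replacement
        pred = 1 if X[n] @ w > 0 else -1          # sign(0) taken as -1
        if pred != y[n]:
            w += y[n] * X[n]                      # PLA update
            streak = 0
        else:
            streak += 1
    return w

data = np.loadtxt("hw1_train.dat")                # 10 features per line, label last
X, y = data[:, :-1], data[:, -1]
X = np.hstack([np.ones((len(X), 1)), X])          # add x_0 = 1

lengths = [np.sum(pla_random(X, y, np.random.default_rng(seed)) ** 2)
           for seed in range(1000)]               # 1000 experiments, different seeds
print(np.mean(lengths))                           # average squared length of w_PLA
```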

Repeat your experiment 1000 times, each with a different random seed. What is the average squared length of w_PLA, that is, the average ‖w_PLA‖^2? Choose the closest value.

[a]   360

[b]   390

[c]    420

[d]   450

[e]   480

14.   (*) Scale up each x_n by 2, including scaling each x_0 from 1 to 2. Then, run PLA on the scaled examples for 1000 experiments. What is the average squared length of w_PLA, that is, the average ‖w_PLA‖^2? Choose the closest value.

[a]   1260

[b]   1410

[c]    1560

[d]   1710

[e]   1860

15.   (*) Scale down each x_n (including x_0) by ‖x_n‖, which makes each scaled example of length 1. Then, run PLA on the scaled examples for 1000 experiments. What is the average squared length of w_PLA, that is, the average ‖w_PLA‖^2? Choose the closest value.

[a]   3.1

[b]   4.1

[c]    5.1

[d]   6.1

[e]   7.1

16.   (*) Set x_0 = 0 for every x_n instead of x_0 = 1, and do not do any scaling. This equivalently means not adding any x_0, so you will get a separating hyperplane that passes through the origin. Repeat the 1000 experiments above. What is the average squared length of w_PLA, that is, the average ‖w_PLA‖^2? Choose the closest value.

[a]   530

[b]   580

[c]    630

[d]   680

[e]   730
