$25
Homework 3
CSE 541: Interactive Learning
Contextual Bandits
1. Problem 18.8 of [SzepesvariLattimore].
2. In this exercise we will implement several contextual bandit algorithms. We will “fake” a contextual bandit problem with multi-class classification dataset where each example is context, and the learner chooses an “action” among the available class labels, and receives a reward of 1 if the guess was correct, and 0 otherwise. However, keeping with bandit feedback, we assume the learner only knows the reward of the action played, not all actions.
We will use the MNIST dataset[1]. The MNIST dataset contains 28x28 images of handwritten digits from 0-9. Download this dataset and use the python-mnist library[2] to load it into Python. Rather than using the full images, you may run PCA on the data to come up with a lower dimensional representation of each image. You will have to experiment with what dimension, d, to use. Scale all images so that they are norm
1.
Let the d dimensional representation of the tth image in the dataset, ct, be our “context.” Our action set A = {0,1,...,9} has 10 actions associated with each label. For each i ∈ A = {0,1,...,9} define the feature map φ(c,i) = vec(ce>i ) ∈ R10d. If v(c,a) is the expected reward of playing action a ∈ A in response to context c, then let us “model the world” with the simple linear model so that v(c,a) ≈ hθ∗,φ(c,a)i for some unknown θ∗ ∈ R10d. Of course, when actually playing the game we will observe image features ct as the context, choose an “action” at ∈ {0,...,9}, and receive reward rt = 1{at = yt} where yt is the true label of the image ct and at is the action played.
Implement the Explore-Then-Commit algorithms, Follow-The-Leader, LinUCB, and Thompson Sampling algorithms for this problem. You can use just the training set of T = 50000 examples. The training set is class balanced meaning that there are 5000 examples of each digit. Important: randomly shuffle the dataset so the probability of any particular class showing up at any given time is 1/10. The algorithms work as follows:
• Explore-Then-Commit (“Model the world”): Fix τ ∈ [T]. For the first τ steps, select each action a ∈ A uniformly at random. Compute θb = argmin . For t > τ play at = argmaxa∈Ahφ(ct,a),θbi. Choose a value of τ and justify it.
• Explore-Then-Commit (“Model the bias”): Fix τ ∈ [T]. For the first τ steps, select each action a ∈ A uniformly at random. Our goal is to identify a policy πb : C → A using the dataset {(ct,at,pt,rt)}t≤τ such that
π = argmax b
t=1
= argmin
t t=1
= argmin X 1{π(ct) 6= at} π∈Π
t∈[τ]:rt=1
where the last line uses the fact that pt = 1/10 due to uniform exploration and the definition of rt.
Note that this is just a multi-class classification problem on dataset {(ct,at)}t∈[τ]:rt=1 where one is trying to identify a classifier πb : C → A that predicts label at from features ct. Train a 10-class linear logistic classifier[3] πb on the data up to time [τ] and then for t > τ play at = argmaxa∈{0,...,9} πb(ct). Choose the same value of τ as “Model the world”.
1
• Follow-The-Leader: Fix τ ∈ [T]. For the first τ steps, select each action a ∈ A uniformly at random.
For t > τ play at = argmaxa∈Ahφ(ct,a),θbt−1i where = argmin . Choose a value of τ and justify it.
• LinUCB Using Ridge regression with an appropriate γ > 0 (γ = 1 may be okay) construct the confidence set Ct derived in class (and in the book). At each time t ∈ [T] play at = argmaxa∈A maxθ∈Cthθ,φ(ct,a)i.
• Thompson Sampling Fix γ > 0 (γ = 1 may be okay). At time t ∈ [T] draw
at = argmax where = argmin and Vt = γI +
Ps φ(cs,as)φ(cs,as)>.
=1
Implement each of these algorithms and show a plot of the regret (all algorithms on one plot) when run on MNIST for good choices of τ,γ. Hint, for computing Vt−1 efficiently see https://en.wikipedia.org/ wiki/Sherman%E2%80%93Morrison_formula.
2
[1] http://yann.lecun.com/exdb/mnist/
[2] https://pypi.org/project/python-mnist/
[3] Please feel free to use an off-the-shelf method to train logistic regression such as https://scikit-learn.org/stable/auto_ examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py