10418 Homework 1-LEARNING TO SEARCH Solved

Starting from:

$30

1 Written Questions
Answer the following questions in the template provided. Then upload your solutions to Gradescope. You may use LATEX or print the template and hand-write your answers then scan it in. Failure to use the template may result in a penalty.

1.1 Reductions to Binary Classification
1. (1 point) Can we construct an ECOC matrix that yields the same algorithm as One vs. All classification? If yes, describe the matrix; if not, explain why not.

2. (1 point) Can we construct an ECOC matrix that yields the same algorithm as All vs. All classification?

If yes, describe the matrix; if not, explain why not.

3. (1 point) True or False: One-against-some classifier has logarithmic runtime in the number of examples.

True

False

4. (1 point) True or False: Extreme classification assumes a large label set.

True

False

5. (2 points) For a full binary classification tree of depth d, what is the probability of making a correct classification on X~ . Assume that the probability of a binary classifier at any node making an error is . (A full binary tree is one in which every node, except the leaves, has two children.)

6. Numerical answer: For k-way multi-classification with k = 16, how many classifiers need to be constructed for...

(a) (1 point) ...One vs. All classification?

(b) (1 point) ...All vs. All classification?

7. (1 point) For an arbitrary classification tree with k classes, what is the minimum number of binary classifiers that could be needed?

8. (1 point) For an arbitrary classification tree with k classes, What is the maximum number of classifiers that could be needed? (Assume that the sets of classes for each pair of siblings are mutually exclusive.)

1.2 Learning to Search
In this section, we’ll consider a simple version of learning to search for the task of identifying names in a text. In this setting our action space will be the set {+,−} where + denotes labeling a word as part of a name and − denotes labeling a word as not being part of a name. Our state space will be represented by Sz where z is a vector denoting the labeling of our sentence so far. For example, if we are in state S++− it means we have labeled word 0 as +, word 1 as +, word 2 as −, and we are currently attempting to label word 3.

Throughout this section, we will be referring exclusively to the following input sentence ~x and oracle labeling ~y:

In Figure 1.1 you can see a small part of the search space for this problem.

Figure 1.1

1. (1 point) How many nodes are in the search space for sentence ~x?

3. Suppose our loss function is Hamming loss.

(a) (1 point) What does the oracle policy return for S++−?

(b) (1 point) What does the oracle policy return for S−−−?

4. Suppose our loss function is Hamming loss with an additional error for including only part of a name,

e.g. omitting a person’s title. More precisely,

(a) (1 point) What does the oracle policy return for S−?

(b) (1 point) What does the oracle policy return for S+?

5. Suppose that our scoring function is of the form θT f(xi,yi) where

f(xi,yi) =[I{xi starts with a capital letter and yi = +} I{xi starts with a capital letter and yi = −}, I{ is in our gazetteer list and yi = +},

I{ is in our gazetteer list and yi = −}, I{yi−1 = −,yi = −}]

and θ is a vector of length 5. A gazetteer list is essentially a lookup table of entities that we know are names. Our gazetteer list will include { Xavier, Mellon }.

Assume our initial weight vector θ = [−1,1,−1,1,0].

(a) (1 point) What score would our scoring function assign to the ground truth labeling?

(b) (1 point) What labeling would the greedy policy induced by this scoring function return? Break any ties with +.

(c) (1 point) Suppose we use the Supervised Approach to Imitation Learning. Which (state, action) pairs would be produced by the learning algorithm? Denote the action by either + or −. Denote a state by the partial sequence it corresponds to (e.g. the state +− corresponds to a state that took action + followed by action −).

(d) (1 point) Suppose we use DAgger, Which (state, action) pairs would be produced by the learning algorithm? Use the same denotation of states and actions as in the previous question.

We decide to train our linear model using the Perceptron update rule. That is, if the classifier (aka. greedy policy) makes a positive mistake (i.e. a mistake where yi = +) in its action selection, then we add the feature vector to the weights. If it makes a negative mistake (i.e. a mistake where yi = −), then we subtract the feature vector from the weights. We treat the arrival of each (state, action) pair as a separate online example for the classifier.

(e) (1 point) Using the (state, action) pairs produced by the Supervised Approach to Imitation Learning, what would the new value of θ be after completing the corresponding Perceptron updates?

(f) (1 point) Using the (state, action) pairs produced by DAgger, what would the new value of θ be after completing the corresponding Perceptron updates?

1.3 Recurrent Neural Network Language Models
In this section, we wish to use an Elman Network as a building block to design an RNN language model. Below we define the Elman network recursively.

bt = relu(BT bt−1 + AT at + d) ct = softmax(CT bt + e)

where at,∀t is given as input to the network; A,B,C,d,e are parameters of the network; and the initial hidden units b0 are also treated as parameters. In this problem, we assume at,bt,ct ∈ R2 for all t, i.e. all vectors have length two, and that the parameters matrices and vectors are appropriately defined to preserve those dimensions. Above, the function relu(·) is the Rectified Linear Unit function applied elementwise to its vector-valued parameter. The function softmax(·) is likewise vector-valued and defined below. We use [·]i to explicitly denote the ith element of its argument.

[relu(v)]i , max(0,vi)

[softmax

Figure 1.2 depicts the Elman Network. The parameters are shown in red, but their connections into the computation graph are not shown.

Figure 1.2

Assume that we wish to build a language model p(y) of binary sequences y ∈ {+,−}L of fixed length L. We have training data consisting of sequences D = {y(1),...,y(N)}, , where |y(i)| = L. Assume further that we have pre-encoded the data as sequences of one-hot vectors D0 = {z(1),...,z(N)} such that:

if , then z , if , then z .

For such a pair of vectors, we write that z(i) = one-hot(y(i))

1. (1 point) Short answer: Since we wish to treat this Elman Network as an RNN-LM, how many inputs at will we need for a single training example y ∈ D with |y| = L? Explain your answer.

2. (1 point) Select one: If we decide to train this RNN-LM with Teacher Forcing, we will need to compute a loss function for an input example y ∈ D. Assuming so, how would we define at? Note: Be sure to account for all t in your definition.

3. (1 point) Select one: If we decide to train this RNN-LM with Scheduled Sampling, we will need to compute a loss function for an input example y ∈ D. Assuming we use a schedule that always selects the model policy with probability 1, how would we define at? Note: Be sure to account for all t in your definition.

4. (1 point) Write the cross entropy loss ` for a single training example z ∈ D0 in terms of the units and/or parameters of the RNN-LM: at,bt,ct,A,B,C,d,e.

Suppose we have parameter values as defined below:

A B C (1.1) d e b (1.2)
5. Numerical answer: When computing the probability of the sequence y = [+,−,+], what is the value of the following three quantities? Note: Round each numerical value to two significant figures.

(a) (1 point) b1,1 =

(b) (1 point) b2,1 =

6. Numerical answer: When computing the probability of the sequence y = [+,−,+], what is the value of the following three quantities? Note: Round each numerical value to two significant figures.

(a) (1 point) c1,1 =

(b) (1 point) c2,1 =

7. (1 point) Numerical answer: What is the probability of the sequence y = [+,−,+] according to this RNNLM? Note: Round the numerical value to two significant figures.

p(y) =

8. (1 point) Numerical answer: What is the probability of the (length one!) sequence y0 = [−] according to this RNNLM? Note: Round the numerical value to two significant figures.

p(y0) =

1.4 Empirical Questions
The following questions should be completed after you work through the programming portion of this assignment (Section 2).

1. (10 points) Record your model’s performance on the test set after 10 epochs in terms of Cross Entropy (CE) and Character Error Rate (CER) when trained with the following schemes. Note: Round each numerical value to two significant figures.

Schedule
CE
CER
All Oracle

All Model

β = 0.75

Linear Decay

Exponential Decay

2. (10 points) Plot training and testing cross entropy curves for three different training procedures: All Oracle, All Model, and the fixed β = 0.75 training procedure. Let the x-axis ranges over 10 epochs.

Note: Your plot must be machine generated.

3. In class we saw that we can prove a no-regret bound for sequences of β such that

.

(a) (1 point) Show that a fixed β does not satisfy this condition.

(b) (1 point) Show that exponential decay does satisfy this condition.

(c) (1 point) Did this theoretical difference make a difference in practice? Briefly explain why or why not with respect to this dataset and problem setting.

1.5 Wrap-up Questions
1. (1 point) Multiple Choice: Did you correctly submit your code to Autolab?

Yes

No

2. (1 point) Numerical answer: How many hours did you spend on this assignment?.

2 Programming
Your goal in this assignment is to implement a deep learning model for acoustic speech recognition (ASR). You will implement a function to encode speech data into a fixed length vector, a function to decode this fixed length vector into a sequence of characters, and a variety of strategies to train the overall model.

Your solution for this section must be implemented in PyTorch using the data files we have provided to you. This restriction is because we will be grading your code by hand to check your understanding as well as your model’s performance.

2.1 Task Background
Acoustic speech recognition is the problem of taking in recordings of people talking and predicting what was said. While many successful models try to predict linguistically meaningfully units like phonemes, we will be directly predicting characters from waveforms for simplicity.

Though we will train our model with a standard classification loss (Cross-Entropy), what we really care about is the accuracy of our transcription. For a given target sentence, we define the Character Error Rate as the number of character deletions (d), character insertions (i), and character substitutions (s) need to transform the output transcription to the goal transcription over the number of characters in the output transcription (n).

CER (2.1)
Conveniently, this equation can be calculated with the following snippet of code, which runs a dynamic programming algorithm to compute edit distance:

from nltk.metrics.distance import edit_distance

cer = edit_distance(our_string, goal_string) / len(our_string)

2.2 Data
In order to reduce your workload and computational requirements, we have preprocessed a subset of the TIMIT dataset for you to use in this task.

The data is divided into two folders, train and test. In each folder is a sequence of pairs of numpy files (.npy) and text files (.txt). If a numpy file is named ”xyz.npy”, it’s corresponding transcription can be found in ”xyz.txt”.

Given Input: Preprocessed audio files in numpy array of size 20 × L, where L is the number of timesteps in the audio file.

Goal Output: Preprocessed text files of the transcribed audio.

For this section’s points, you will need to implement data loaders for PyTorch so that we can efficiently train our model batch by batch. Note that because we are working with sequences, we will need all sequences in a batch to have the same length.

(Hint: PyTorch allows you to pass a ”collate fn” to torch.utils.data.DataLoader and provides a function called ”pad sequence” in torch.nn.utils.rnn)

2.3 Seq2Seq Implementation
A sequence to sequence model (commonly abbreviated Seq2Seq) is a model that takes in a sequence of input, like words in an English sentence or samples of a waveform, and outputs another sequence, like words in Arabic or transcribed phonemes. Models of this type are frequently used in machine translation, text to speech applications, and automatic speech recognition.

Seq2Seq models comprise of two parts, an encoder and a decoder. The encoder takes in a sequence of inputs and outputs an encoding that represents all of the information contained in the input. The decode then takes in this encoding and outputs a sequence in the new domain. This process can be seen in the figure below.

Figure 2.1: The encoder and decoder are trained jointly in a Seq2Seq model.

For this section, you must implement a working Seq2Seq model. Your implementation must have both the encoder and decoder as single-layer LSTMs with 50% dropout applied to the input layer to the encoder. Every hidden dimension in your neural networks should be 128 and your embedding size should be 256. Set your optimizer to be Adam with the default PyTorch parameters. Your program should be able to run quickly and easily on a laptop due to the simplicity of our model and the limited size of our data.

2.4 DAgger Implementation
As we’ve seen in class, DAgger is an algorithm for collecting training examples for our model by sometimes following the actions performed by an expert and sometimes following our model’s decisions.

For this section’s points, you will need to implement DAgger as described in Algorithm 1. Note that this algorithm allows βi to vary with timestep i. This allows us to explore various schedules for the β parameter, including some that don’t have the theoretical guarantees discussed in class.

Please implement a general version of DAgger and then the code necessary to run DAgger with (1) a fixed β, (2) a linearly decaying β where β is decreased by 0.05 after each epoch, and (3) an exponentially decaying β, where β = exp−1 × i where i is the current epoch.

Algorithm 1 DAgger

Initialize D ← ∅.

Initialize πˆ1 to any policy in Π. for i=1 to N do

Let πi = βiπ∗ + (1 − βi)πˆi

Sample T-step trajectories using πi.

Get dataset Di = {s,π∗(s)} of visited states by πi and actions given by the expert.

Aggregate datasets: D ← D ∪ Di.

Train classifier πˆi+1 on D.

end for return best πˆi on validation.

2.5 Autolab Submission
You must submit a .tar file named seq2seq.tar containing seq2seq.py, which contains all of your code.

You can create that file by running:

tar -cvf seq2seq.tar seq2seq.py

from the directory containing your code.

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab.

Your code will not be autograded on Autolab. Instead, we will grade your code by hand; that is, we will read through your code in order to grade it. As such, please carefully identify major sections of the code via comments.

3 Collaboration Policy
After you have completed all other components of this assignment, report your answers to the collaboration policy questions detailed in the Academic Integrity Policies for this course.

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details including names of people who helped you and the exact nature of help you received.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details including names of people you helped and the exact nature of help you offered.

3. Did you find or come across code that implements any part of this assignment? If so, include full details including the source of the code and how you used it in the assignment.

More products

COMP5970- Assignment 2a Solved

$30

Add to cart

COMP5970- Assignment 1c Solved

$30

Add to cart

COMP5970- Assignment 1b Solved

$30

Add to cart