CS224N Assignment 2: Tensorflow Softmax, Neural Transition-Based Dependency Parsing

1         Tensorflow Softmax
In this question, we will implement a linear classifier with loss function

J(W) = CE(y,softmax(xW))

where x is a row vector of features and W is the weight matrix for the model. We will use TensorFlow's automatic differentiation capability to fit this model to the provided data.

Implement the softmax function using TensorFlow in q1_softmax.py. Remember that

softmax(x)_i = e^{x_i} / ∑_j e^{x_j}

Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (non-exhaustive) tests by running python q1_softmax.py.
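For concreteness, here is a minimal sketch of one way to build such a softmax from elementary TensorFlow ops (TF 1.x API, as used in this assignment); the max-subtraction is only for numerical stability and is not required by the formula above:

```python
import tensorflow as tf

def softmax(x):
    """Row-wise softmax built from elementary ops (no tf.nn.softmax).

    x: tensor of shape (n_samples, n_features); returns a tensor of the same
    shape whose rows sum to 1.
    """
    # Subtract the row-wise max; softmax is shift-invariant and this avoids overflow.
    x_max = tf.expand_dims(tf.reduce_max(x, axis=1), 1)
    exp_x = tf.exp(x - x_max)
    return exp_x / tf.expand_dims(tf.reduce_sum(exp_x, axis=1), 1)
```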

Implement the cross-entropy loss using TensorFlow in q1_softmax.py. Remember that

CE(y, ŷ) = −∑_{i=1}^{Nc} y_i log(ŷ_i)

where y ∈ R^{Nc} is a one-hot label vector and Nc is the number of classes. This loss is summed over all examples (rows) of a minibatch. Note that you may not use TensorFlow's built-in cross-entropy functions for this question. You can run basic (non-exhaustive) tests by running python q1_softmax.py.
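A minimal sketch of the summed cross-entropy using elementary TF 1.x ops (tf.to_float and tf.log; newer releases spell these tf.cast and tf.math.log), assuming y arrives as a one-hot matrix:

```python
import tensorflow as tf

def cross_entropy_loss(y, yhat):
    """Cross-entropy summed over all examples (rows) of a minibatch.

    y:    (n_samples, n_classes) one-hot labels (integer or float).
    yhat: (n_samples, n_classes) predicted distributions (rows sum to 1).
    """
    # Cast labels to float, then sum -y * log(yhat) over classes and examples.
    return -tf.reduce_sum(tf.to_float(y) * tf.log(yhat))
```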

Carefully study the Model class in model.py. Briefly explain the purpose of placeholder variables and feed dictionaries in TensorFlow computations. Fill in the implementations for add_placeholders and create_feed_dict in q1_classifier.py.

Hint: Note that configuration variables are stored in the Config class. You will need to use these configuration variables in the code.
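To make the placeholder/feed-dict pattern concrete, here is a hedged sketch; the attribute names, Config fields (n_features, n_classes), and shapes are illustrative and may not match the starter code exactly:

```python
import tensorflow as tf

class SoftmaxModelSketch(object):
    """Illustrative fragment of a Model subclass (names and shapes are assumptions)."""

    def add_placeholders(self):
        # Placeholders are graph inputs: they hold no data themselves, but are
        # filled with concrete arrays at session.run time via a feed dict.
        self.input_placeholder = tf.placeholder(
            tf.float32, shape=(None, self.config.n_features))
        self.labels_placeholder = tf.placeholder(
            tf.int32, shape=(None, self.config.n_classes))

    def create_feed_dict(self, inputs_batch, labels_batch=None):
        # A feed dict maps each placeholder node to the numpy array that should
        # flow through it for this run (labels are optional, e.g. at test time).
        feed_dict = {self.input_placeholder: inputs_batch}
        if labels_batch is not None:
            feed_dict[self.labels_placeholder] = labels_batch
        return feed_dict
```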

Implement the transformation for a softmax classifier in the function add_prediction_op in q1_classifier.py. Add cross-entropy loss in the function add_loss_op in the same file. Use the implementations from the earlier parts of the problem, not TensorFlow built-ins.
Fill in the implementation for add_training_op in q1_classifier.py. Explain how TensorFlow's automatic differentiation removes the need for us to define gradients explicitly. Verify that your model is able to fit to synthetic data by running python q1_classifier.py and making sure that the tests pass.
Hint: Make sure to use the learning rate specified in Config.
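A hedged sketch of the three method bodies (shown as plain defs for brevity), reusing the softmax and cross_entropy_loss functions above; the Config field names (n_features, n_classes, lr) are assumptions and may differ in the starter code:

```python
import tensorflow as tf

def add_prediction_op(self):
    # Single linear transform xW pushed through the hand-rolled softmax from q1_softmax.py.
    W = tf.Variable(tf.zeros((self.config.n_features, self.config.n_classes)))
    return softmax(tf.matmul(self.input_placeholder, W))

def add_loss_op(self, pred):
    # Hand-rolled cross-entropy from q1_softmax.py, not a TensorFlow built-in.
    return cross_entropy_loss(self.labels_placeholder, pred)

def add_training_op(self, loss):
    # The optimizer traverses the graph and builds gradient ops via automatic
    # differentiation, so no gradients need to be derived by hand.
    return tf.train.GradientDescentOptimizer(self.config.lr).minimize(loss)
```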

2         Neural Transition-Based Dependency Parsing
In this section, you’ll be implementing a neural-network based dependency parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:

A stack of words that are currently being processed.
A buffer of words yet to be processed.
A list of dependencies predicted by the parser.
Initially, the stack only contains ROOT, the dependencies list is empty, and the buffer contains all words of the sentence in order. At each step, the parser applies a transition to the partial parse until its buffer is empty and the stack is of size 1. The following transitions can be applied:

SHIFT: removes the first word from the buffer and pushes it onto the stack.
LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the first item and removes the second item from the stack.
RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second item and removes the first item from the stack.
Your parser will decide among transitions at each state using a neural network classifier. First, you will implement the partial parse representation and transition functions.

(6 points, written) Go through the sequence of transitions needed for parsing the sentence "I parsed this sentence correctly". The dependency tree for the sentence is shown below (figure not reproduced here). At each step, give the configuration of the stack and buffer, as well as what transition was applied at this step and what new dependency was added (if any). The first three steps are provided below as an example.
stack              | buffer                                 | new dependency | transition
[ROOT]             | [I, parsed, this, sentence, correctly] |                | Initial Configuration
[ROOT, I]          | [parsed, this, sentence, correctly]    |                | SHIFT
[ROOT, I, parsed]  | [this, sentence, correctly]            |                | SHIFT
[ROOT, parsed]     | [this, sentence, correctly]            | parsed→I       | LEFT-ARC
A sentence containing n words will be parsed in how many steps (in terms of n)?
Briefly explain why.

Implement the __init__ and parse_step functions in the PartialParse class in q2_parser_transitions.py. These implement the transition mechanics your parser will use. You can run basic (non-exhaustive) tests by running python q2_parser_transitions.py.
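As a reference point, a hedged sketch of parse_step, assuming the partial parse stores Python lists named stack, buffer, and dependencies, that dependencies are (head, dependent) pairs, and that transitions are encoded as "S", "LA", and "RA" (conventions that may differ in your starter code):

```python
def parse_step(self, transition):
    """Apply a single transition to this partial parse (illustrative sketch)."""
    if transition == "S":                      # SHIFT: move first buffer word onto the stack
        self.stack.append(self.buffer.pop(0))
    elif transition == "LA":                   # LEFT-ARC: second stack item depends on the first
        dependent = self.stack.pop(-2)
        self.dependencies.append((self.stack[-1], dependent))
    else:                                      # RIGHT-ARC: first stack item depends on the second
        dependent = self.stack.pop()
        self.dependencies.append((self.stack[-1], dependent))
```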
Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more efficiently when making predictions about batches of data at a time (i.e., predicting the next transition for many different partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.
Algorithm 1: Minibatch Dependency Parsing

Input: sentences, a list of sentences to be parsed, and model, our model that makes parse decisions.

    Initialize partial_parses as a list of partial parses, one for each sentence in sentences
    Initialize unfinished_parses as a shallow copy of partial_parses
    while unfinished_parses is not empty do
        Take the first batch_size parses in unfinished_parses as a minibatch
        Use the model to predict the next transition for each partial parse in the minibatch
        Perform a parse step on each partial parse in the minibatch with its predicted transition
        Remove the completed parses from unfinished_parses
    end while

Return: the dependencies for each (now completed) parse in partial_parses.

Implement this algorithm in the minibatch_parse function in q2_parser_transitions.py. You can run basic (non-exhaustive) tests by running python q2_parser_transitions.py.

Note: You will need minibatch_parse to be correctly implemented to evaluate the model you will build in part (h). However, you do not need it to train the model, so you should be able to complete most of part (h) even if minibatch_parse is not implemented yet.
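A hedged sketch of the minibatch algorithm, assuming the PartialParse interface above and a model.predict(parses) method that returns one transition per partial parse (the actual prediction interface in the starter code may differ):

```python
def minibatch_parse(sentences, model, batch_size):
    """Parse sentences in minibatches, following Algorithm 1 (illustrative sketch)."""
    partial_parses = [PartialParse(s) for s in sentences]
    unfinished_parses = partial_parses[:]            # shallow copy
    while unfinished_parses:
        minibatch = unfinished_parses[:batch_size]
        transitions = model.predict(minibatch)       # one transition per parse (assumed API)
        for parse, transition in zip(minibatch, transitions):
            parse.parse_step(transition)
        # A parse is complete when its buffer is empty and only ROOT remains on the stack.
        unfinished_parses = [p for p in unfinished_parses
                             if len(p.buffer) > 0 or len(p.stack) > 1]
    return [p.dependencies for p in partial_parses]
```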

We are now going to train a neural network to predict, given the state of the stack, buffer, and dependencies, which transition should be applied next. First, the model extracts a feature vector representing the current state. We will be using the feature set presented in the original neural dependency parsing paper: A Fast and Accurate Dependency Parser using Neural Networks [1]. The function extracting these features has been implemented for you in parser_utils.py. This feature vector consists of a list of tokens (e.g., the last word in the stack, first word in the buffer, dependent of the second-to-last word in the stack if there is one, etc.). They can be represented as a list of integers

[w_1, w_2, ..., w_m]

where m is the number of features and each 0 ≤ w_i < |V| is the index of a token in the vocabulary (|V| is the vocabulary size). First our network looks up an embedding for each word and concatenates them into a single input vector:

x = [L_{w_1}, L_{w_2}, ..., L_{w_m}] ∈ R^{dm}

where L ∈ R^{|V|×d} is an embedding matrix with each row L_i being the vector for a particular word i. We then compute our prediction as:

h = ReLU(xW + b1)

ŷ = softmax(hU + b2)

(recall that ReLU(z) = max(z,0)). We evaluate using cross-entropy loss:

J(θ) = CE(y, ŷ) = −∑_{i=1}^{Nc} y_i log(ŷ_i)

To compute the loss for the training set, we average this J(θ) across all training examples.
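A hedged sketch of how this forward computation might look in TensorFlow (the function name, argument names, and use of tf.nn.embedding_lookup and built-in activations are assumptions; the starter code's add_prediction_op may be organized differently):

```python
import tensorflow as tf

def parser_prediction_op(word_ids, L, W, b1, U, b2, n_features, embed_size):
    """Forward pass of the parser network described above (illustrative only).

    word_ids: (batch_size, n_features) integer token indices.
    L: (|V|, embed_size) embedding matrix; W: (n_features*embed_size, hidden_size);
    U: (hidden_size, n_transitions); b1, b2: bias vectors.
    """
    # Look up the embedding of each feature token and flatten to one row per example.
    x = tf.reshape(tf.nn.embedding_lookup(L, word_ids),
                   [-1, n_features * embed_size])
    h = tf.nn.relu(tf.matmul(x, W) + b1)        # hidden layer: h = ReLU(xW + b1)
    return tf.nn.softmax(tf.matmul(h, U) + b2)  # yhat = softmax(hU + b2)
```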

In order to avoid neurons becoming too correlated and ending up in poor local minima, it is often helpful to randomly initialize parameters. One of the most frequently used initializations is called Xavier initialization [2].
Given a matrix A of dimension m × n, Xavier initialization selects values A_{ij} uniformly from [−ε, ε], where

ε = √(6 / (m + n))
Implement the initialization in xavier_weight_init in q2_initialization.py. You can run basic (non-exhaustive) tests by running python q2_initialization.py. This function will be used to initialize W and U.
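A hedged sketch of the initializer, assuming it should return a function that maps a shape to a freshly drawn tensor (the exact calling convention expected by the starter code's tests may differ slightly):

```python
import tensorflow as tf

def xavier_weight_init():
    """Returns an initializer producing Xavier-initialized tensors (illustrative sketch)."""
    def _xavier_initializer(shape, **kwargs):
        # For a matrix of shape (m, n), epsilon = sqrt(6 / (m + n)); summing over
        # all dimensions also covers bias vectors. Values are drawn uniformly
        # from [-epsilon, epsilon].
        epsilon = (6.0 / sum(shape)) ** 0.5
        return tf.random_uniform(shape, minval=-epsilon, maxval=epsilon)
    return _xavier_initializer
```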

We will regularize our network by applying Dropout [3]. During training, dropout randomly sets units in the hidden layer h to zero with probability p_drop and then multiplies h by a constant γ (dropping different units each minibatch). We can write this as

h_drop = γ d ◦ h

where d ∈ {0,1}^{Dh} (Dh is the size of h) is a mask vector in which each entry is 0 with probability p_drop and 1 with probability (1 − p_drop). γ is chosen such that the expected value of h_drop equals h:

E_{p_drop}[h_drop]_i = h_i

for all 0 < i < Dh. What must γ equal in terms of p_drop? Briefly justify your answer.
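To make the masking concrete, here is an illustrative NumPy sketch with γ left as an explicit argument, since determining its value is exactly what the question asks:

```python
import numpy as np

def apply_dropout(h, p_drop, gamma):
    """Training-time dropout: zero each unit of h with probability p_drop,
    then scale the result by gamma (the constant the question asks you to find)."""
    d = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)  # 1 = keep, 0 = drop
    return gamma * d * h
```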

We will train our model using the Adam optimizer [4]. Recall that standard SGD uses the update rule

θ ← θ − α ∇θ J_minibatch(θ)

where θ is a vector containing all of the model parameters, J is the loss function, ∇θ J_minibatch(θ) is the gradient of the loss function with respect to the parameters on a minibatch of data, and α is the learning rate. Adam uses a more sophisticated update rule with two additional steps [5].

First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:
m ← β1 m + (1 − β1) ∇θ J_minibatch(θ)
θ ← θ − α m

where β1 is a hyperparameter between 0 and 1 (often set to 0.9). Briefly explain (you don’t need to prove mathematically, just give an intuition) how using m stops the updates from varying as much. Why might this help with learning?

Adam also uses adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

m ← β1 m + (1 − β1) ∇θ J_minibatch(θ)
v ← β2 v + (1 − β2) (∇θ J_minibatch(θ) ◦ ∇θ J_minibatch(θ))
θ ← θ − α ◦ m / √v

where ◦ and / denote elementwise multiplication and division (so z ◦ z is elementwise squaring) and β2 is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by √v, which of the model parameters will get larger updates? Why might this help with learning?
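An illustrative NumPy sketch of this simplified update (the full Adam algorithm also applies bias-correction terms, omitted here; the eps constant is a common numerical-stability addition and not part of the equations above):

```python
import numpy as np

def adam_step(theta, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    """One update of the simplified Adam rule described above (no bias correction)."""
    m = beta1 * m + (1 - beta1) * grad              # rolling average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad       # rolling average of squared gradients
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # eps guards against division by zero
    return theta, m, v
```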

In q2_parser_model.py, implement the neural network classifier governing the dependency parser by filling in the appropriate sections. We will train and evaluate our model on the Penn Treebank (annotated with Universal Dependencies). Run python q2_parser_model.py to train your model and compute predictions on the test data (make sure to turn off debug settings when doing final evaluation).
Hints:

When debugging, pass the keyword argument debug=True to the main method (it is set to True by default). This will cause the code to run over a small subset of the data, so training the model won't take as long.
This code should run within 1 hour on a CPU.
When running with debug=False, you should be able to get a loss smaller than 0.07 on the train set (by the end of the last epoch) and an Unlabeled Attachment Score larger than 88 on the dev set (with the best-performing model out of all the epochs). For comparison, the model in the original neural dependency parsing paper gets 92.5. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).
Add an extension to your model (e.g., L2 regularization, an additional hidden layer) and report the change in UAS on the dev set. Briefly explain what your extension is and why it helps (or hurts!) the model. Some extensions may require tweaking the hyperparameters in Config to make them effective.
3         Recurrent Neural Networks: Language Modeling
In this section, you’ll compute the gradients of a recurrent neural network (RNN) for language modeling.

Language modeling is a central task in NLP, and language models can be found at the heart of speech recognition, machine translation, and many other systems. Given a sequence of words (represented as one-hot row vectors) x(1), x(2), ..., x(t), a language model predicts the next word x(t+1) by modeling:

P(x(t+1) = vj | x(t),...,x(1))

where vj is a word in the vocabulary.

Your job is to compute the gradients of a recurrent neural network language model, which uses feedback information in the hidden layer to model the “history” x(t),x(t−1),...,x(1). Formally, the model[6] is, for t = 1,...,n − 1:

e(t) = x(t) L
h(t) = sigmoid(h(t−1) H + e(t) I + b1)
ŷ(t) = softmax(h(t) U + b2)

where h(0) = h0 ∈ RDh is some initialization vector for the hidden layer and x(t)L is the product of L with the one-hot row vector x(t) representing the current word. The parameters are:

L ∈ R^{|V|×d},   H ∈ R^{Dh×Dh},   I ∈ R^{d×Dh},   b1 ∈ R^{Dh},   U ∈ R^{Dh×|V|},   b2 ∈ R^{|V|}                              (1)

where L is the embedding matrix, I the input word representation matrix, H the hidden transformation matrix, and U is the output word representation matrix. b1 and b2 are biases. d is the embedding dimension, |V | is the vocabulary size, and Dh is the hidden layer dimension.

The output vector ŷ(t) ∈ R^{|V|} is a probability distribution over the vocabulary. The model is trained by minimizing the (un-regularized) cross-entropy loss:

J(t)(θ) = CE(y(t), ŷ(t)) = −∑_{j=1}^{|V|} y_j(t) log ŷ_j(t)

where y(t) is the one-hot vector corresponding to the target word (which here is equal to x(t+1)). We average the cross-entropy loss across all examples (i.e., words) in a sequence to get the loss for a single sequence.
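To make the shapes and data flow concrete, here is an illustrative NumPy sketch of a single forward timestep and its loss, following the notation above (x_t and y_t are one-hot row vectors; helper names are not part of any starter code):

```python
import numpy as np

def rnn_step(x_t, h_prev, L, H, I, b1, U, b2):
    """One forward timestep of the RNN language model described above."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    e_t = x_t.dot(L)                                # embedding lookup, shape (d,)
    h_t = sigmoid(h_prev.dot(H) + e_t.dot(I) + b1)  # hidden state, shape (Dh,)
    y_hat = softmax(h_t.dot(U) + b2)                # distribution over |V| words
    return h_t, y_hat

def ce_loss(y_t, y_hat):
    """Cross-entropy at one timestep; y_t is the one-hot target (equal to x_{t+1})."""
    return -np.sum(y_t * np.log(y_hat))
```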

Conventionally, when reporting performance of a language model, we evaluate on perplexity, which is defined as:
PP(t)(y(t), ŷ(t)) = 1 / P̄(x_pred(t+1) = x(t+1) | x(t), ..., x(1))

i.e., the inverse probability of the correct word, according to the model distribution P̄. Show how you can derive perplexity from the cross-entropy loss (Hint: remember that y(t) is one-hot!), and thus argue that minimizing the (arithmetic) mean cross-entropy loss will also minimize the (geometric) mean perplexity across the training set. This should be a very short problem - not too perplexing!

For a vocabulary of |V | words, what would you expect perplexity to be if your model predictions were completely random (chosen uniformly from the vocabulary)? Compute the corresponding cross-entropy loss for |V | = 10000.

Compute the gradients of the loss J with respect to the following model parameters at a single point in time t (to save a bit of time, you don't have to compute the gradients with respect to U and b1):

∂J(t)/∂b2,   ∂J(t)/∂Lx(t),   ∂J(t)/∂I |(t),   ∂J(t)/∂H |(t)

where Lx(t) is the row of L corresponding to the current word x(t), and |(t) denotes the gradient for the appearance of that parameter at time t (equivalently, h(t−1) is taken to be fixed, and you need not backpropagate to earlier timesteps just yet - you'll do that in part (c)).

Additionally, compute the derivative with respect to the previous hidden layer value:

∂J(t)/∂h(t−1)

Below is a sketch of the network at a single timestep (figure not reproduced here):
Draw the “unrolled” network for 3 timesteps, and compute the backpropagation-through-time gradients:

∂J(t)/∂Lx(t−1),   ∂J(t)/∂I |(t−1),   ∂J(t)/∂H |(t−1)

where |(t−1) denotes the gradient for the appearance of that parameter at time (t − 1). Because parameters are used multiple times in feed-forward computation, we need to compute the gradient for each time they appear.

You should use the backpropagation rules from Lecture 5 [7] to express these derivatives in terms of the error term computed in the previous part. (Doing so will allow for re-use of expressions for t − 2, t − 3, and so on.)

Note that the true gradient with respect to a training example requires us to run backpropagation all the way back to t = 0. In practice, however, we generally truncate this and only backpropagate for a fixed number τ ≈ 5 − 10 timesteps.

(d) Given h(t−1), how many operations are required to perform one step of forward propagation to compute J(t)(θ)? How about backpropagation for a single step in time? For τ steps in time? Express your answer in big-O notation in terms of the dimensions d, Dh and |V | (Equation 1).
