CS224N Assignment 3

A primer on named entity recognition
In this assignment, we will build several different models for named entity recognition (NER). NER is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. In this assignment, given a word in its context, we want to predict whether it represents one of four categories:

•    Person (PER): e.g. “Martha Stewart”, “Obama”, “Tim Wagner”, etc. Pronouns like “he” or “she” are not considered named entities.

•    Organization (ORG): e.g. “American Airlines”, “Goldman Sachs”, “Department of Defense”.

•    Location (LOC): e.g. “Germany”, “Panama Strait”, “Brussels”, but not unnamed locations like “the bar” or “the farm”.

•    Miscellaneous (MISC): e.g. “Japanese”, “USD”, “1,000”, “Englishmen”.

We formulate this as a 5-class classification problem, using the four above classes and a null-class (O) for words that do not represent a named entity (most words fall into this category). For an entity that spans multiple words (“Department of Defense”), each word is separately tagged, and every contiguous sequence of non-null tags is considered to be an entity.

Here is a sample sentence (x(t)) with the named entities tagged above each token (y(t)) as well as hypothetical predictions produced by a system (yˆ(t)):


y(t):   ORG       ORG        O  O     O   ORG  ORG     ...  O          PER  PER     O
yˆ(t):  MISC      O          O  O     O   ORG  O       ...  O          PER  PER     O
x(t):   American  Airlines,  a  unit  of  AMR  Corp.,  ...  spokesman  Tim  Wagner  said.

In the above example, the system mistakenly predicts “American” to be of the MISC class and ignores “Airlines” and “Corp.”. Altogether, it predicts three entities: “American”, “AMR” and “Tim Wagner”.

To evaluate the quality of an NER system’s output, we look at precision, recall and the F1 measure.[1] In particular, we will report precision, recall and F1 at both the token level and the named-entity level. In the former case:

•    Precision is calculated as the ratio of correct non-null labels predicted to the total number of non-null labels predicted (in the above example, p = 3/4).

•    Recall is calculated as the ratio of correct non-null labels predicted to the total number of correct non-null labels (in the above example, r = 3/6).

•    F1 is the harmonic mean of the two: F1 = 2·p·r / (p + r) (in the above example, F1 = 3/5).

For entity-level F1:

•    Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. In our example, “AMR” would be marked incorrectly because it does not cover the whole entity, i.e. “AMR Corp.”, as would “American”, and we would get a precision score of p = 1/3.

•    Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions. Here, we would get a recall score of r = 1/3.

•    Finally, the F1 score is still the harmonic mean of the two, and would be 1/3 in the example.

Our model also outputs a token-level confusion matrix[2]. A confusion matrix is a specific table layout that allows visualization of the classification performance. Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
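To make the token-level metrics and the confusion matrix concrete, here is a minimal Python sketch, separate from the assignment’s starter code, that computes them for the running example (the elided “...” span is dropped):

import numpy as np

# Minimal sketch (not part of the assignment starter code): token-level
# precision/recall/F1 and the confusion matrix for the running example.
CLASSES = ["PER", "ORG", "LOC", "MISC", "O"]

# Gold labels y(t) and predictions y^(t) for:
# American Airlines, a unit of AMR Corp., spokesman Tim Wagner said.
gold = ["ORG", "ORG", "O", "O", "O", "ORG", "ORG", "O", "PER", "PER", "O"]
pred = ["MISC", "O", "O", "O", "O", "ORG", "O", "O", "PER", "PER", "O"]

correct = sum(1 for g, p in zip(gold, pred) if p != "O" and p == g)   # 3
predicted = sum(1 for p in pred if p != "O")                          # 4
actual = sum(1 for g in gold if g != "O")                             # 6

precision = correct / predicted                      # 3/4
recall = correct / actual                            # 3/6
f1 = 2 * precision * recall / (precision + recall)   # 3/5

# Confusion matrix: rows are gold (actual) classes, columns are predictions.
confusion = np.zeros((len(CLASSES), len(CLASSES)), dtype=int)
for g, p in zip(gold, pred):
    confusion[CLASSES.index(g), CLASSES.index(p)] += 1

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
print(confusion)
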

1.     A window into NER
Let’s look at a simple baseline model that predicts a label for each token separately using features from a window around it.

 

Figure 1: A sample input sequence

Figure 1 shows an example of an input sequence and the first window from this sequence. Let x def= x(1), x(2), ..., x(T) be an input sequence of length T and y def= y(1), y(2), ..., y(T) be an output sequence, also of length T. Here, each element x(t) and y(t) is a one-hot vector representing the word at the t-th index of the sentence. In a window-based classifier, every input sequence is split into T new data points, each representing a window and its label. A new input is constructed from a window around x(t) by concatenating w tokens to the left and right of x(t): x˜(t) def= [x(t−w), ..., x(t), ..., x(t+w)]; we continue to use y(t) as its label. For windows centered around tokens at the very beginning of a sentence, we add special start tokens (<START>) to the beginning of the window, and for windows centered around tokens at the very end of a sentence, we add special end tokens (<END>) to the end of the window. For example, consider constructing a window around “Jim” in the sentence above. If the window size were 1, we would add a single start token to the window (resulting in a window of [<START>, Jim, bought]). If the window size were 2, we would add two start tokens to the window (resulting in a window of [<START>, <START>, Jim, bought, 300]).
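To illustrate the windowing scheme, here is a small sketch over raw token strings. This is not the assignment’s make_windowed_data (which transforms whole batches of index vectors); <START> and <END> stand in for the special tokens:

# Illustrative sketch of building window examples from one labeled sentence.
def windows_for_sentence(tokens, labels, w, start="<START>", end="<END>"):
    padded = [start] * w + tokens + [end] * w
    examples = []
    for t, label in enumerate(labels):
        window = padded[t:t + 2 * w + 1]   # w tokens left, the center, w tokens right
        examples.append((window, label))
    return examples

# A window of size 1 around "Jim" picks up a single start token:
print(windows_for_sentence(["Jim", "bought", "300"], ["PER", "O", "O"], w=1))
# [(['<START>', 'Jim', 'bought'], 'PER'), (['Jim', 'bought', '300'], 'O'),
#  (['bought', '300', '<END>'], 'O')]
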

With these, each input and output is of a uniform length (2w + 1 and 1, respectively), and we can use a simple feedforward neural net to predict y(t) from x˜(t):

As a simple but effective model to predict labels from each window, we will use a single hidden layer with a ReLU activation, combined with a softmax output layer and the cross-entropy loss:

e(t) = [x(t−w) L, ..., x(t) L, ..., x(t+w) L]

h(t) = ReLU(e(t) W + b1)

yˆ(t) = softmax(h(t) U + b2)

J = CE(y(t), yˆ(t))

CE(y(t), yˆ(t)) = −Σ_i y_i(t) log(yˆ_i(t)),

where L ∈ ℝ^{V×D} is the matrix of word embeddings, h(t) has dimension H and yˆ(t) has dimension C, where V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5).
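A rough numpy sketch of this forward pass for a single window (parameter names follow the equations above, not the starter code; all sizes and indices are illustrative):

import numpy as np

# Rough numpy sketch of the window model's forward pass.
V, D, H, C, w = 10000, 50, 200, 5, 1
rng = np.random.default_rng(0)
L = rng.normal(size=(V, D))                    # word embeddings
W = rng.normal(size=((2 * w + 1) * D, H))      # hidden-layer weights
b1 = np.zeros(H)
U = rng.normal(size=(H, C))                    # softmax weights
b2 = np.zeros(C)

window_ids = np.array([17, 42, 7])             # indices of x(t-1), x(t), x(t+1)
gold_class = 1                                 # index of the gold label y(t)

e = L[window_ids].reshape(-1)                  # e(t): concatenated embeddings, (2w+1)*D
h = np.maximum(0, e @ W + b1)                  # h(t) = ReLU(e(t) W + b1)
scores = h @ U + b2
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()                           # y^(t) = softmax(h(t) U + b2)
loss = -np.log(y_hat[gold_class])              # J = CE(y(t), y^(t))
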

(a)    

i.       Provide 2 examples of sentences containing a named entity with an ambiguous type (e.g. the entity could either be a person or an organization, or it could either be an organization or not an entity).

ii.     Why might it be important to use features apart from the word itself to predict named entity labels?

iii.    Describe at least two features (apart from the word) that would help in predicting whether a word is part of a named entity or not.

(b)    

i.       What are the dimensions of e(t), W and U if we use a window of size w?

ii.     What is the computational complexity of predicting labels for a sentence of length T?

(c)     Implement a window-based classifier model in q1_window.py using this approach.

To do so, you will have to:

i.       Transform a batch of input sequences into a batch of windowed input-output pairs in the make_windowed_data function. You can test your implementation by running python q1_window.py test1.

ii.     Implement the feed-forward model described above by appropriately completing functions in the WindowModel class. You can test your implementation by running python q1_window.py test2.

iii.    Train your model using the command python q1_window.py train. The code should take only about 2–3 minutes to run and you should get a development score of at least 81% F1.

The model and its output will be saved to results/window/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model’s predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.

Finally, you can interact with your model using:

python q1_window.py shell -m results/window/<timestamp>/

(d)  (written) Analyze the predictions of your model using the files generated above.

i.     Report your best development entity-level F1 score and the corresponding token-level confusion matrix. Briefly describe what the confusion matrix tells you about the errors your model is making.

ii.   Describe at least 2 modeling limitations of the window-based model and support these conclusions using examples from your model’s output (i.e. identify errors that your model made due to its limitations). You can also support your conclusions using predictions made by your model on examples manually entered through the shell.

2.     Recurrent neural nets for NER
We will now tackle the task of NER by using a recurrent neural network (RNN).

 

Recall that each RNN cell combines the hidden state vector with the input using a sigmoid; we then use the hidden state to predict the output at each timestep:

e(t) = x(t) L

h(t) = σ(h(t−1) Wh + e(t) Wx + b1)

yˆ(t) = softmax(h(t) U + b2),

where L ∈ ℝ^{V×D} is the matrix of word embeddings, Wh ∈ ℝ^{H×H}, Wx ∈ ℝ^{D×H} and b1 ∈ ℝ^{H} are parameters for the RNN cell, and U ∈ ℝ^{H×C} and b2 ∈ ℝ^{C} are parameters for the softmax. As before, V is the size of the vocabulary, D is the size of the word embedding, H is the size of the hidden layer and C is the number of classes being predicted (here 5).

In order to train the model, we use a cross-entropy loss for every predicted token:

J = Σ_{t=1}^{T} CE(y(t), yˆ(t))

CE(y(t), yˆ(t)) = −Σ_i y_i(t) log(yˆ_i(t)).
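For concreteness, here is a rough numpy sketch of this forward pass and loss for a single, unbatched sentence. Names follow the equations above rather than the starter code; all sizes, word indices and labels are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Numpy sketch of the RNN equations for one unbatched sentence.
V, D, H, C = 10000, 50, 200, 5
rng = np.random.default_rng(0)
L = rng.normal(size=(V, D))                     # word embeddings
Wh, Wx, b1 = rng.normal(size=(H, H)), rng.normal(size=(D, H)), np.zeros(H)
U, b2 = rng.normal(size=(H, C)), np.zeros(C)

sentence = [4, 81, 7, 2]                        # word indices x(1)..x(T)
gold = [1, 1, 0, 0]                             # gold class indices y(1)..y(T)

h, J = np.zeros(H), 0.0                         # h(0) = 0
for idx, y in zip(sentence, gold):
    e = L[idx]                                  # e(t) = x(t) L
    h = sigmoid(h @ Wh + e @ Wx + b1)           # h(t)
    scores = h @ U + b2
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                        # softmax over the 5 classes
    J += -np.log(y_hat[y])                      # CE(y(t), y^(t)) for one-hot y(t)
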

(a)    (written)

i.     How many more parameters does the RNN model have in comparison to the window-based model?

ii.   What is the computational complexity of predicting labels for a sentence of length T (for the RNN model)?

(b)    Recall that the actual score we want to optimize is entity-level F1.

i.     Name at least one scenario in which decreasing the cross-entropy cost would lead to a decrease in entity-level F1 scores.

ii.   Why is it difficult to directly optimize for F1?

(c)     (code) Implement an RNN cell using the equations described above in the rnn_cell function of q2_rnn_cell.py. You can test your implementation by running python q2_rnn_cell.py test.

(d)   (code/written) Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, each sentence can be of arbitrary length, and this would cause the RNN to be unrolled a different number of times for different sentences, making it impossible to batch-process the data.

The most common way to address this problem is to pad our input with zeros. Suppose the longest sentence in our input is M tokens long; then, for an input of length T, we will need to:

1.   Add 0-vectors to x and y to make them M tokens long.

2.   Create a masking vector, (m(t))_{t=1}^{M}, which is 1 for all t ≤ T and 0 for all t > T. This masking vector will allow us to ignore the predictions that the network makes on the padded input.[3]

3.   Of course, by extending the input and output by M − T tokens, we might change our loss and hence gradient updates. In order to tackle this problem, we modify our loss using the masking vector:

J = Σ_{t=1}^{M} m(t) · CE(y(t), yˆ(t)).
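As a small illustration of this masked loss (not the starter code; the label values at padded positions are arbitrary since they are masked out):

import numpy as np

# Small illustration of the masked loss; padded steps (t > T) contribute nothing to J.
M, C, T = 6, 5, 4                                  # padded length, classes, true length
mask = (np.arange(M) < T).astype(float)            # m(t): [1, 1, 1, 1, 0, 0]
y = np.eye(C)[[1, 1, 0, 4, 0, 0]]                  # one-hot labels; padded entries arbitrary
y_hat = np.full((M, C), 1.0 / C)                   # dummy uniform predictions
per_step_ce = -np.sum(y * np.log(y_hat), axis=-1)  # CE(y(t), y^(t)) at each step
J = np.sum(mask * per_step_ce)                     # padded steps are masked out
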

i.      (written) How would the loss and gradient updates change if we did not use masking? How does masking solve this problem?

ii.    (code) Implement pad_sequences in your code. You can test your implementation by running python q2_rnn.py test1.

(e)    (code) Implement the rest of the RNN model assuming only fixed length input by appropriately completing functions in the RNNModel class. This will involve:

1.   Implementing the add_placeholders, add_embedding, add_training_op functions.

2.   Implementing the add_prediction_op operation that unrolls the RNN loop self.max_length times. Remember to reuse variables in your variable scope from the 2nd timestep onwards to share the RNN cell weights Wx and Wh across timesteps (a brief illustration of this scoping pattern appears after this part).

3.   Implementing add_loss_op to handle the mask vector returned in the previous part.

You can test your implementation by running python q2_rnn.py test2.
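For reference, the variable-sharing pattern mentioned above looks roughly like the following TensorFlow 1.x sketch. The names (rnn_step, "RNN", the shapes) are illustrative, not those of the starter code:

import tensorflow as tf

# Illustrative TF 1.x scoping pattern: the same Wx, Wh, b variables are
# created at the first timestep and reused on every later timestep.
def rnn_step(x_t, h_prev, input_size, hidden_size):
    Wx = tf.get_variable("Wx", shape=[input_size, hidden_size])
    Wh = tf.get_variable("Wh", shape=[hidden_size, hidden_size])
    b = tf.get_variable("b", shape=[hidden_size], initializer=tf.zeros_initializer())
    return tf.nn.sigmoid(tf.matmul(x_t, Wx) + tf.matmul(h_prev, Wh) + b)

max_length, batch, dim, hidden = 3, 2, 5, 4
x = tf.zeros([batch, max_length, dim])
h = tf.zeros([batch, hidden])
with tf.variable_scope("RNN"):
    for t in range(max_length):
        if t > 0:
            tf.get_variable_scope().reuse_variables()   # share Wx, Wh, b across steps
        h = rnn_step(x[:, t, :], h, dim, hidden)
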

(f)     (code) Train your model using the command python q2_rnn.py train. Training should take about 2 hours on your CPU and 10–20 minutes if you use the GPUs provided by Microsoft Azure. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/rnn/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model’s predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.

Finally, you can interact with your model using:

python q2_rnn.py shell -m results/rnn/<timestamp>/
(g)  (written)

i.     Describe at least 2 modeling limitations of this RNN model and support these conclusions using examples from your model’s output.

ii.   For each limitation, suggest some way you could extend the model to overcome the limitation.

3.     Grooving with GRUs
In class, we learned that a gated recurrent unit (GRU) is an improved RNN cell that greatly reduces the problem of vanishing gradients. Recall that a GRU is described by the following equations:

z(t) = σ(x(t) Uz + h(t−1) Wz + bz)

r(t) = σ(x(t) Ur + h(t−1) Wr + br)

h˜(t) = tanh(x(t) Uh + (r(t) ◦ h(t−1)) Wh + bh)

h(t) = z(t) ◦ h(t−1) + (1 − z(t)) ◦ h˜(t),

where z(t) is considered to be an update gate and r(t) is considered to be a reset gate.[4]
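For concreteness, here is a short numpy sketch of a single GRU step following these equations (the parameter names mirror the equations above, not the q3_gru_cell.py starter code; ◦ is the elementwise product):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GRU timestep following the equations above.
def gru_step(x_t, h_prev, Uz, Wz, bz, Ur, Wr, br, Uh, Wh, bh):
    z = sigmoid(x_t @ Uz + h_prev @ Wz + bz)               # update gate z(t)
    r = sigmoid(x_t @ Ur + h_prev @ Wr + br)               # reset gate r(t)
    h_tilde = np.tanh(x_t @ Uh + (r * h_prev) @ Wh + bh)   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                # new hidden state h(t)
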

Also, to keep the notation consistent with the GRU, for this problem, let the basic RNN cell be described by the equations:

h(t) = σ(x(t)Uh + h(t−1)Wh + bh).

To gain some intuition, let’s explore the behavior of the basic RNN cell and the GRU on some generated 1-D sequences.

(a)    (written) Modeling latching behavior. Let’s say we are given input sequences starting with a 1 or 0, followed by n 0s, e.g. 0, 1, 00, 10, 000, 100, etc. We would like our state h to continue to remember what the first character was, irrespective of how many 0s follow. This scenario can also be described as wanting the neural network to learn the following simple automaton:

[automaton diagram: from state 0, input x = 1 moves the state to 1 and x = 0 leaves it at 0; from state 1, both x = 0 and x = 1 keep the state at 1]

In other words, when the network sees a 1, it should change its state to also be a 1 and stay there.

In the following questions, assume that the state is initialized at 0 (i.e. h(0) = 0), and that all the parameters are scalars. Further, assume that all sigmoid activations and tanh activations are replaced by the indicator function:

σ(x) → 1 if x > 0, 0 otherwise

tanh(x) → 1 if x > 0, 0 otherwise.

i.       Identify values of wh, uh and bh for an RNN cell that would allow it to replicate the behavior described by the automaton above.

ii.     Let wr = ur = br = bz = bh = 0. Identify values of wz, uz, wh and uh for a GRU cell that would allow it to replicate the behavior described by the automaton above.

(b)    (written) Modeling toggling behavior. Now, let us try modeling a more interesting behavior. We are now given an arbitrary input sequence, and must produce an output sequence that switches from 0 to 1 and vice versa whenever it sees a 1 in the input. For example, the input sequence 00100100 should produce 00111000. This behavior could be described by the following automaton:

 

[automaton diagram: two states, 0 and 1; an input x = 1 toggles the state, while x = 0 leaves it unchanged]

Once again, assume that the state is initialized at 0 (i.e. h(0) = 0), that all the parameters are scalars, and that all sigmoid and tanh activations are replaced by the indicator function.

i.       (3 points) Show that a 1D RNN cannot replicate the behavior described by the automaton above.

ii.     (3 points) Let wr = ur = bz = bh = 0. Identify values of br, wz, uz, wh and uh for a GRU cell that would allow it to replicate the behavior described by the automaton above.

(c)     (code) Implement the GRU cell described above in q3_gru_cell.py. You can test your implementation by running python q3_gru_cell.py test.

(d)    (code) We will now use an RNN model to try and learn the latching behavior described in part (a) using TensorFlow’s RNN implementation: tf.nn.dynamic_rnn.

i.       In q3_gru.py, implement add_prediction_op by applying TensorFlow’s dynamic RNN model on the sequence input provided. Also apply a sigmoid function on the final state to normalize the state values between 0 and 1.

ii.     Next, write code to calculate the gradient norm and implement gradient clipping in add_training_op (a rough sketch of one common approach is given after this list).

iii.    Run the program:

python q3_gru.py predict -c [rnn|gru] [-g]
to generate a learning curve for this task for the RNN and GRU models. The -g flag activates gradient clipping.

These commands produce a plot of the learning dynamics in q3-noclip-<model>.png and q3-clip-<model>.png respectively.
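For reference, one common TensorFlow 1.x pattern for computing the gradient norm and clipping gradients looks roughly like the following; the function and argument names (build_training_op, lr, clip_gradients, max_grad_norm) are illustrative, not those of the starter code:

import tensorflow as tf

# Hedged sketch of a gradient-clipping training op (TF 1.x style).
def build_training_op(loss, lr=0.001, clip_gradients=True, max_grad_norm=5.0):
    optimizer = tf.train.AdamOptimizer(lr)
    # Drop variables with no gradient so the norm/clipping ops are well-defined.
    grads_and_vars = [(g, v) for g, v in optimizer.compute_gradients(loss)
                      if g is not None]
    grads, variables = zip(*grads_and_vars)
    grad_norm = tf.global_norm(grads)                       # norm before clipping
    if clip_gradients:
        grads, _ = tf.clip_by_global_norm(grads, max_grad_norm)
    train_op = optimizer.apply_gradients(list(zip(grads, variables)))
    return train_op, grad_norm
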

(e)   (written) Analyze the graphs obtained above and describe the learning dynamics you see. Make sure you address the following questions:

i.     Does either model experience vanishing or exploding gradients? If so, does gradient clipping help?

ii.   Which model does better? Can you explain why?

(f)     (code) Run the NER model from question 2 with the GRU cell, using the command: python q2_rnn.py train -c gru

Training should take about 3–4 hours on your CPU and about 30 minutes if you use the GPUs provided by Microsoft Azure. You should get a development F1 score of at least 85%.

The model and its output will be saved to results/gru/<timestamp>/, where <timestamp> is the date and time at which the program was run. The file results.txt contains formatted output of the model’s predictions on the development set, and the file log contains the printed output, i.e. confusion matrices and F1 scores computed during the training.


 
