$30
Recurrent Neural Network
1. [Vanilla RNN for parity function]
Let us define a sequence parity function as a function that takes in a sequence of binary inputs and returns a sequence indicating the number of 1’s in the input so far; specifically, if at time t the 1’s in the input so far is even it returns 1, and 0 if it is odd. For example, given input sequence [0,1,0,1,1,0], the parity sequence is [1,0,0,1,0,0]. Let xt denote the input at time t and yt be the boolean output of this parity function at time t.
Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details about activations with different input at xt (0 or 1).
Hint: Recall that we have implemented AND and OR functions in problem set 2, which may be useful here.
2. [LSTM for parity function]
Let us recall different gates in an LSTM Network. First gate is the "forget gate layer":
ft = σ(Wf.[ht−1,xt] + bf) (1)
where ft is the output of forget gate, Wf is the weight matrix, ht−1 is the hidden state of step t-1, xt is the current input and bt is the bias value. Next we have "input gate layer":
it = σ(Wi.[ht−1,xt] + bi) (2) C˜t = tanh(WC.[ht−1,xt] + bC) (3)
where it decides which values we will update and C˜t are the new candidate values that could be added to the cell state. Next we have new cell state candidate values:
Ct = ft ∗ Ct−1 + it ∗ C˜t (4)
Finally, we have the output and hidden state
ot = σ(Wo.[ht−1,xt] + bo) (5)
ht = ot ∗ tanh(Ct) (6)
Design an LSTM Network for the bit parity problem mentioned in Question 1. Specifically, provide values for Wf, bf, Wi, bi, WC, bC, Wo and bo such that the cell state Ct store the parity of bit string. Please mention any assumptions you make. For this problem, you can assume below for Sigmoid and tanh function:
(
1, if x 0
σ(x) = (7)
0, otherwise
1,
tanh(x) = 0,
−1,
if x 0 x = 0
if x < 0
(8)
Hint: Recall that XOR of x and y can be represented as (x ∧ y¯) ∨ (x¯ ∧ y). Think about how you can leverage this information for equation (4).
3. [When to stop in beam search]
Beam Search is a technique widely used to improve the output text quality of neural text generation. But it is difficult to decide when to stop beam search to obtain optimality because hypotheses can finish in different steps. Assume an input x from which we generate an output sentence y which is a completed hypothesis:
y∗ = argmax Y p(yi|x,y<i) (9)
y:comp(y) i≤|y|
where y<i is a shorthand notation of y0y1...yi−1 and comp(y) is shorthand for completed hypothesis. We say that a hypothesis y is completed (comp(y)), if its last word is </s, i.e.,
def
comp(y) = (y|y| = </s)
in which case it will not be further expanded.
Given Bi−1 is an b length ordered list, consisting of pairs hy,si of the current prefix y and the accumulated score (product of probabilities) s , we use beam search to approximately find the best output y∗. Beam search can be defined as an operator topb that selects the top b scoring items:
(10)
Let us define a beam search algorithm that will always return the optimal score complete hypothesis (modulo beam size) and finish as soon as optimality is reached. Let best≤i be the best completed hypothesis so far up to step i, i.e.
∆ n o
best≤i=max y ∈ ∪j≤iBj|comp(y) (11)
We update it every time we find a completed hypothesis (if there is none yet, then it remains undefined). Given that at any step i, if best≤i is defined and the highest scoring item in the current beam Bi scores worse than or equal to best≤i, prove that the current best completed hypothesis (obtained from best≤i) is the overall highest-probability completed hypothesis and future steps will be no better than the best completed hypothesis.
4. [Exploding Gradients]
Learning long-term dependencies in recurrent networks suffers from a particular numerical challenge – gradients propagated over many time-steps tend to either ‘vanish’ (i.e. converge to 0, frequently) or ‘explode’ (ı.e. diverge to infinity; rarely, but with more damage to the optimization). To study this problem in a simple setting, consider the following recurrence relation without any nonlinear activation function or input x:
ht = Wht−1 (12)
where W is a weight sharing matrix for recurrent relation at any time t. Let λ1,...,λn be the eigenvalues of the weight matrix W ∈ Cn×n. Its spectral radius ρ(W) is defined as:
ρ(W) = max{|λ1|,...,|λn|} (13)
Assuming the initial hidden state is h0, write the relation between hT and h0 and explain the role of the eigenvalues of W in determining the ‘vanishing’ or ‘exploding’ property as
5. [Paper Review]
Explainability is an increasingly important area of deep learning research today. The following paper, from CVPR 2018, introduces a unique multi-modal approach to explainability by joint textual rationale generation and attention visualization in the task of visual question answering (VQA). This is a major improvement over pre-existing unimodal explainable models, with the results showing complementary explanatory strengths from the visual and textual explanations. The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.
2 Coding: Sequence models for image captioning
The coding part of this assignment will consist of implementation of sequence models for captioning images