CSE 447 Assignment 4

1        CSE 447 and CSE 517 Students: Implementation Problem

This tarball includes two files:

•    15pctmasked.txt contains single sentences, one per line (i.e., newline characters delimit sentences), each with some of its characters “masked”; characters are delimited by whitespace, masked characters are replaced with “<mask>”, spaces are denoted “<s>”, and the end-of-sequence symbol is “<eos>”

•    lm.txt contains a bigram language model over characters; each line has the form “a b<TAB>p(b | a)” (the two characters separated by a single space, then a tab, then the probability), with the special space and end-of-sequence characters denoted as above.

Your job is to decode the sentences, replacing each “masked” character with a character from the bigram model’s vocabulary. For input x, your output y should be the sequence that maximizes p(y) under the bigram language model, subject to the constraint that, for every position i where x_i is not masked, y_i = x_i. (In plain English, we want you to find the most likely unmasked sequence consistent with the input, under the model we gave you.) Hint: you should adapt the Viterbi algorithm to solve this problem exactly in polynomial time and space.

Your output file should be in the same format as the input, but with every mask token replaced as described above. Your writeup should explain how you generate a score for each possible choice and how you recover the best sequence. You may use existing libraries and tools for this problem (be sure to acknowledge them in your solution), but we do not expect they will be very helpful.
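
One possible way to set this up in Python is sketched below; it is only an illustration, not a reference solution. The helper names, the output file name unmasked.txt, and the choice to leave the first character of each line unscored (only bigram transitions are scored) are assumptions of the sketch, not part of the assignment.

```python
import math
from collections import defaultdict

def load_bigram_lm(path):
    """Parse lm.txt lines of the form 'a b<TAB>p(b | a)' into log-probabilities."""
    logp = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair, prob = line.rstrip("\n").split("\t")
            a, b = pair.split(" ")
            logp[a][b] = math.log(float(prob))
    return logp

def decode_line(tokens, logp, vocab):
    """Constrained Viterbi: fill every '<mask>' so the total bigram log-score is maximal."""
    NEG_INF = float("-inf")
    # Candidate characters per position: the full vocabulary where masked, fixed otherwise.
    cands = [vocab if t == "<mask>" else [t] for t in tokens]

    best = {c: 0.0 for c in cands[0]}   # best prefix score ending in c (position 0 left unscored)
    back = []                           # back[i-1][c] = best predecessor of c at position i
    for i in range(1, len(tokens)):
        new_best, ptr = {}, {}
        for c in cands[i]:
            # pick the predecessor maximizing prefix score + log p(c | predecessor)
            prev, score = max(
                ((p, s + logp[p].get(c, NEG_INF)) for p, s in best.items()),
                key=lambda kv: kv[1],
            )
            new_best[c], ptr[c] = score, prev
        best = new_best
        back.append(ptr)

    # Follow backpointers from the best final character.
    out = [max(best, key=best.get)]
    for ptr in reversed(back):
        out.append(ptr[out[-1]])
    return list(reversed(out))

if __name__ == "__main__":
    lm = load_bigram_lm("lm.txt")
    vocab = sorted(lm.keys())           # assumes every character appears as a context in lm.txt
    with open("15pctmasked.txt", encoding="utf-8") as fin, \
         open("unmasked.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(decode_line(line.split(), lm, vocab)) + "\n")
```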

2        CSE 447 and CSE 517 Students: Based on Eisenstein 9.3 (p. 213)

Construct (i.e., draw on the page; a TA recommends this tool) a finite-state automaton in the style of Figures 9.2 and 9.3 (pp. 187–8), which handles the following examples:

•    nation/N, national/ADJ, nationalize/V, nationalizer/N

•    America/N, American/ADJ, Americanize/V, Americanizer/N

Be sure that your FSA does not accept any further derivations, such as *nationalizeral and *Americanizern.
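
If it helps to check a drawing, an FSA like this can also be written down as a transition table and run over morpheme sequences. The sketch below is only an illustration; the state names and the morpheme segmentation are assumptions of the sketch, not a required part of the answer.

```python
# Morpheme-level FSA sketch (one possible design): separate root states keep
# "nation"-suffixes and "America"-suffixes from mixing, and there is no arc out
# of the agentive-noun state, so over-derived forms are rejected.
TRANSITIONS = {
    ("q0", "nation"):  "qNat",    # nation/N
    ("q0", "America"): "qAm",     # America/N
    ("qNat", "al"):    "qAdj",    # national/ADJ
    ("qAm", "n"):      "qAdj",    # American/ADJ
    ("qAdj", "ize"):   "qV",      # nationalize, Americanize /V
    ("qV", "r"):       "qAgent",  # nationalizer, Americanizer /N
}
ACCEPT = {"qNat": "N", "qAm": "N", "qAdj": "ADJ", "qV": "V", "qAgent": "N"}

def tag(morphemes):
    """Return the POS tag if the FSA accepts the morpheme sequence, else None."""
    state = "q0"
    for m in morphemes:
        if (state, m) not in TRANSITIONS:
            return None              # no arc: reject (e.g. *nationalizeral)
        state = TRANSITIONS[(state, m)]
    return ACCEPT.get(state)

print(tag(["nation", "al", "ize", "r"]))         # N   (nationalizer)
print(tag(["America", "n"]))                     # ADJ (American)
print(tag(["America", "n", "ize", "r", "n"]))    # None (*Americanizern)
```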

3        CSE 447 and CSE 517 Students: Viterbi for Second-Order Models

At the end of the CRF lecture (slide 106), we proposed a second-order sequence model that defines scores of triplets of labels, s(x, i, y_i, y_{i+1}, y_{i+2}). The problem to be solved when we decode with this model is:

    ŷ = argmax_{y ∈ L^ℓ}  Σ_{i=0}^{ℓ−1} s(x, i, y_i, y_{i+1}, y_{i+2})                                        (1)

Note that there are ℓ terms in the sum. y_0 (which is a special START label) appears only in the first term (i = 0) and y_{ℓ+1} (which is a special STOP label) appears only in the last term (i = ℓ − 1). In a conventional trigram-style model, we might also have “s(x, −1, y_{−1} = START, y_0 = START, y_1)” in the summation, but including such a term doesn’t add any expressive power to what is written above.

The Viterbi algorithm can be adapted for this model. The form of the algorithm is nearly identical to what we saw in lecture. First, prefix scores are computed “left to right,” each using a formula with a local max, and along with each prefix score we also store a backpointer recording the “argmax.” Then the backpointers are followed “right to left” to select a label for each word in turn. Here is most of the algorithm:

Input: scores s(x, i, y′′, y′, y), ∀ i ∈ ⟨0, …, ℓ⟩, ∀ y′′ ∈ L, ∀ y′ ∈ L, ∀ y ∈ L

Output: ŷ

1.    Base case: ♥_2(y′, y) = s(x, 0, START, y′, y)

(Note that the subscript in “♥_2” refers to the last position in the triple, here, the word labeled “y,” while the “0” argument to s refers to the first position in the triple.)

2.    For i ∈ ⟨3, …, ℓ − 1⟩:

•      Solve for ♥_i(∗, ∗) and b_i(∗, ∗).

3.    Special case for the end (note a small correction in this line relative to an earlier version of the assignment, which included an equality on the left to “♥_{ℓ−1}(y′, y)” that should not have been there):

    ♥_ℓ(y′, y) = s(x, ℓ − 1, y′, y, STOP) + max_{y′′ ∈ L} [ s(x, ℓ − 2, y′′, y′, y) + ♥_{ℓ−1}(y′′, y′) ]

(The bracketed max is the quantity whose argmax, over y′′, is stored as b_ℓ(y′, y).)

The above looks a little different from what you saw in lecture, where we calculated “♥_{ℓ+1}(STOP).” Writing it as we’ve done here is just a slightly different choice that gives you a little more of a hint. To see why, take a close look at slide 70.

4.    (ŷ_{ℓ−1}, ŷ_ℓ) ← argmax_{(y′, y) ∈ L²} ♥_ℓ(y′, y)

5.    For i ∈ ⟨ℓ − 2, …, 1⟩:

•      ŷ_i ← b_{i+2}(ŷ_{i+1}, ŷ_{i+2})

Deliverables:

1.    Give the recurrence for ♥_i(y′, y) and for b_i(y′, y) (the quantity solved for in the step-2 bullet of the algorithm above). We’ve given you the base case and the special case for the end of the sequence.

2.    Analyze the runtime and space requirements of this algorithm as functions of |L| (the number of labels) and ℓ (the length of the input sequence x and the output sequence ŷ).
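
For readers who find code easier to follow than pseudocode, here is a minimal sketch of the whole procedure. It assumes the scores are available as a function score(i, a, b, c) returning s(x, i, y_i = a, y_{i+1} = b, y_{i+2} = c), that labels is the set L (without START/STOP), and that ℓ ≥ 3, as the skeleton above implicitly does. The step-2 recurrence inside the loop is one possible filling-in, not the official answer; heart and back play the roles of the ♥ and b tables.

```python
import itertools

def viterbi_second_order(score, labels, ell, START="<START>", STOP="<STOP>"):
    L = list(labels)
    # heart[i][(yp, y)] = best score of a label prefix whose last two labels are
    # (y_{i-1}, y_i) = (yp, y); back[i][(yp, y)] = the argmax choice of y_{i-2}.
    heart = {2: {(yp, y): score(0, START, yp, y) for yp in L for y in L}}
    back = {}

    # Step 2 (one possible recurrence): i = 3 .. ell-1
    for i in range(3, ell):
        heart[i], back[i] = {}, {}
        for yp, y in itertools.product(L, L):
            ypp, val = max(
                ((c, score(i - 2, c, yp, y) + heart[i - 1][(c, yp)]) for c in L),
                key=lambda t: t[1],
            )
            heart[i][(yp, y)], back[i][(yp, y)] = val, ypp

    # Step 3: fold the STOP term into the final table.
    heart[ell], back[ell] = {}, {}
    for yp, y in itertools.product(L, L):
        ypp, val = max(
            ((c, score(ell - 2, c, yp, y) + heart[ell - 1][(c, yp)]) for c in L),
            key=lambda t: t[1],
        )
        heart[ell][(yp, y)] = score(ell - 1, yp, y, STOP) + val
        back[ell][(yp, y)] = ypp

    # Steps 4-5: read off the best last pair, then follow backpointers right to left.
    y_hat = [None] * (ell + 1)                                   # positions 1..ell
    y_hat[ell - 1], y_hat[ell] = max(heart[ell], key=heart[ell].get)
    for i in range(ell - 2, 0, -1):
        y_hat[i] = back[i + 2][(y_hat[i + 1], y_hat[i + 2])]
    return y_hat[1:]
```

The tables are indexed exactly as in the skeleton: heart[i] corresponds to ♥_i and back[i] to b_i, so step 5’s ŷ_i = b_{i+2}(ŷ_{i+1}, ŷ_{i+2}) appears verbatim in the final loop.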

4         CSE 517 Students: Introducing PCFGs

In this problem, you’ll become acquainted with probabilistic context-free grammars (PCFGs), a more powerful family of models that can be used to parse sequences of words into trees. Chapters 9 and 10 in the textbook give extensive motivation for these kinds of models, as well as algorithms for their use. This problem will reveal a connection between PCFGs and HMMs. A PCFG is defined as:

•    A finite set of “nonterminal” symbols 𝒩

–    A start symbol S ∈ 𝒩

•    A finite alphabet Σ of “terminal” symbols, disjoint from 𝒩

•    A set R of production rules, each of the form “N → α”, where

–    The lefthand side N is a nonterminal from 𝒩

–    The righthand side α is a sequence of zero or more terminals and/or nonterminals: α ∈ (𝒩 ∪ Σ)∗

∗ Special case: Chomsky normal form constrains α to be either a single terminal symbol or two nonterminals

•    For each N ∈ 𝒩, a probability distribution over the rules where N is the lefthand side, p(∗ | N). We will use p_r to denote the conditional probability corresponding to rule r ∈ R.

A PCFG defines the probability of a tree t as:

    p(t) = ∏_{r ∈ R} p_r^{count_t(r)}                                        (2)

where count_t(r) is the number of times rule r is used in the tree t.
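
As a quick illustration of Eq. (2): the probability of a tree is just the product of the probabilities of the rules it uses, with multiplicity. The toy grammar and derivation in the sketch below are invented purely for illustration.

```python
import math
from collections import Counter

# Toy grammar (invented for illustration): rule probabilities sum to 1 per lefthand side.
rule_prob = {
    ("S", ("NP", "VP")): 0.9,
    ("S", ("VP",)):      0.1,
    ("NP", ("she",)):    1.0,
    ("VP", ("runs",)):   1.0,
}

def tree_log_prob(rule_uses):
    """Apply Eq. (2): sum count_t(r) * log p_r over the rules used in the tree."""
    counts = Counter(rule_uses)                       # count_t(r)
    return sum(n * math.log(rule_prob[r]) for r, n in counts.items())

# Derivation t of "she runs": S -> NP VP, NP -> she, VP -> runs
t = [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("runs",))]
print(math.exp(tree_log_prob(t)))                     # ≈ 0.9 (= 0.9 * 1.0 * 1.0)
```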

The definition of an HMM was given in class; we spell it out more carefully here.

•    A finite set of states L (often referred to as “labels”)

–  A probability distribution over initial states, p_start

–  One state, denoted 8, is the special stopping state

•    A finite set of observable symbols V

•    For each state y ∈ L:

–  An “emission” probability distribution over V, p_emission

–  A “transition” probability distribution over next states, p_transition

An HMM defines the probability distribution of a state sequence y and an emission sequence x (length n, excluding the stop symbol) as shown on slide 36 in the lecture.
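
The slide itself isn’t reproduced here, but the factorization it refers to is presumably the standard one: a start factor, then an emission and a transition factor per position, and finally a transition into the stopping state. The sketch below spells that out; the function and argument names, and the distributions being plain nested dictionaries, are assumptions of the sketch.

```python
import math

def hmm_log_joint(x, y, p_start, p_emission, p_transition, stop_state="<stop>"):
    """log p(x, y) for emissions x = (x_1, ..., x_n) and states y = (y_1, ..., y_n)."""
    assert len(x) == len(y) and len(y) > 0
    lp = math.log(p_start[y[0]])                              # initial state
    for i in range(len(x)):
        lp += math.log(p_emission[y[i]][x[i]])                # emit x_i from state y_i
        nxt = y[i + 1] if i + 1 < len(y) else stop_state      # next state, or the stopping state
        lp += math.log(p_transition[y[i]][nxt])
    return lp
```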

1.    Demonstrate how to implement any HMM using a PCFG. Do this by showing formally how to construct each element of the PCFG (𝒩, S, Σ, R, and p) from the HMM. Hint: think of a state sequence as a right-branching tree; the nonterminals will correspond to HMM states/labels, and you will probably want to add a new nonterminal to serve as the start state S.

2.    Transform the grammar you defined above into Chomsky normal form. Every tree derived by the original grammar must have a corresponding tree in the new grammar, with the same probability. Hint: this might be easier if you add a special “start” word to the beginning of every sequence recognized by the new grammar, and you will likely need to add some new nonterminals as well. On page 202 of your textbook, the transformation is sketched out for arbitrary context-free grammars that are not probabilistic; to fully complete this exercise, you need to handle the probabilities as well as the rules, but you only need to show how to do it for your HMM-derived PCFG. Try to give a more precise definition of the transformation than is given in the textbook.

3.    Explain how to transform your new PCFG’s derivations back into derivations under the original PCFG.
