Starting from:

$30

CS114-Programming Assignment 1 Finite State Transducers Solved

This assignment is on finite state transducers and will require you to create transducers in Python. There are two problems, and we have provided some code to get you started, as well as a few utilities that will make it easier for you to debug and test your code.

Some minimal test cases are provided as well. Before you turn in your work, you should at least pass all of the test cases we provided although it is neither sufficient nor necessary for getting full credit. You should add your own test cases to help yourself debug your program.


Problem 1: Soundex[1]
The Soundex algorithm is a phonetic algorithm commonly used by libraries and the Census Bureau to represent people’s names as they are pronounced in English. It has the advantage that name variations with minor spelling differences will map to the same representation, as long as they have the same pronunciation in English. Here is how the algorithm works:

Step 1: Retain the first letter of the name. This may be uppercased or lowercased.

Step 2: Remove all non-initial occurrences of the following letters: a, e, h, i, o, u, w, y. (To clarify, this step removes all occurrences of the given characters except when they occur in the first position.)

Step 3: Replace the remaining letters (except the first) with numbers:

b, f, p, v→ 1 c, g, j, k, q, s, x, z→ 2 d, t→ 3 l→ 4 m, n→ 5 r→ 6

If two or more letters from the same number group were adjacent in the original name, then only replace the first of those letters with the corresponding number and ignore the others. This rule also applies at the beginning of the name: retain the first letter, but ignore the others.

Step 4: If there are more than 3 digits in the resulting output, then drop the extra ones.

Step 5: If there are less than 3 digits, then pad at the end with the required number of trailing zeros.

The final output of applying Soundex algorithm to any input string should be of the form Letter Digit Digit Digit. Table 1 shows the output of the Soundex algorithm for some example names.

Input
Output
Jurafsky
J612
Jarovski
J612
Resnik
R252
Reznick
R252
Euler
E460
Peterson
P362
Ashcroft
A226
Pfister
P236
Table 1: Example outputs for the Soundex algorithm.

Construct an fst that implements the Soundex algorithm. Obviously, it is non-trivial to implement a single transducer for the entire algorithm. Therefore, the strategy we will adopt is a bottom-up one: implement multiple transducers, each performing a simpler task, and then compose them together to get the final output. One possibility is to partition the algorithm across three transducers:

1.   Transducer 1: (letters_to_numbers) Performs steps 1-3 of the algorithm, i.e, retaining the first letter, removing letters and replacing letters with numbers.

2.   Transducer 2: (truncate_to_three_digits) Performs step 4 of the algorithm, i.e., truncating extra digits.

3.   Transducer 3: (add_zero_padding) Performs step 5 of the algorithm, i.e., padding with zeros if required.

Note that each of these three transducers will have characters as input/output symbols.

To make things easier for you, we have provided the file soundex.py which is where you will write your code. It already imports all needed modules and functions (including fsmutils.py). It also creates three transducer objects—as dictated by the bottom-up strategy outlined above—such that all you should have to do is to figure out the states and arcs required by each transducer. It also contains code that allows you to input a single name on the command line to get the output.

Note that while we have provided you with sample unit tests containing some names, it might be very useful to test your code on other names. For comparison purposes, you may use one of the many Soundex calculators available online to create more test cases. (Note that some calculators may use additional rules; you only need to implement the rules listed above.)

Problem 2: English Morphology
English often requires some spelling changes at the morpheme boundary between the stem and the affixes that indicate the inflection of the verb. Specifically, we will build a finite state transducer to handle the K-insertion rule in English.

If a verb ends with vowel + c, then we add k before adding in the -ed and -ing suffixes. For example:

panicked ↔ panic + ed ↔ panic+past form panicking ↔ panic + ing ↔ panic+present participle form lick ↔ lick + ed ↔ lick+past form lick ↔ lick + ing ↔ lick+present participle form

As you can see in the examples, the morphological parsing process can be done in two steps (although you may skip the intermediate step). So you might want to (but don’t have to) build two separate FSTs and compose them. The lexicon FST only has to include two inflectional morphemes (-ed and -ing) and five verbs: want, sync, panic, havoc, and lick. The other FST will handle the actual K-insertion rule.

The parser FST will be used as a morphological parser to transduce the surface form (e.g panicked) to the lexical form (panic + past form). The generator FST will be used to transduce the lexical form to the surface form.


 
[1] This problem is adapted from Jordan Boyd-Graber’s course at UMD.

More products