In this assignment, you will experiment with different aspects of modeling, learning, and inference with chain-structured conditional random fields (CRFs). This assignment focuses on the task of optical character recognition (OCR). We will explore an approach that bridges computer vision and natural language processing by jointly modeling the labels of sequences of noisy character images that form complete words. This is a natural problem for chain-structured CRFs. The node potentials can capture bottom-up information about the character represented by each image, while the edge potentials can capture information about the co-occurrence of characters in adjacent positions within a word.
Data: The underlying data are a set of $N$ sequences corresponding to images of the characters in individual words. Each word $i$ consists of $L_i$ positions. For each position $j$ in word $i$, we have a noisy binary image of the character in that position. In this assignment, we will use the raw pixel values of the character images as features in the CRF. The character images are 20 × 16 pixels. We convert them into 1 × 320 vectors and include a constant bias feature along with the pixels in each image, giving a final feature vector of length $F = 321$. $x_{ijf}$ indicates the value of feature $f$ in position $j$ of word $i$. The provided training and test files train_img<i>.txt and test_img<i>.txt list the character image $\mathbf{x}_{ij}$ on row $j$ of file $i$ as a 321-long, space-separated sequence.[1] The data files are in column-major format. Given the sequence of character images $\mathbf{x}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iL_i}]$ corresponding to test word $i$, your goal is to infer the corresponding sequence of character labels $\mathbf{y}_i = [y_{i1}, \ldots, y_{iL_i}]$. There are $C = 26$ possible labels corresponding to the lowercase letters "a" to "z". The character labels for each training and test word are available in the files train_words.txt and test_words.txt. The figure below shows several example words along with their images.
[Figure: noisy character-image sequences for the example words "shoot", "indoor", "threee", and "trait".]
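For concreteness, a minimal loading sketch is shown below. The file paths and helper names are illustrative assumptions based on the description above, not part of the required interface; the 321-feature rows are read directly with NumPy.

```python
import numpy as np

F = 321  # 320 pixel features plus one constant bias feature

def load_word_images(path):
    """Load one word's character images as an (L_i, F) array, one row per position."""
    X = np.loadtxt(path, ndmin=2)
    assert X.shape[1] == F, f"expected {F} features per row, got {X.shape[1]}"
    return X

def load_words(path):
    """Load the label file: one word (a string of lowercase letters) per line."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

# Hypothetical usage:
# X1 = load_word_images("data/train_img1.txt")            # shape (L_1, 321)
# words = load_words("data/train_words.txt")              # e.g. words[0] == "shoot"
# y1 = np.array([ord(ch) - ord("a") for ch in words[0]])  # labels 0..25
```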
Model: The conditional random field model is a conditional model $P_W(\mathbf{y}_i \mid \mathbf{x}_i)$ of the sequence of class labels $\mathbf{y}_i$ given the sequence of feature vectors $\mathbf{x}_i$ that depends on a collection of parameters $W$. The CRF graphical model is shown below for a sequence of length 4.
[Figure: chain-structured conditional random field graphical model for a sequence of length 4.]
The probabilistic model for the CRF we use in this assignment is given below. The CRF model contains one feature parameter $W^F_{cf}$ for each of the $C$ labels and $F$ features. The feature parameters encode the compatibility between feature values and labels. The CRF also contains one transition parameter $W^T_{cc'}$ for each pair of labels $c$ and $c'$. The transition parameters encode the compatibility between adjacent labels in the sequence. We parameterize the model in log-space, so all of the parameters can take arbitrary (positive or negative) real values. We have one feature potential $\phi^F_j(y_{ij}, \mathbf{x}_{ij})$ for each position $j$ in sequence $i$ and one transition potential $\phi^T_j(y_{ij}, y_{ij+1})$ for each pair of adjacent labels in sequence $i$.
$$\phi^F_j(y_{ij}, \mathbf{x}_{ij}) = \sum_{c=1}^{C} \sum_{f=1}^{F} W^F_{cf}\,[y_{ij} = c]\, x_{ijf}$$

$$\phi^T_j(y_{ij}, y_{ij+1}) = \sum_{c=1}^{C} \sum_{c'=1}^{C} W^T_{cc'}\,[y_{ij} = c]\,[y_{ij+1} = c']$$
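Because both potentials are linear in the parameters, they can be evaluated for all labels at once with a couple of matrix operations. The sketch below is a minimal NumPy illustration of the definitions above; the function names and array layouts ($W^F$ as C × F, $W^T$ as C × C) are assumptions, not the assignment's required interface.

```python
import numpy as np

def feature_potentials(X, W_F):
    """phi^F_j(c, x_j) for every position j and label c.
    X: (L, F) feature vectors for one word; W_F: (C, F) feature parameters.
    Returns an (L, C) matrix whose [j, c] entry is sum_f W_F[c, f] * X[j, f]."""
    return X @ W_F.T

def transition_potentials(W_T):
    """phi^T is the same at every position: entry [c, c'] is simply W_T[c, c']."""
    return W_T
```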
Given this collection of potentials, the joint energy function on $\mathbf{x}_i$ and $\mathbf{y}_i$ is defined below.

$$\begin{aligned}
E_W(\mathbf{y}_i, \mathbf{x}_i) &= -\left( \sum_{j=1}^{L_i} \phi^F_j(y_{ij}, \mathbf{x}_{ij}) + \sum_{j=1}^{L_i-1} \phi^T_j(y_{ij}, y_{ij+1}) \right) \\
&= -\left( \sum_{j=1}^{L_i} \sum_{c=1}^{C} \sum_{f=1}^{F} W^F_{cf}\,[y_{ij} = c]\, x_{ijf} + \sum_{j=1}^{L_i-1} \sum_{c=1}^{C} \sum_{c'=1}^{C} W^T_{cc'}\,[y_{ij} = c]\,[y_{ij+1} = c'] \right)
\end{aligned}$$
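For a fully labeled sequence, the indicator functions simply pick out one feature potential per position and one transition parameter per adjacent pair. A minimal sketch of this computation, under the same assumed array layouts as above (not the required energy interface):

```python
import numpy as np

def sequence_energy(y, X, W_F, W_T):
    """Joint energy E_W(y, x) for one labeled word.
    y: length-L integer labels in 0..C-1; X: (L, F); W_F: (C, F); W_T: (C, C)."""
    node = (X @ W_F.T)[np.arange(len(y)), y].sum()  # sum_j phi^F_j(y_j, x_j)
    trans = W_T[y[:-1], y[1:]].sum()                # sum_j phi^T_j(y_j, y_{j+1})
    return -(node + trans)
```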
The joint probability of $\mathbf{y}_i$ and $\mathbf{x}_i$ is given by the Gibbs distribution for the model.

$$P_W(\mathbf{y}_i, \mathbf{x}_i) = \frac{\exp(-E_W(\mathbf{y}_i, \mathbf{x}_i))}{\sum_{\mathbf{y}} \sum_{\mathbf{x}} \exp(-E_W(\mathbf{y}, \mathbf{x}))}$$
However, as the name implies, a conditional random field is not trained to maximize the joint likelihood of $\mathbf{x}$ and $\mathbf{y}$. Instead, the model is trained to maximize the conditional likelihood of $\mathbf{y}$ given $\mathbf{x}$, similar to a discriminative classifier like logistic regression. The conditional probability of $\mathbf{y}$ given $\mathbf{x}$ is shown below. Note that the partition function $Z_W(\mathbf{x}_i)$ that results from conditioning on a sequence of feature vectors $\mathbf{x}_i$ will generally be different for each sequence $\mathbf{x}_i$.
$$P_W(\mathbf{y}_i \mid \mathbf{x}_i) = \frac{\exp(-E_W(\mathbf{y}_i, \mathbf{x}_i))}{Z_W(\mathbf{x}_i)}, \qquad Z_W(\mathbf{x}_i) = \sum_{\mathbf{y}} \exp(-E_W(\mathbf{y}, \mathbf{x}_i))$$
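For very short sequences, this conditional distribution can be checked by brute force: enumerate all $C^L$ label sequences, normalize, and compare against the efficient inference code developed in Question 2. The sketch below is only such a correctness check (it is exponential in $L$), written under the same assumed array layouts as the earlier sketches.

```python
import itertools
import numpy as np

def conditional_logprob_bruteforce(y, X, W_F, W_T):
    """log P_W(y | x) by enumerating all C^L label sequences (toy-sized L only)."""
    C, L = W_F.shape[0], X.shape[0]
    node = X @ W_F.T                                     # (L, C) feature potentials

    def score(labels):                                   # -E_W(labels, x)
        labels = np.asarray(labels)
        s = node[np.arange(L), labels].sum()
        if L > 1:
            s += W_T[labels[:-1], labels[1:]].sum()
        return s

    all_scores = [score(lab) for lab in itertools.product(range(C), repeat=L)]
    log_Z = np.logaddexp.reduce(np.array(all_scores))    # log sum_y exp(-E_W(y, x))
    return score(np.asarray(y)) - log_Z
```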
1. Basics: To begin, implement the following basic methods. While we will only experiment with one data set, your code should be written to work with any label set and any number of features. (One possible flat parameter packing is sketched after this question for reference.)
1.1. Implement the function get_params, which returns the current model parameters.
1.2. Implement the function set_params, which sets the current model parameters.
1.3. Implement the function energy, which computes the joint energy of a label sequence y and a feature sequence x.
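One possible convention for get_params and set_params is to expose all parameters as a single flat vector, which also works well with the numerical optimizer used later in Question 3.7. The packing below ($W^F$ flattened, followed by $W^T$) is only an assumed convention; any consistent ordering works.

```python
import numpy as np

def pack_params(W_F, W_T):
    """Flatten W_F (C x F) and W_T (C x C) into one parameter vector."""
    return np.concatenate([W_F.ravel(), W_T.ravel()])

def unpack_params(w, C, F):
    """Invert pack_params: recover W_F and W_T from the flat vector."""
    W_F = w[:C * F].reshape(C, F)
    W_T = w[C * F:].reshape(C, C)
    return W_F, W_T
```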
2. Inference: Efficient inference is the key to both learning and prediction in CRFs. In this question you will describe and implement inference methods for chain-structured CRFs.
2.1. Explain how factor reduction and the log-space sum-product message passing algorithm can be combined to enable efficient inference for the single-node distributions $P_W(y_j \mid \mathbf{x})$ and the pairwise distributions $P_W(y_j, y_{j+1} \mid \mathbf{x})$. These distributions are technically conditional marginal distributions, but since we are always conditioning on $\mathbf{x}$ in a CRF, we will simply refer to them as marginal and pairwise marginal distributions. Your solution to this question must have linear complexity in the length of the input, and should not numerically underflow or overflow even for long sequences and many labels. (report)
2.2. Implement the function log_Z, which computes the log partition function for the distribution $P_W(\mathbf{y} \mid \mathbf{x})$; a reference sketch of a log-space forward recursion appears after this question. (code)
2.3. Implement the function predict_logprob, which computes the individual marginals $P_W(y_j \mid \mathbf{x})$ and pairwise marginals $P_W(y_j, y_{j+1} \mid \mathbf{x})$ for each position in the sequence. (code)
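For reference, one standard way to keep the forward recursion numerically stable is to carry messages in log space and combine them with logsumexp. The sketch below illustrates this idea for the partition function only; the variable names and the (L × C) node-potential layout are assumptions, and your log_Z must follow the assignment's own interface.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(node_pot, W_T):
    """log Z_W(x) via a log-space forward recursion.
    node_pot: (L, C) matrix of feature potentials phi^F_j(c, x_j); W_T: (C, C)."""
    alpha = node_pot[0].copy()                       # log forward message at position 0
    for j in range(1, node_pot.shape[0]):
        # alpha_new[c'] = logsumexp_c(alpha[c] + W_T[c, c']) + node_pot[j, c']
        alpha = logsumexp(alpha[:, None] + W_T, axis=0) + node_pot[j]
    return logsumexp(alpha)                          # linear in L, stable in log space
```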
3. Learning: In this problem, you will derive the maximum likelihood learning algorithm for chain-structured conditional random field models. Again, this algorithm maximizes the average conditional log likelihood function, not the average joint log likelihood, but the learning approach is still typically referred to as maximum likelihood.
3.1. Write down the average conditional log likelihood function for the CRF given a data set consisting of $N$ input sequences $\mathbf{x}_i$ and label sequences $\mathbf{y}_i$ in terms of the parameters and the data. (report)
3.2. Derive the derivative of the average conditional log likelihood function with respect to the feature parameter $W^F_{cf}$. Show your work. (report)
3.3. Derive the derivative of the average conditional log likelihood function with respect to the transition parameter $W^T_{cc'}$. Show your work. (report)
3.4. Explain how the average conditional log likelihood function and its gradients can be efficiently computed using the inference method you developed in the previous question. (report)
3.5. Implement the function log_likelihood to efficiently compute the average conditional log likelihood given a set of labeled input sequences. (code)
3.6. Implement the function gradient_log_likelihood to efficiently compute the gradient of the average conditional log likelihood given a set of labeled input sequences. (code)
3.7. Use your implementation of log_likelihood and gradient_log_likelihood along with a numerical optimizer to implement maximum (conditional) likelihood learning in the fit function. The reference solutions were computed using the fmin_bfgs method from scipy.optimize with default optimizer settings. It is recommended that you use this optimizer and the default settings as well to minimize any discrepancies; a minimal wiring sketch follows this question. (code)
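A minimal sketch of how the pieces might be wired together is shown below. The helper names (avg_cll, avg_cll_grad) are placeholders standing in for your own log_likelihood and gradient_log_likelihood; fmin_bfgs minimizes, so both the objective and its gradient are negated, and the default settings are used as recommended above.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def fit_sketch(X_list, y_list, C, F, avg_cll, avg_cll_grad):
    """Maximize the average conditional log likelihood with BFGS (default settings)."""
    w0 = np.zeros(C * F + C * C)                   # flat [W_F, W_T] initialization
    obj = lambda w: -avg_cll(w, X_list, y_list)    # fmin_bfgs minimizes, so negate
    grad = lambda w: -avg_cll_grad(w, X_list, y_list)
    return fmin_bfgs(obj, w0, fprime=grad)
```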
4. Prediction: To use the learned model to make predictions, implement the function predict. Given an unlabeled input sequence, this function should compute the node marginal $P_W(y_j \mid \mathbf{x})$ for every position $j$ in the label sequence conditioned on a feature sequence $\mathbf{x}$, and then predict the marginally most likely label. This is called max marginal prediction. (code)
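A minimal sketch of the decoding step is shown below, assuming the per-position log marginals are available as an (L × C) array (for example, from predict_logprob); the function name is a placeholder, not the required predict interface.

```python
import numpy as np

def max_marginal_decode(log_marginals):
    """Pick the marginally most likely label at each position.
    log_marginals: (L, C) array of log P_W(y_j = c | x)."""
    labels = log_marginals.argmax(axis=1)
    return "".join(chr(ord("a") + int(c)) for c in labels)  # map 0..25 to 'a'..'z'
```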
5. Experiments: In this problem, you will use your implementation to conduct basic learning experiments. Add your experiment code to experiment.py.
5.1. Use your CRF implementation and the first 100, 200, 300, 400, 500, 600, 700, and 800 training cases to learn eight separate models. For each model, compute the average test set conditional log likelihood and the average test set prediction error. As your answer to this question, provide two separate line plots showing average test set conditional log likelihood and average test error vs the number of training cases. (report)
5.2. Using your CRF model trained on all 800 data cases, conduct an experiment to see if the compute time needed to perform max marginal inference scales linearly with the length of the feature sequence as expected. You should experiment with sequences up to length 20. You will need to create your own longer sequences. Explain how you did so. You should also use multiple repetitions to stabilize the time estimates. As your answer to this question, provide a line plot showing the average time needed to perform marginal inference vs the sequence length. (report)
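One possible timing harness is sketched below. The routine predict_marginals is a placeholder for your own inference function, and building longer sequences by taking prefixes of a concatenation of existing test-word features is only one assumed construction; describe whatever construction you actually use in the report.

```python
import time
import numpy as np

def time_inference(predict_marginals, X_long, max_len=20, reps=20):
    """Average wall-clock time of marginal inference for prefixes of length 1..max_len."""
    times = []
    for L in range(1, max_len + 1):
        X = X_long[:L]                             # first L positions of a long sequence
        start = time.perf_counter()
        for _ in range(reps):
            predict_marginals(X)
        times.append((time.perf_counter() - start) / reps)
    return np.array(times)                         # seconds per call, indexed by length - 1
```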