Sheet VI
Long-Range Dependencies
1) Correlation Function (2 points)
Words in natural languages exhibit both short-range local dependencies and long-range global dependencies to varying degrees. To analyse these dependencies, the temporal correlation function can be used. The correlation between two words w1 and w2 at a distance D is defined as
c_D(w_1, w_2) = \frac{P_D(w_1, w_2)}{P(w_1)\,P(w_2)}   (1)
where PD(w1,w2) is the probability of observing w2 after w1 separated by a distance D, estimated from a (continuous) text corpus, and P(w1) and P(w2) are the unigram probabilities of w1 and w2, respectively, estimated from the same text corpus. The term PD(w1,w2) can be defined as
P_D(w_1, w_2) = \frac{N_D(w_1, w_2)}{N}   (2)
where ND(w1,w2) is the total number of times w2 was observed after w1 at distance D and N is the total number of word tokens in the text corpus.
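To make the definitions concrete, the following Python sketch computes the correlation directly from equations (1) and (2). It assumes the corpus has already been pre-processed into a flat list of tokens; the name correlation mirrors the function requested in part (b) below, but this is only one possible implementation.

from collections import Counter

# Minimal sketch of equations (1) and (2); `tokens` is assumed to be the
# pre-processed training corpus as a flat list of word tokens.
def correlation(tokens, w1, w2, D):
    N = len(tokens)                     # total number of word tokens
    unigram = Counter(tokens)           # unigram counts N(w)
    # N_D(w1, w2): how often w2 occurs exactly D positions after w1
    N_D = sum(1 for i in range(N - D) if tokens[i] == w1 and tokens[i + D] == w2)
    P_D = N_D / N                       # equation (2)
    P_w1 = unigram[w1] / N
    P_w2 = unigram[w2] / N
    return P_D / (P_w1 * P_w2)          # equation (1)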
(a) Pre-process the corpus English_train.txt by applying white-space tokenization and removing punctuation. You should ignore sentence boundaries. You can use the tokenize function provided in the script.
(b) Implement a function correlation that takes two words, w1 and w2, and a distance D as inputs and returns the correlation value.
(c) Use the correlation function to compute the correlation for the word pairs ("he","his"), ("he","her"), ("she","her"), ("she","his") at distances D ∈ {1,2,...,50}. Plot the correlation vs. distance with the values obtained; one possible driver loop is sketched after this exercise. Explain your findings.
Your solution should contain the plot for (c), a sufficient explanation of the findings in this exercise, as well as the source code to reproduce the results.
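One possible driver loop for part (c) is sketched below; it assumes the tokens list and the correlation sketch from above, and that matplotlib is available for plotting.

import matplotlib.pyplot as plt

# Hypothetical driver for part (c): sweep D from 1 to 50 for each word pair
# and plot correlation against distance.
pairs = [("he", "his"), ("he", "her"), ("she", "her"), ("she", "his")]
distances = list(range(1, 51))

for w1, w2 in pairs:
    values = [correlation(tokens, w1, w2, D) for D in distances]
    plt.plot(distances, values, label=f'("{w1}", "{w2}")')

plt.xlabel("distance D")
plt.ylabel("correlation")
plt.legend()
plt.show()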
Language Models
1) Linear Interpolation (2 points)
In this exercise, you will explore simple linear interpolation in language models as an extension of the Lidstone smoothing you already implemented in assignment 3.
In the case of a higher-order n-gram language model, there is always the problem of data sparseness. One way to address this problem is to mix the higher-order n-gram estimates with lower-order probability estimates. This mixing is called interpolation. A simple way to do this is as follows:
P_I(w_2 \mid w_1) = \lambda_1 P(w_2) + \lambda_2 P(w_2 \mid w_1)   (3)
and
P_I(w_3 \mid w_1, w_2) = \lambda_1 P(w_3) + \lambda_2 P(w_3 \mid w_2) + \lambda_3 P(w_3 \mid w_1, w_2)   (4)
where 0 \le \lambda_i \le 1 and \sum_i \lambda_i = 1.
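As a rough illustration of formula (3), the sketch below combines two probability estimates with fixed weights; p_unigram and p_bigram are placeholders for the Lidstone-smoothed estimates defined in (5) and (6) below.

# Sketch of formula (3); p_unigram(w2) and p_bigram(w2, w1) are placeholders
# for the Lidstone-smoothed estimates P(w2) and P(w2|w1).
def interpolated_bigram_prob(w2, w1, p_unigram, p_bigram, lam1=0.5, lam2=0.5):
    assert abs(lam1 + lam2 - 1.0) < 1e-9    # the weights must sum to 1
    return lam1 * p_unigram(w2) + lam2 * p_bigram(w2, w1)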
(a) First, implement a function to estimate interpolated bigram probabilities according to formula (3). Your implementation can be an extension of your code from Assignment 3. Use the Lidstone smoothing technique that you implemented in Assignment 3 to obtain P(w2) and P(w2|w1), where
P(w_2) = \frac{N(w_2) + \alpha}{N + \alpha |V|}   (5)
and
P(w_2 \mid w_1) = \frac{N(w_1 w_2) + \alpha}{N(w_1) + \alpha |V|}   (6)
(b) Create a bigram language model using interpolation with λ_1 = λ_2 = 1/2 and α = 0.3 on the English_train.txt corpus.
(c) Compute the perplexity of the model on the English_test.txt corpus and compare it with the perplexity of the original model with Lidstone smoothing (α = 0.3) and without interpolation; a sketch of the perplexity computation is given after this exercise. How do you explain the differences?
Your solution should contain the perplexity values for part (c), a sufficient explanation of the findings in this exercise, as well as the source code to reproduce the results.
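A possible way to compute the perplexity needed in part (c) is sketched below; it assumes a callable prob(w1, w2) that returns either the interpolated or the plain Lidstone bigram probability, and a pre-processed token list test_tokens of English_test.txt.

import math

# Hypothetical perplexity computation: average negative log2-probability of the
# test bigrams, exponentiated back to a perplexity value.
def bigram_perplexity(test_tokens, prob):
    log_sum, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log2(prob(w1, w2))
        n += 1
    return 2.0 ** (-log_sum / n)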
2) Absolute Discounting Smoothing (4 points)
In this exercise, you will implement absolute discounting smoothing for a bigram language model with a back-off strategy. In contrast to interpolation, we only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram. When the higher-order n-gram model is a bigram LM, absolute discounting is defined as
P_{abs}(w_i \mid w_{i-1}) = \frac{\max\{N(w_{i-1} w_i) - d,\, 0\}}{N(w_{i-1})} + \lambda(w_{i-1})\, P_{abs}(w_i)   (7)
P_{abs}(w_i) = \frac{\max\{N(w_i) - d,\, 0\}}{N} + \lambda(\cdot)\, P_{unif}(w_i)   (8)
where 0 < d < 1 is the discounting parameter, P_{unif} is a uniform distribution over the vocabulary V, and \lambda(w_{i-1}) and \lambda(\cdot) are the backing-off factors defined as
\lambda(w_{i-1}) = \frac{d}{N(w_{i-1})}\, N_{1+}(w_{i-1}\,\bullet)   (9)
where
N_{1+}(w_{i-1}\,\bullet) = |\{w : N(w_{i-1} w) > 0\}|   (10)
and
\lambda(\cdot) = \frac{d}{N}\, N_{1+}   (11)
where
N_{1+} = |\{w : N(w) > 0\}|   (12)
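A compact sketch of equations (7)-(12) is given below; it assumes unigram and bigram are Counter objects with the training counts, N the total number of tokens, and V the vocabulary size. It is meant only to illustrate the formulas, not to replace the ngram_LM adaptation asked for in part (a) below.

# Sketch of equations (7)-(12); `unigram` and `bigram` are Counters of unigram
# and bigram counts, N the total token count, V the vocabulary size.
def p_abs_unigram(w, unigram, N, V, d):
    n1plus = sum(1 for u in unigram if unigram[u] > 0)        # equation (12)
    lam = d / N * n1plus                                      # equation (11)
    return max(unigram[w] - d, 0) / N + lam * (1.0 / V)       # equation (8), P_unif = 1/|V|

def p_abs_bigram(w, h, unigram, bigram, N, V, d):
    if unigram[h] == 0:                                       # unseen history: back off fully
        return p_abs_unigram(w, unigram, N, V, d)
    n1plus_h = sum(1 for (u, v) in bigram if u == h)          # equation (10)
    lam = d / unigram[h] * n1plus_h                           # equation (9)
    return (max(bigram[(h, w)] - d, 0) / unigram[h]
            + lam * p_abs_unigram(w, unigram, N, V, d))       # equation (7)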
(a) Given the equations above, implement absolute discounting smoothing and build a bigram LM using the English_train.txt corpus. Your implementation can be an extension of the ngram_LM class, or you could write your own code from scratch if that is more convenient for you. The main adaptation should be in the estimate_smoothed_prob function. For a unigram LM, consider the empty string as the n-gram history. You can use the test function to make sure that your probabilities sum up to 1.
(b) Compute and report the perplexity of the backing-off LM for the test corpus with d = 0.7.
(c) Compare the perplexity of this model with that of the best-performing model you implemented with Lidstone smoothing. Explain the difference in performance by comparing the two methods.
(d) Investigate the effect of the discounting parameter d on the perplexity of the test corpus. For each d ∈ {0.1, 0.2, ..., 0.9}, compute the perplexity and plot a line chart where the x-axis corresponds to the discounting parameter and the y-axis corresponds to the perplexity value; one possible sweep is sketched after this exercise.
Your solution should contain the plot for (d), a sufficient explanation of the findings in this exercise, as well as the source code to reproduce the results.
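The sweep in part (d) could look like the sketch below; evaluate_perplexity(d) is a hypothetical helper that builds the backing-off bigram LM with discount d on English_train.txt and returns its perplexity on English_test.txt.

import matplotlib.pyplot as plt

# Hypothetical sweep over the discounting parameter for part (d).
ds = [round(0.1 * k, 1) for k in range(1, 10)]       # 0.1, 0.2, ..., 0.9
ppls = [evaluate_perplexity(d) for d in ds]          # evaluate_perplexity is assumed

plt.plot(ds, ppls, marker="o")
plt.xlabel("discounting parameter d")
plt.ylabel("perplexity")
plt.show()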
3) Kneser-Ney Smoothing (2 points)
In this exercise, you will explore Kneser-Ney smoothing. Consider the following notation
• V - LM Vocabulary
• N(x) - Count of the n-gram x in the training corpus
• N_{1+}(\bullet w) = |\{u : N(u, w) > 0\}| - number of bigrams ending in w
• N_{1+}(w \bullet) = |\{u : N(w, u) > 0\}| - number of bigrams starting with w
• N_{1+}(\bullet w \bullet) = |\{(u, v) : N(u, w, v) > 0\}| - number of trigram types with w in the middle
For a trigram Kneser-Ney LM, the probability of a word w3 is estimated as
P_{KN}(w_3 \mid w_1, w_2) = \frac{\max\{N(w_1 w_2 w_3) - d,\, 0\}}{N(w_1 w_2)} + \lambda(w_1 w_2)\, P_{KN}(w_3 \mid w_2)   (13)
\lambda(w_1 w_2) = \frac{d}{N(w_1 w_2)}\, N_{1+}(w_1 w_2\, \bullet)   (14)
As shown above, the definition of the highest-order probability in a Kneser-Ney LM is identical to that in absolute discounting. The main difference lies in how the lower-order back-off probabilities are estimated:
P_{KN}(w_3 \mid w_2) = \frac{\max\{N_{1+}(\bullet\, w_2 w_3) - d,\, 0\}}{N_{1+}(\bullet\, w_2\, \bullet)} + \lambda(w_2)\, P_{KN}(w_3)   (15)
\lambda(w_2) = \frac{d}{N_{1+}(\bullet\, w_2\, \bullet)}\, N_{1+}(w_2\, \bullet)   (16)
The base case of this recursive definition is the unigram probability which is estimated as
P_{KN}(w_3) = \frac{N_{1+}(\bullet\, w_3)}{N_{1+}(\bullet\, \bullet)}   (17)
where N_{1+}(\bullet\, \bullet) is the number of bigram types. Contrary to absolute discounting, a Kneser-Ney LM does not interpolate with the zerogram probability. If w_3 is not in the vocabulary of the LM (i.e., OOV), then P_{KN}(w_3 \mid w_1, w_2) = P_{KN}(w_3) = \frac{1}{|V|}.
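The lower-order part of the recursion, equations (15)-(17) together with the OOV rule, could be sketched as follows; bigrams and trigrams are assumed Counters of bigram and trigram counts, V is the vocabulary as a set, and d the discount. The top level, equations (13)-(14), follows the same pattern with raw trigram counts.

# Sketch of equations (15)-(17) and the OOV rule; `bigrams` and `trigrams` are
# Counters of bigram/trigram counts, V is the vocabulary as a set, d the discount.
def p_kn_unigram(w3, bigrams, V):
    if w3 not in V:                                           # OOV: uniform over V
        return 1.0 / len(V)
    n1plus_dot_w = sum(1 for (u, v) in bigrams if v == w3)    # N_1+(. w3)
    n1plus_dot_dot = len(bigrams)                             # N_1+(. .): number of bigram types
    return n1plus_dot_w / n1plus_dot_dot                      # equation (17)

def p_kn_bigram(w3, w2, bigrams, trigrams, V, d):
    n1plus_mid = sum(1 for (u, x, v) in trigrams if x == w2)          # N_1+(. w2 .)
    if n1plus_mid == 0:
        return p_kn_unigram(w3, bigrams, V)                           # nothing to discount
    cont = sum(1 for (u, x, v) in trigrams if x == w2 and v == w3)    # N_1+(. w2 w3)
    n1plus_start = sum(1 for (u, v) in bigrams if u == w2)            # N_1+(w2 .)
    lam = d / n1plus_mid * n1plus_start                               # equation (16)
    return max(cont - d, 0) / n1plus_mid + lam * p_kn_unigram(w3, bigrams, V)   # equation (15)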
(a) From the English_train.txt corpus, collect the necessary statistics and fill in the table below; one way of collecting such counts is sketched at the end of this exercise.
w = "longbourn"
w = "pleasure"
N(w)
Discuss the differences.
(b) Describe the idea behind the Kneser-Ney smoothing technique. How does it differ from other backing-off language models?
Your solution should include the table and the source code of part (a), and a sufficient explanation of part (b).
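One way to collect counts for the table in part (a) is sketched below; it assumes tokens is the pre-processed English_train.txt, and it includes the continuation count N_1+(·w) alongside N(w) only as an assumed choice of "necessary statistics" suggested by the discussion above.

from collections import Counter

# Hypothetical statistics collection for part (a); `tokens` is the pre-processed
# training corpus. N_1+(.w) is included here only as an assumed statistic.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

for w in ["longbourn", "pleasure"]:
    n_w = unigrams[w]                                        # N(w)
    n1plus_dot_w = sum(1 for (u, v) in bigrams if v == w)    # N_1+(. w)
    print(f'w = "{w}":  N(w) = {n_w},  N_1+(.w) = {n1plus_dot_w}')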