Project Description For this project, words are represented as lists of lower case symbols, e.g., the word “class" is represented as '(c l a s s ). Paragraphs are lists of words, e.g.,'((c l a s s) (i s) (r e a d y) (f o r) (w o r k)). A document is a list of paragraphs.
A document encoded with a cipher alphabet is not easy to break without the help of a computer.
You are asked to implement two different methods to break the cipher,
1. a brute force version that uses a spell checker, and
2. a version that uses knowedge about the distribution of particular letters in the English language (frequency analysis).
You will need to write two functions Gen-Decoder-A and Gen-Decoder-B that take as input a paragraph of encoded words, and return as output a function that takes as input an encoded word, and returns the plain text of the decoded word. This function can then be used to decode the entire document which may consist of multiple paragraphs. The function Code-Breaker takes an encoded document and a decoding function as input, and returns the entire document in plain text.
The two decoder functions implement different strategies to break the encrypted code. Both have advantages and disadvantages. Your Clojure program will provide an infrastructure to assess the effectiveness of these decoding approaches for different document types.
Brute Force with Spell Checker Gen_Decoder_A The algorithm for the brute force code breaker is simple. The input words are encoded for each possible mapping. There are 28! such mappings. For each mapping, a spell checker determines whether the resulting words are words in the English language. The mapping that generates the most correctly spelled words is assumed to be correct one.
For this method, you will need to implement a spell checker. You can implement your spell checker using the dictionary of words in file dictionary.cl. You will need to implemented the spell checker spell_checker that takes as input a word, and returns the truth value #t or #f. You have two options for this:
• spell_checker_0: A brute force version of the spell checker may just see whether the word occurs in the dictionary or not.
• spell_checker_1: Implements a faster search strategy of your choice.
Frequency Analysis Gen_Decoder_B This code breaker is based on the fact, that in the English language some letters occur more often than others. Knowing the distribution of the different letters, this method just counts the number of occurrences of each letter in the encoded words. If there are enough words to be statistically relevant, the letter distribution can be used to identify the most common letters in English, namely 'e', 't', 'a', ‘o’, ‘i’ and ‘n’. In other words, the assumption is that the most frequent encoded letter has to correspond to the letter 'e', etc. This means that one character in the mapping is decided. One can continue on the second and third letters similarly. You are asked to implement two types of this decoder.
• Gen_Decoder_B_0: Assume that the most frequent six letters are (in the correct order) 'e', 't', 'a', ‘o’, ‘i’ and ‘n’. After doing the frequency analysis for these six letters and establishing the mapping, you are asked to determine the rest of the letters using a brute force method.
• Gen_Decoder_B_1: Implements a faster strategy. It assumes that the first six most frequent letters are the same as above. But this thime, we decide on the other letters by looking at the co-occurance of letters. For example, “is” occurs much more frequent than
“ig”.
Implementation For the implementation of these functions, you can use standard built-in Clojure functions such as first etc. You must not use any global variables.
How to Get Started Copy the files decode.clj, include.clj, dictionary1.txt, dictionary2.txt, document1.txt and document2.txt into your own subdirectory. You will implement your code breaker in file decode.clj. File dictionary1.txt contains a small dictionary consisting of only a few words. Use it to debug your program. File dictionary2.txt contains a list of over 45,000 words allowing you to generate a realistic spell checker. You must not change this file. Finally, file document1.txt and document2.txt contains two sample documents to test your encoder, decoder, etc. You should use your own test cases as well.