Starting from:

$30

SENG265-Assignment 2 Lempel-Ziv-Welch Compression and Decompression Algorithm Solved

•                  Use the C programming language to write your implementation of the

Lempel-Ziv-Welch compression/decompression algorithm, in a file named “LZW.c”

•                  Test your code against the provided test data, and other test data you create, to ensure it’s working as expected. The instructor will also be testing your program against test data that isn’t provided to you. Think creatively about how to “break” and strengthen your code!

 

This assignment: “LZW.c”
 

In this assignment, we will implement Lempel–Ziv–Welch (LZW), a universal, lossless data compression algorithm. LZW is the algorithm of the Unix file compression utility, compress, and is used in the GIF image format. “Lossless” means that the compression operation preserves 100% of the information in the original format: with a correctly implemented algorithm, you can always recover the identical original data given its compressed form, even though the compressed form ​may​ be significantly smaller.

 

While the (glory) days when you could develop a company based on a compression algorithm are long-gone, there is still lasting practical value in knowing how to manipulate files of all types while implementing algorithms of increasing complexity, not to mention continuing to gain proficiency with array-based implementations -- we can achieve all of the above with a deeper dive into compression via LZW. 

 

Broadly speaking, compression is a process of recognizing recurring patterns in some piece of information to be communicated, and replacing multiple duplicate patterns with shorter references to entries in a list of such patterns. For example, imagine you have a list of numbers on a piece of paper, and you need to read it to a friend over the phone. As you’re going through the list, if you see the same number repeated enough times, you’re likely to say something like “48, 3011 ten times, 189, …” and rely on a shared understanding of natural language grammar to distinguish between the in-band data (the numbers) and the out-of-band data (the instructions about patterns). Unfortunately, digital data doesn’t have a “universal grammar” --

general-purpose data compression algorithms that are expected to operate on any type of data must be able to learn the repeated sequences in particular files automatically. As in life, “There ain't no such thing as a free lunch”. For files that contain random data, or patterns that are too diverse or too sparsely distributed, a general-purpose compression algorithm might even produce “compressed” data that’s larger than the original!

 

LZW is a method for adaptive / dynamic compression - it starts with an initial model, reads data piece by piece and then updates the model while encoding the data as it proceeds through the input file. LZW can be used for binary and text files - it works on any type of file -- that said, it is known for performing well on files or data streams with (any type of) repeated substrings (e.g. text files, which often contain frequently used words like “an”, “the”, “there”, etc.)​. 

 

With English text, the algorithm can be expected to offer 50% compression levels or greater. Binary data files, depending on structuring are capable of decent compression levels as well -  for this assignment, file size is not part of the grading criteria--in fact, the evaluation relies on your code producing exact replicas of the expected compressed files so don’t get ​too​ creative!

 

As in the previous assignment, we are interested in “round-trip” encoding and decoding, translating from the original format to encoded and back to original. I hope that some of the lessons from the previous assignment will become useful sources for your reuse. While the compression technique may be new, other details remain similar. Implementing a more complex algorithm shows how we might layer on complexity once we establish base skills.

 

Background Information
 

Dictionary Data Structure
 

A dictionary​​     is a general-purpose data structure for storing a group of objects. A dictionary has a set of keys and each key has a single associated value - the elements in this abstract data structure are commonly called, key-value pairs​​           . Dictionaries are usually implemented using hash tables, and are often known as “dict”, “hash”, “map”, or sometimes “hashmap” in different programming languages.

 

Binary Coding Schemes
The “LZW​​         ” algorithm described by Welch's 1984 article [1] encodes variable-length sequences of 8-bit bytes as fixed-length 12-bit codes.

 

As an example of a binary code, this toy coding scheme uses three-bit codes to represent a choice of eight possible symbols (the letters ‘A’ through ‘H’). The choice of symbols is arbitrary--we could have used eight different emoji. As long as the sender and receiver have a shared understanding of what the codes mean, they can communicate using compressed data.

 

binary code
symbol
000
A
001
B
010
C
011
D
100
E
101
F
110
G
111
H
 

In a binary code, each additional bit added to the code scheme doubles the number of distinct symbols that can be encoded--for an n-bit code, 2​n​ different symbols can be represented. In the classic LZW algorithm, the choice of a 12-bit coding scheme allows for 4096 distinct symbols with binary codes between 000000000000 and 111111111111. Each binary code can also be represented as three hexadecimal digits, e.g. 000 to FFF.

 

The dictionary data structure to be used for this assignment has a capacity of 4096 entries (2​12​), of which:

-        the first 256 entries, numbered 0 through 255, are reserved for every possible 1-byte value (i.e. entry #0 is for the ASCII NUL character, entry #32 is for the ASCII space character, entry #65 is for the capital letter “A”... look at man ascii​​        ! (press ‘q’ to exit the man page reader))

-        the entries numbered 256 through 4094 are assigned dynamically by the compression algorithm to represent multi-byte sequences in the input data.

-        the last entry, numbered 4095, is reserved for a padding marker

 

The code that refers to an entry in the dictionary is simply the entry’s numeric position, encoded as a 12-bit binary integer.

 

Additional Notes
 

Both encoding and decoding programs refer to the dictionary countless times during the algorithm. You may want to reference the hints in the Provided Materials section of this document.

 

The encoder looks up indexes in the dictionary by using strings. That is, to aid the look-up of indexes, we use a dictionary of strings - i.e. an array of strings. With each string having some maximum number of characters, you might consider using as your data structure, a multidimensional array -- as discussed in Unit 3, Slide 29, this has a form similar to: e.g., X[row][column], recall that we access elements of the array through integer indexes.

 

Why select such an approach? A data structure with 0(1), “Order-1 time complexity”, can be useful in practice. 

  

 

 

 

 

 

Algorithm and Example
 

LZW Encoding

 

1.     Start with a ​standard initial dictionary​ of 4096 entries (each entry can be identified using a 12-bit number)

2.     Set ​w​ to ​''

3.     As long as there are more characters to read

a.     read a character ​k

b.     if ​wk​ is in the dictionary    set ​w​ to ​wk

else

   output the code for ​w

   add ​wk​ to the dictionary, if it’s not full    set ​w​ to ​k

4.     output the code for ​w

Note #1: Step 2 is:    Set w to ​'' It is not:

   Set w to ​' '

That is, you start off with w set to nothing, not a space character. So if you append 't' to '', you end up with 't' (not ' t').

Note #2: The first 256 entries (0-255) in the LZW dictionary should be the 1-byte codes whose values are 0-255. That is,    dict[42][1] is 42 (and dict[42][0] is 1)    dict[228][1] is 228 (and dict[228][0] is 1)

In general, for 0 <= i <= 255: dict[i][1] is i (and dict[i][0] is 1)
 

LZW Decoding

 

1.     Start with a ​standard initial dictionary​ of 4096 entries (the first 256 are standard ASCII, the rest are empty).

2.     Read a code k​ ​ from the encoded file

3.     Output dict[​k​]

4.     Set ​w​ to dict[​k​]

5.     As long as there are more codes to read in the input file

a.     read a code k​

b.     if ​k​ is in the dictionary    output dict[​k​]

   add ​w​ + first character of dict[​k​] to the dict else

                                add w​ ​ + first char of w​ ​ to the dict and​ ​ output it

c.      Set ​w​ to dict[​k​]
 

The programs LZW.c takes as its first two inputs, the name of an input file as the first command line argument, and the name of an output file as the second command line argument.

If the second passed command line argument is an ‘e’, then the program creates a new file with the same name as the file in the command line, but with the extension .LZW. The new file contains the LZW encoding of the original file.

If the second passed command line argument is a ‘d’, then the program takes the name of compressed file as a command line argument (a file ending in extension, .LZW). The program interprets the original file name by dropping the .LZW file extension and then decodes the LZW encoding back to the “original.” (actually a copy).

The program should open both the input file and output file as binary files and make no assumption about what byte values appear in the files. Here's what their behaviour looks like:

 

~ /assignment2 cat tests/encode_short_string/input.txt this_is_his_history

 

~ /assignment2 ./tools/b2x ./tests/encode_short_string/input.txt

746869735F69735F6869735F686973746F72790A

 

~ /assignment2 ./LZW ./tests/encode_short_string/input.txt e

 

~ /assignment2 ./tools/b2x ./tests/encode_short_string/input.txt.LZW

07406806907305F10205F10110310707406F07207900AFFF

# different hex code means the input file and the encoded file are different

 

~ /assignment2 diff ./tests/encode_short_string/compare.txt.LZW

./tests/encode_short_string/input.txt.LZW

# no output means the files are the same

 

# rename our file so we don’t overwrite it (harder to determine if program executed correctly) mv ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old

 

# now let’s decode our compressed file, and make sure it matches the original input ~ /assignment2 ./LZW ./tests/encode_short_string/input.txt.LZW d

 

~ /assignment2 cat ./tests/encode_short_string/input.txt this_is_his_history

 

~/assignment2 diff ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old # no output means they’re the same!

 

~ /assignment2  

 
 

This session shows the contents of the file ​input.txt​. The file contains the string ​this_is_his_history​.

The hexadecimal values of the ASCII codes are also shown. Running the program ​LZW​ (with the encode flag) on ​input.txt produces a file input.ext.LZW​. The file ​input.txt.LZW​ contains 12-bit values for the LZW codes:

074 = t

068 = h

069 = i

073 = s

05F = _

102 = is

05F = _

101 = hi 103 = s_

etc.). The decode version of the program, LZW <filename d, then correctly recovers the original file name from the given input, dropping .LZW from the input file name, and recovers the original data.

 

Requirements
The requirements for this assignment are as follows:

 

1.     Use arrays (and not dynamic memory allocation!) as an aggregate data type for file processing (encoding and decoding).

2.     Create a ​main  ​function that:

a. Takes two command line arguments:

i.            A string indicating the name of the file to operate on (the “input” file);

ii.           A one-character flag indicating whether the program should operate in encoding or decoding mode. Use ‘e’ for encoding mode, or ‘d’ for decoding mode.

 

If the second argument is not provided, or is not either ‘e’ or ‘d’, iii. print “Invalid Usage, expected: LZW {input_file} [e | d]” to the console

                                     iv.    terminate the program with exit code 4.

 

b. Opens the input file indicated by the first argument.

 

If there is no filename specified:

i.           print “Error: No input file specified!” to the console

ii.         terminate the program with exit code 1.

If the input filename specified doesn’t exist or can’t be read for any reason: iii.       print “Read error: file not found or cannot be read” to the console iv. terminate the program with exit code 2.

 

c.      Opens an output file using the naming convention as specified

 

d.     Depending on the value of the second command-line argument, passes the string read from the input file to either the ​encode or ​  ​decode ​function, as described below

 

3.     Create a second function called ​encode with the following characteristics:​            

a.     Takes two C files as function parameters:

i.            an input file

ii.           an output file

 

b.     Encodes the provided file using the algorithm described above.

 

c.      If the file was encoded successfully:

i.            create the encoded representation to the file system ii.         terminate the program with exit code 0, indicating success

 

d.     If the file could not be encoded for any reason (e.g. a particular input fails):

i.            print “Error: File could not be encoded” to the console ii.         terminate the program with exit code

 

4.     Create a third function called ​decode with the following characteristics:​    

a.     Take two C files as function parameters:

i.            an input file

ii.           an output file

 

b.     Decode the provided file using the algorithm described above

 

If your program encounters invalid, corrupted or badly formatted data that cannot be decoded according to the specified algorithm, the program should:

i.            print “Error: Invalid format” to the console ii.           terminate the program with exit code 3.

 

c.      If the file was decoded successfully:

i.            write the resulting decoded representation to a file with the expected name

ii.           terminate the program with exit code 0, indicating success

 

d.     If the file could not be decoded for any reason (other than those stated above):

i.            print “Error: File could not be decoded” to the console

ii.           terminate the program with exit code 5

 

By now, you’re developing your own tests to evaluate your software (we don’t get what we ​expect​, we get what we ​inspect​) -- consequently, we would like to see at least one test input from YOU - use the testing convention described in this assignment - i.e. if your test input is expected to successfully execute, then it should follow the happy path of the algorithm. If the test input is expected to fail, then it should show a brief error message followed by exiting with error code​ 6​.

You may also think of tweaks to the algorithms that will make the encoder/decoder run faster, or make the compression better. If so, implement these ideas, and compare. But, this is not necessary. You are encouraged to devote time to testing your program.

 

Provided Materials
We have created a git repository for you - it contains a starter file, LZW-starter.c, a tool, b2x, a limited test suite, and these assignment instructions. There are also some hints in this section.

 

LZW-starter.c 
 

A “starter kit” for LZW.c for you to reuse in your implementation. Copy it to LZW.c and edit it, or copy and paste parts of it. The starter-file contains a few functions you are invited to use in your solution - these are read12(), write12() and flush12(). It is not your aim to understand the implementation of these functions (although you’re welcome to explore as you see fit) - it IS your aim to understand how you might apply. these functions in your solution - if you choose to apply them, it should have the effect of simplifying the solution. The use of flush12() might have an analogous application to the use of fflush(stdin) - while we’re on the subject: fflush() might have some use in your solution as well.

 

b2x Tool
 

We created a directory /tools and placed a source file, b2x.c, that reads a binary file and writes the “binhex” representation of a file to the screen -- binhex is short for binary-to-hexadecimal - it’s yet another encoding mechanism - you can compile and use this tool if useful (it is not required). You can also use the common unix hexdump utility for similar purposes.

 

Test Suite
 

We will supply a limited suite of test cases that you can use to evaluate your own program, and get a feel for the types of inputs you can expect. You will see the tests include both text and binary files. The test suite has the same structure as the previous assignment - a set of directories, with one test case provided per directory. The directory, as previously, is named according to the type of test - e.g. encode_text or decode_image, where the first word denotes the mode the program should be run in while running this test.

Things are not identical to the previous assignment - for example, in the case of encoding:

●       the filename passed may have ​any​ name and extension, e.g. input.ext​​         (your program is expected to read that file as its input, with ext replaced, e.g. with “txt” or “jpg”) NOTE: Your program ​does not​ need to locate the appropriate file to operate on -- you can look at the files in the test case directory, and pass the input filename as a command line parameter!

●       Continuing the example above, if your file is expected to run successfully given its input, the directory will contain a file whose name is compare.ext.​​               LZW​ where ext​​ is the original uncompressed filename extension

○    NOTE: Why “compare.ext”? -- Well, we give you the input file for testing purposes; if properly run, you should get input.ext.LZW as your output - if we provided you with this properly named file as output, then you could perhaps overwrite it accidentally, which would remove your ability to compare - hence, we prefix the outputs with “compare”.

●       Your program’s output is expected to ​precisely​ (bit-for-bit!) match the contents of this output file.

●       If your program is expected to terminate with a non-zero (unsuccessful) error code when given the provided input file, the directory will contain one of the following:

-        a file named n.​​ err.txt​, where ​n​ is the expected exit code, containing text that your program’s output is expected to precisely match, ​or;

-        a file named n.​​ err​, where n​​ is the expected exit code. In this case your output doesn’t need to match any specific error message, but your program should still exit with error code n​

-        NOTE: for YOUR test(s) that you provide, should the test(s) exit with an error code, the error code must be 6​ ​.

-        In the case of decoding, we provide input.ext.LZW and also provide compare.ext - again, you can derive the resulting output filename from the given input, and then compare accordingly.

 

 

Hints
-        Use an array of strings for the dictionary. Here's an example:

   unsigned char dict[4096][32];

-        That's an array of 4096 strings, with each string having a maximum of 32 characters. That means that your program must be careful not to add to the dictionary once it is full and not to add strings to the dictionary that are too long. The code for each entry in the dictionary is simply its position in the array. For example, the code for the entry dict[267] is 267

 

-        I strongly recommend against using the standard C null-terminated strings in the dictionary (since null might be a valid character in a LZW dictionary entry). Instead, store the length of the string as the first byte of the string like this:

   dict[i][0] = 3;    dict[i][1] = 'a';    dict[i][2] = 'b';    dict[i][3] = 'c';

You will not be able to use the built-in string functions (like strcmp, strlen, etc.) for this type of string.

Process
To get started, the assignment repository has already been set up in your person GitLab space, listed as /assignment2.git. We set this up in our local machine space with git clone​​   :

 

git  clone

https://gitlab.csc.uvic.ca/courses/2019091/SENG265/assignments/<NETLIN K_ID/assignment2.git

 

(all one line, with <NETLINK_ID replaced with your personal Netlink identifier)

 

Note that by cloning from your personal repository, this automatically sets up your remote on GitLab so when you git push​​          ​later, your changes will automatically be visible to both you and the teaching team.

 

Use your favourite text editor to write your program.

 

Once you’re able to compile your program, test it by running it with the input.<ext​​             ​files in the provided test suite, and any other files you can think of. Be creative; try to predict what kinds of inputs will break your code! Try random files lying around on your computer or the internet to see what happens!

 

If it seems like your changes are having no effect, don’t forget to recompile your program after changing it.

 

When you’ve made changes to the file that you want to keep, git commit your​ changes to your local copy of the repository. Git will save the history of past commits. When you want to upload your changes to your repository, use git push​​ .

More products