CMPS 144L Lab Activity
Huffman Coding

Symbol     Frequency    Codeword
a (000)       26%
b (001)        6%
c (010)        2%
d (011)       15%
e (100)       28%
f (101)        8%
g (110)        4%
h (111)       11%
 
Suppose that we wish to compress a file and that we are interpreting its contents as a sequence of 3-bit blocks; that is, we are viewing it as a sequence composed of the "symbols" in the set

{000, 001, 010, 011, 100, 101, 110, 111}

consisting of all the bit strings of length three. For the sake of convenience, we will use the names a through h to refer to these eight symbols, respectively. (Also for the sake of convenience, we will assume that the file's length, in bits, is divisible by three.)
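As an illustration of this interpretation, the following sketch names each 3-bit block a through h according to its value (the bit string below is an arbitrary example, not taken from any particular file):

```python
# Interpret a bit string as a sequence of 3-bit blocks, naming each
# block a..h according to its numeric value (000 -> a, ..., 111 -> h).
bits = "000001010011100101110111"   # illustrative; length divisible by 3
names = "abcdefgh"
symbols = [names[int(bits[i:i+3], 2)] for i in range(0, len(bits), 3)]
```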

The frequencies with which the symbols occur in the file are given by the table above.

(a) Build a Huffman Tree based upon the given frequencies.

(b) Now use that Huffman Tree to assign codewords to the symbols a through h, and fill in the third column of the table accordingly.
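Parts (a) and (b) can also be carried out mechanically, as in the following Python sketch. (The function name huffman_codes and the list-based tree representation are illustrative choices, not part of the activity.)

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from {symbol: weight} and return the
    resulting {symbol: codeword} assignment."""
    tiebreak = itertools.count()   # keeps heapq from comparing trees on ties
    # Each heap entry is (weight, tiebreak, tree); a tree is either a
    # symbol (leaf) or a [left, right] pair (internal node).
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two lightest trees...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), [t1, t2]))  # ...merged
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):         # leaf: record its codeword
            codes[tree] = prefix
        else:                             # internal node: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

freqs = {"a": 26, "b": 6, "c": 2, "d": 15,
         "e": 28, "f": 8, "g": 4, "h": 11}
codes = huffman_codes(freqs)
```

Because ties among equal weights (and the left/right placement of subtrees) may be broken differently from one construction to the next, the individual codewords can differ from yours; any valid Huffman assignment, however, yields the same average codeword length, and hence the same answer to part (c).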

(c) Compute the ratio between the length of the resulting compressed version of the file and the length of the original file. Of course, this involves computing the average number of bits, per symbol occurrence, that would be used in encoding the file using the codewords. We obtain that by multiplying each symbol's frequency by the length of its codeword and adding the resulting products. In mathematical notation, it is given by the expression

       Σ      freq(x)·len(x)
  x = a, b, ..., h

where freq(x) is the frequency of symbol x and len(x) is the length of the codeword for x. Since each symbol occupies three bits in the original file, dividing this average by three yields the desired ratio. (This is an example of an expected value, as was (or will be) discussed in the context of hashing.)
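The arithmetic in part (c) can be checked without writing any codewords down, using the fact that each occurrence of a symbol x contributes one bit for each internal node above x's leaf; hence the sum Σ freq(x)·len(x) equals the sum of the weights of the internal nodes created during the merging. A sketch of that check (the percentages are read straight from the table):

```python
import heapq

weights = [26, 6, 2, 15, 28, 8, 4, 11]   # percentages for a..h, from the table
heap = list(weights)
heapq.heapify(heap)

internal_total = 0
while len(heap) > 1:
    # Merging the two lightest trees creates one new internal node
    # whose weight is their sum.
    merged = heapq.heappop(heap) + heapq.heappop(heap)
    internal_total += merged
    heapq.heappush(heap, merged)

avg_bits = internal_total / 100   # expected codeword length, bits per symbol
ratio = avg_bits / 3              # compressed length / original length
```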