CMPS 340 File Processing
Encoding the Symbol-to-Codeword Mapping Induced by a Huffman Tree

Under construction

Background: Unary and Elias-Gamma Coding of Positive Integers

Elias-Gamma Coding

   n   U(⌊lg n⌋+1)   B(n)      EG(n)
   1   1             1         1
   2   01            10        010
   3   01            11        011
   4   001           100       00100
   5   001           101       00101
   6   001           110       00110
   7   001           111       00111
   8   0001          1000      0001000
   9   0001          1001      0001001
  10   0001          1010      0001010
  11   0001          1011      0001011
 ...   ...           ...       ...
  15   0001          1111      0001111
  16   00001         10000     000010000
 ...   ...           ...       ...
  31   00001         11111     000011111
  32   000001        100000    00000100000

The unary code for a positive integer n is n-1 0s followed by a 1. For example, 5 is encoded by 00001. (Of course, we could reverse the roles played by 0 and 1, in which case 5 would be encoded by 11110 instead. Also, if we wanted to be able to encode zero, we could change the scheme to using n (rather than n-1) 0s followed by a 1.) (Because both 0 and 1 are used in this scheme, it seems to be a misnomer to call it unary, but so it is.) Let U(n) denote the unary code for positive integer n.

Notice that the set of codewords in this coding scheme is { 1, 01, 001, 0001, ... } (or 0*1 as a regular expression), which is prefix-free and therefore uniquely decipherable. This is significant, because it means that if we can determine where, within a larger bit string, a unary codeword begins, we can also determine where that codeword ends (namely, at the first occurrence of a 1). When a small positive integer is to be encoded, using its unary code can be an efficient way of doing so.

When a somewhat larger-valued positive integer is to be encoded, a more efficient coding scheme is Elias-Gamma. (The unary codewords for 2 and 4 are one bit shorter than the corresponding Elias-Gamma codewords, and those for 1, 3, and 5 are of equal length, but from six onward Elias-Gamma's advantage grows as the numbers increase.) Here, a positive integer n is encoded by U(⌊lg n⌋+1) (i.e., the unary representation of one more than the floor of the base-two logarithm of n) followed by B(n), i.e., the standard binary encoding of n, with the trailing 1 in the former and the leading 1 in the latter "merged" into one symbol. To put it another way, EG(n) is B(n) preceded by a number of 0s one less than the length of B(n). Thus, the length of EG(n) is 2⌊lg n⌋ + 1. The table above shows the Elias-Gamma codewords for several numbers up to 32.
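To make the two schemes concrete, here is a minimal sketch (in Java, using plain Strings of '0' and '1' characters) of methods that compute U(n) and EG(n) for a positive integer n. The class and method names are illustrative only; they are not part of the Java Code section near the end of this document, although eliasGamma() is reused in a few of the sketches below.

public class IntegerCodes {

   // Returns the unary code U(n) of positive integer n:
   // n-1 0s followed by a 1.
   public static String unary(int n) {
      return "0".repeat(n - 1) + "1";
   }

   // Returns the Elias-Gamma code EG(n) of positive integer n:
   // B(n), the standard binary representation of n, preceded by
   // a number of 0s one less than the length of B(n).
   public static String eliasGamma(int n) {
      String b = Integer.toBinaryString(n);
      return "0".repeat(b.length() - 1) + b;
   }

   public static void main(String[] args) {
      System.out.println(unary(5));        // prints 00001
      System.out.println(eliasGamma(9));   // prints 0001001
   }
}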


Decompressing a Huffman-compressed File

Suppose that a file has been compressed using Huffman coding. When, at some later time, the file is to be decompressed, the decompressor (i.e., the "agent" that carries out the decompression) must somehow ascertain the symbol-to-codeword mapping that was used by the compressor. How?

The obvious answer is that an encoding of this mapping should appear at the beginning of the compressed version of the file. (This is an example of metadata.) That way, the decompressor can begin its work by reading the encoding of the mapping, allowing it to construct a data structure (possibly the same Huffman tree that the compressor used) that enables the efficient translation of codewords into (the native representations of) their corresponding symbols. Once the decompressor has that data structure in place, it continues by reading the compressed data (which is just a sequence of codewords), translating each successive codeword back to the symbol that it represents.
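To make the codeword-to-symbol translation concrete, here is a minimal sketch of that main decoding loop. It assumes a hypothetical SymbolTree class (not defined in this document) whose nodes provide isLeaf(), leftSubtree(), and rightSubtree(), and whose leaves also provide symbol(), yielding the leaf's symbol's native representation as a String of '0' and '1' characters.

// A sketch of the decompressor's main loop: starting at the root of the
// (reconstructed) Huffman tree, follow one child pointer per bit of
// compressed data (left on 0, right on 1) and, each time a leaf is
// reached, emit that leaf's symbol and return to the root.
public static String decode(SymbolTree root, String compressedBits) {
   StringBuilder output = new StringBuilder();
   SymbolTree node = root;
   for (int i = 0; i < compressedBits.length(); i++) {
      node = (compressedBits.charAt(i) == '0') ? node.leftSubtree()
                                               : node.rightSubtree();
      if (node.isLeaf()) {
         output.append(node.symbol());   // native representation of the symbol
         node = root;                    // begin matching the next codeword
      }
   }
   return output.toString();
}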

But how should the symbol-to-codeword mapping be encoded?

We start with the most straightforward case, which is when the source alphabet is the set of byte values (i.e., the set of bit strings of length eight). Of course, these byte values naturally map into the integers in the range 0..255 in accord with the standard binary representation of natural numbers. Thus, we identify each symbol in the alphabet with the number that it represents and refer to them as S0, S1, ..., S255. (For example, S57 is 00111001.) Similarly, we refer to the codeword associated to Sk as Ck.

Listing the Codewords

One way to encode the symbol-to-codeword mapping is with the list

C0, C1, C2, ..., C255

Of course, we must allow for the possibility —indeed, the near certainty— that not all codewords will be of the same length. Thus, the encoding must provide some way for the decompressor to find the boundaries between adjacent codewords. That can be accomplished by preceding each Ck by an encoding of its length, Lk. Since the vast majority of codewords will be of length six or greater —can you explain why?— for which Elias-Gamma coding is better than unary, we choose to use the former.

Thus, we can encode the symbol-to-codeword mapping by the bit string

EG(L0)   C0   EG(L1)   C1   EG(L2)   C2   ...   EG(L255)   C255

As an example, suppose that C0 = 0110, C1 = 101110110, and C255 = 1100010110. Then the encoding of the symbol-to-codeword mapping will look like this:

00100 0110 0001001 101110110 ... 0001010 1100010110

Spaces were inserted at the boundaries between length encodings and codewords so that the reader could make sense of this more easily.
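Here is a sketch of how a compressor might produce this encoding, assuming (hypothetically) that the 256 codewords are available as an array of Strings of '0' and '1' characters and reusing the eliasGamma() method sketched earlier:

// A sketch: encodes the mapping as EG(L0) C0 EG(L1) C1 ... EG(L255) C255,
// where Lk is the length of codeword Ck.  Assumes codeword[k] holds Ck as
// a String of '0' and '1' characters, for each k in 0..255.
public static String encodeCodewordList(String[] codeword) {
   StringBuilder sb = new StringBuilder();
   for (String c : codeword) {
      sb.append(IntegerCodes.eliasGamma(c.length()));  // EG(Lk)
      sb.append(c);                                    // Ck
   }
   return sb.toString();
}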

How many bits will this encoding require? Of course, that depends upon the distribution of codeword lengths, but, for reasons outside the scope of this document, it is safe to say that a lower bound is 3072 (which is 256×12, describing the impossibly good case in which each of the 256 codewords is of length seven, and thus the encoding of its length, EG(7), is of length five).

Your instructor believes that it can be proved that the best possible case would occur when 85 codewords were of length seven, 170 were of length nine, and the remaining one of length eight. Encoding each codeword of length seven requires 12 bits (seven bits for the codeword itself, preceded by the five bits of EG(7) to encode the codeword's length). Encoding each codeword of length nine requires 16 bits (nine for the codeword itself, preceded by seven bits for EG(9)). Finally, encoding the codeword of length eight requires 15 bits. This adds up to (85)(12) + (170)(16) + 15 = 3755 bits.

Improvement: Encode the Huffman Tree Rather than the Codewords

Tree encoded by 000110001101111
               1
              / \
             /   \
            /     \
           2       15
          / \     
         /   \ 
        /     \
       /       \
      /         \
     3           6 
    / \         / \
   /   \       /   \
  /     \     /     \
 4       5   7       14
            / \ 
           /   \
          /     \
         /       \
        /         \
       8           11 
      / \         / \
     /   \       /   \
    /     \     /     \
   9       10  12      13

Listing all the codewords results in needless redundancy. For example, suppose that 103 of the codewords begin with 0 and the remaining 153 begin with 1. In effect, a 0 (respectively, 1) at the beginning of a codeword encodes the left (respectively, right) child of the root of the Huffman Tree from which the codewords were induced. So in a list of all codewords, those two nodes are being encoded 103 and 153 times, respectively. More generally, in a list of all codewords, each node in the tree will be encoded a number of times equal to the number of leaves in the subtree of which it is the root.

Can we do better? In exploring this question, we can make the observation that every Huffman tree is a full binary tree, meaning one in which every node has either two children or no children. It turns out that we can exploit this regularity to devise a way of encoding the structure of any full binary tree by a bit string whose length equals the number of nodes in the tree.

One such encoding function f is elegantly expressed recursively:

   f(T) = 1                 if T consists of a single node
   f(T) = 0 f(TL) f(TR)     otherwise, where TL and TR are T's left and right subtrees

In words, a one-node tree is encoded by 1 and a multi-node tree is encoded by 0xy, where x and y are the encodings of its left and right subtrees, respectively.

To put it yet another way, to produce the encoding of a tree, perform a preorder traversal upon it. Each time an interior node (i.e., a non-leaf) is visited, emit 0. Each time a leaf is visited, emit 1.

As an example, consider the full binary tree pictured above, where the nodes are numbered corresponding to the order in which they would be visited in a preorder traversal. The tree is encoded by 000110001101111.

So suppose that T is the Huffman Tree that gives rise to the symbol-to-codeword mapping that the compressor will be using when applied to some particular file (based upon the frequencies with which the various byte values occur in that file, of course). If the compressor writes f(T) at the beginning of the compressed version of the file, the decompressor will be able to reconstruct T. (Algorithmic details will be shown later.)

Well, not quite, because f(T) encodes the structure of T, but it fails to capture the mapping between symbols (byte values) and leaves. That is, from f(T) we can obtain the set of codewords but we cannot tell, for any codeword, to which symbol it is associated. Obviously, the decompressor needs this information, and so the compressor must somehow encode it and include it, together with f(T), within the metadata.

This turns out not to be difficult. What the compressor can do, immediately after (or before, for that matter) writing f(T), is to write all of the Si's in the order that corresponds to the left-to-right order of the leaves to which they are associated. Or, even more convenient for the decompressor, the compressor can embed the Si's within f(T) by placing each one immediately after the 1 (within f(T)) that encodes the corresponding leaf.

All this can be encoded using 511 bits for f(T) plus 2048 bits (256·8) to list all 256 byte values, for a total of 2559 bits, which is almost 1200 bits fewer than the minimum possible arising from listing all the codewords (as seen in the previous section).

               *
              / \
             /   \
            /     \
           *      110
          / \     
         /   \ 
        /     \
       /       \
      /         \
     *           *
    / \         / \
   /   \       /   \
  /     \     /     \
000    111   *     011 
            / \ 
           /   \
          /     \
         /       \
        /         \
       *           *
      / \         / \
     /   \       /   \
    /     \     /     \
  100    010   101   001

In order to illustrate this (using a tree of reasonable size, rather than one containing 511 nodes), shown above is the 15-node tree from earlier, but this time its leaves are labeled by the bit strings of length three, which we imagine to be the source alphabet of a message to be compressed.

The symbol-to-codeword mapping induced by this tree is

{ 000→000, 001→01011, 010→01001, 011→011, 100→01000, 101→01010, 110→1, 111→001 }

By chance, two of the symbols (000 and 011) have codewords that are identical to their "native representations".

As described above, we can encode both the structure of the tree T and the mapping from symbols to leaves using the string f(T) followed by the list of symbols in left-to-right order by their corresponding leaves. This yields the bit string

000110001101111 000 111 100 010 101 001 011 110

where the first 15 bits encode the tree structure and the remaining bits enumerate the (native representations of) the symbols. Spaces are used simply to help the reader separate one logical value from another.

If we choose to embed the symbols within the encoding f(T) of the tree (by placing each symbol's native representation immediately after the 1 in f(T) representing its leaf node), we get the bit string

0001 000 1 111 0001 100 1 010 01 101 1 001 1 011 1 110

Again, the groups of bits alternate between pieces of f(T) and the (native representations of) the symbols, with the spaces serving only to aid the reader.
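Here is a sketch of how a compressor might produce this embedded encoding, again assuming the hypothetical SymbolTree class from the earlier sketch (a leaf-labeled analogue of the FullBinaryTree class shown in the Java Code section below):

// A sketch: emits f(T) with each leaf's symbol embedded immediately after
// the 1 that encodes that leaf.  symbol() yields a leaf's symbol's native
// representation as a String of '0' and '1' characters.
public static String treeWithSymbolsToBitString(SymbolTree t) {
   if (t.isLeaf()) {
      return "1" + t.symbol();
   }
   return "0" + treeWithSymbolsToBitString(t.leftSubtree())
              + treeWithSymbolsToBitString(t.rightSubtree());
}

Applied to the 15-node tree pictured above, this method would yield exactly the bit string just shown (minus the spaces).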

When Not Every Symbol Appears

Another advantage enjoyed by the approach of encoding the Huffman Tree rather than that of listing the codewords is that the former can more easily take advantage of the (very frequent) case in which the source alphabet is a proper subset of the set of all byte values. Indeed, for a typical ASCII text file, only 60-90 distinct byte values appear at all. (The "non-extended" ASCII character set includes no symbols encoded with a byte value greater than 127, which eliminates half of the 256 byte values right from the start.)

Suppose that n distinct byte values appear in a file to be compressed. Then the corresponding Huffman Tree need only contain n leaves, and thus a total of 2n-1 nodes,[1] which means that its structure can be encoded using 2n-1 bits and the list of (native representations of) symbols need only be 8n bits long. Thus, a total of 10n-1 bits suffices to describe the symbol-to-codeword mapping. If, say, n = 100, that's 999 bits, as compared to 2559 bits (as computed above) when the tree has a leaf for every one of the 256 byte values.

Except that we forgot one thing: If we are going to allow the number of bits encoding the symbol-to-codeword mapping to vary according to how many distinct byte values occur in the file to be compressed, won't the decompressor need to be informed of that? Otherwise, it would seem, the decompressor will not be capable of correctly interpreting the metadata, as it won't "know" the metadata's length (and thus won't be able to find the boundary between it and the sequence of codewords following it).

We can fix this by having the compressor write, at the very beginning of the compressed file, (an encoding of) the value of n. (Since n must be in the range 1..256, a single byte can be devoted to storing this piece of meta-metadata.)

But wait!! It turns out that the value of n need not be written explicitly into the compressed file! Why not?
Answer: Because the decompressor can determine where the bit string f(T) ends without knowing, in advance, how long it is!

How is that possible?
Answer: Because the set of strings { f(T) | T is a full binary tree } is prefix-free. That is, none of its members is a proper prefix of any other. Moreover, it consists of precisely those bit strings satisfying these two properties:

  1. the number of occurrences of 1 exceeds the number of occurrences of 0 by exactly one, and
  2. in every proper prefix, the number of occurrences of 1 is at most the number of occurrences of 0.

Thus, in any bit string z having a proper prefix that is f(T), for some full binary tree T, f(T) must be the shortest prefix of z in which the number of occurrences of 1 exceeds the number of occurrences of 0!

Although one could literally count 0's and 1's in order to find where f(T) ends, it need not be done that way in practice. The Java method shown below, bitStringToFBT(), scans the bits of f(T) while constructing T without doing any explicit counting. When the last bit of f(T) has been processed, the method's execution will have ended with T having been fully constructed.
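For illustration only, here is a sketch of the counting approach just mentioned. Given a bit string that begins with f(T) for some full binary tree T, it returns the length of f(T), namely the length of the shortest prefix in which the 1s outnumber the 0s. (The method name and the use of a plain String are, as before, just for illustration.)

// Returns the length of the shortest prefix of bits in which the number of
// 1s exceeds the number of 0s, i.e., the length of f(T) when bits begins
// with the encoding of some full binary tree T.
public static int lengthOfTreeEncoding(String bits) {
   int excess = 0;   // (# of 1s seen so far) minus (# of 0s seen so far)
   for (int i = 0; i < bits.length(); i++) {
      excess += (bits.charAt(i) == '1') ? 1 : -1;
      if (excess == 1) { return i + 1; }
   }
   throw new IllegalArgumentException("no full-binary-tree encoding found");
}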

Source Alphabets other than {0,1}^8

To keep things simple, the above discussion had as its premise that the source alphabet (i.e., the set of symbols occurring in the file to be compressed) was the set of byte values (i.e., bit strings of length eight), or some subset thereof. This is a convenient choice, since it corresponds to the compressor interpreting each successive byte in the file to be an occurrence of a symbol. Alternatively, each byte could be interpreted as being a sequence of two symbols, so that the source alphabet is (some subset of) the set of bit strings of length four (i.e., the "half-bytes", or the hexadecimal digits).

A somewhat more problematic choice would be to chop up the file into bit strings of length k, where either k>8 or k is not a divisor of 8. If the file's length n (in bits) were not divisible by k, then we would have to deal with the n mod k bits at the end of the file as a special case. One possibility would be to identify that "leftover" bit string within the metadata. Then, after the decompressor had mapped the last codeword into its corresponding symbol, it would append the leftover bit string onto the end of its output. Or one could choose to consider the leftover bit string to be one more symbol in the source alphabet and assign a codeword to it.

Another realistic possibility is that the source alphabet consists of symbols whose native representations are of wildly differing lengths. For example, it would make sense, in compressing a text file containing English prose, to choose to include full words (or even word phrases) among the symbols (e.g., "the", "at", "computer", "I", "in the", etc.)

To encode the symbol-to-codeword mapping in such a case, we could use the same approach as described above, in which the structure of the Huffman Tree is described by a bit string followed by (or interspersed with) the native representations of the symbols in the source alphabet. What would be different is that each symbol's native representation would need to be preceded by a length indicator.

Assuming that the symbols are chosen so that each one's native representation is an integral number of bytes, the length indicators could be in byte units rather than bit units. Thus, for example, the length of the native representation of "dog" would be 3 (bytes) rather than 24 (bits). This is important, because the length of EG(n) is six less than the length of EG(8n).

But why are we assuming that the length indicators are best encoded using the Elias-Gamma scheme, as opposed to the unary scheme? Well, because the only values that are encoded using fewer bits in the latter scheme are two and four, and even in these cases the advantage is only a single bit. Thus, unless the vast majority of native representations have length two or four, Elias-Gamma is a better choice.
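As a sketch of this scheme, here is how one symbol might be written, assuming (hypothetically) that its native representation is available as an array of bytes and reusing the eliasGamma() method sketched earlier. For "dog" in ASCII, this would emit EG(3) = 011 followed by the 24 bits of the three bytes.

// A sketch: encodes one symbol whose native representation occupies an
// integral number of bytes.  Emits EG(length in bytes) followed by the
// bits of the bytes themselves.
public static String encodeSymbol(byte[] nativeRep) {
   StringBuilder sb = new StringBuilder();
   sb.append(IntegerCodes.eliasGamma(nativeRep.length));     // length, in bytes
   for (byte b : nativeRep) {
      String bits = Integer.toBinaryString(b & 0xFF);
      sb.append("0".repeat(8 - bits.length())).append(bits); // pad to 8 bits
   }
   return sb.toString();
}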


Java Code

What follows is Java code that shows how to translate between a full binary tree and its encoding, going in both directions.

public abstract class FullBinaryTree {

   // Constructors cannot be declared abstract in Java, so a concrete
   // subclass (such as FullBinaryTreeX, used below) is assumed to provide
   // two constructors: one that forms a single-node tree, and one that
   // forms a tree having a root with the specified left and right subtrees.

   // Reports whether or not the tree has only one node.
   public abstract boolean isLeaf();

   // Returns the tree's left subtree (assumes !isLeaf()).
   public abstract FullBinaryTree leftSubtree(); 

   // Returns the tree's right subtree (assumes !isLeaf()).
   public abstract FullBinaryTree rightSubtree();
}

public abstract class BitString {

   // Constructors cannot be declared abstract in Java, so a concrete
   // subclass (such as BitStringX, used below) is assumed to provide two
   // constructors: one that forms a bit string of length zero, and one
   // that forms a bit string as described by a String, each character of
   // which is assumed to be either '0' or '1'.

   // Appends the given bit string to this one and returns a self-reference.
   public abstract BitString append(BitString bitStr);  

   // Appends the given bit (which is assumed to have value 0 or 1) to this
   // bit string, and returns a self-reference.
   public abstract BitString append(int bit);  

   // Returns an iterator that can iterate through the bits of this bit string.
   public abstract BitIterator iterator();
}

public interface BitIterator {

   // An interface cannot declare constructors, so an implementing class
   // (such as BitIteratorX, used below) is assumed to provide a constructor
   // that forms an object able to iterate over a specified BitString.

   // Returns the next bit of the bit string.
   public abstract int nextBit();

   // Reports whether or not there is a next bit.
   public abstract boolean hasNextBit();
}

Assume that BitStringX and FullBinaryTreeX are concrete classes that are descendants of the abstract classes shown above. To encode a full binary tree as a bit string, we can use this recursive method:

public BitString fullBinTreeToBitString(FullBinaryTree t) {

   BitString result = new BitStringX();     // empty bit string
   if (t.isLeaf()) { result.append(1); }   
   else {
      BitString left = fullBinTreeToBitString(t.leftSubtree());
      BitString right = fullBinTreeToBitString(t.rightSubtree());
      result = result.append(0).append(left).append(right);
   }
   return result;
}

Here we show how to construct a full binary tree from its encoding, assuming that BitIteratorX is a class that implements the BitIterator interface shown above:

public FullBinaryTree bitStringToFullBinTree(BitString bitString) {

   return bitStringToFBT(new BitIteratorX(bitString));
}

private FullBinaryTree bitStringToFBT(BitIterator bitIter) {

   FullBinaryTree result;
   int bit = bitIter.nextBit();
   if (bit == 1)
      { result = new FullBinaryTreeX(); }    // one-node tree
   else {
      FullBinaryTree left = bitStringToFBT(bitIter);
      FullBinaryTree right = bitStringToFBT(bitIter);
      result = new FullBinaryTreeX(left, right);
   }
   return result;
}
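Finally, here is a sketch of how a decompressor might handle the embedded-symbol encoding described earlier, in which each leaf's 8-bit symbol immediately follows the 1 that encodes the leaf. It is a variant of bitStringToFBT() and assumes a hypothetical FullBinaryTreeX constructor, not among those described above, that forms a leaf and records its symbol (given as an int in the range 0..255).

// A sketch: reconstructs a tree from an encoding in which each leaf's
// 8-bit symbol is embedded immediately after the 1 encoding that leaf.
private FullBinaryTree bitStringToTreeWithSymbols(BitIterator bitIter) {

   FullBinaryTree result;
   int bit = bitIter.nextBit();
   if (bit == 1) {
      int symbol = 0;
      for (int i = 0; i < 8; i++) {               // collect the embedded symbol
         symbol = (symbol << 1) | bitIter.nextBit();
      }
      result = new FullBinaryTreeX(symbol);       // hypothetical leaf constructor
   }
   else {
      FullBinaryTree left = bitStringToTreeWithSymbols(bitIter);
      FullBinaryTree right = bitStringToTreeWithSymbols(bitIter);
      result = new FullBinaryTreeX(left, right);
   }
   return result;
}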


Footnotes

[1] Recall that, in a full binary tree, the number of interior nodes is one fewer than the number of leaves, so a full binary tree with n leaves has n + (n-1) = 2n-1 nodes in total.