Huffman's algorithm for assigning codewords to symbols

Input: An alphabet A of "symbols" (of which some "source text" is composed), and a mapping f : A → ℕ (ℕ = set of natural numbers) where, for each symbol a ∈ A, f(a) is the "frequency" with which symbol a occurs in the source text.

Output: a mapping g : A → BS (BS = set of all bit strings) associating a (binary) codeword to each member of alphabet A.

Algorithm:

Q := empty priority queue   // of weighted binary trees; lower weight
                            // means higher priority
do for each a ∈ A
   t := one-node tree with label a and weight f(a)
   Q.insert(t)              // insert one-node tree t into Q
od

do |A| - 1  TIMES
   t1 := Q.deleteMin();     // extract from Q the two trees 
   t2 := Q.deleteMin();     // having lowest weights
   t := binary tree whose root is a new node having as left and right
        children the roots of t1 and t2 (in either order; the choice
        affects the codewords but not their lengths).  The weight of t
        is taken to be the sum of the weights of t1 and t2.
   Q.insert(t)              // insert t into Q
od

// At this point, Q contains a single tree, the leaves of which are
// precisely the |A| nodes with which we began.

t := lone tree in Q
perform a depth-first search of t so as to compute g(a) for each a ∈ A:
   Label each edge of t by either 0 or 1, according to whether it goes
   to a left child or a right child, respectively.  Then g(a) corresponds
   to the sequence of labels on the edges along the path from the root of
   t to the leaf node labeled a. 
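
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name huffman_codes, the tuple representation of trees, and the use of the standard heapq module as the priority queue are all assumptions made for this sketch. Here t1 always becomes the left child and t2 the right child.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Map each symbol of freq (symbol -> frequency) to a bit-string codeword."""
    order = count()  # tie-breaker so the heap never has to compare two trees
    # Queue entries are (weight, tie-breaker, tree); a tree is either
    # ("leaf", symbol) or ("node", left_subtree, right_subtree).
    queue = [(w, next(order), ("leaf", sym)) for sym, w in freq.items()]
    heapq.heapify(queue)

    for _ in range(len(freq) - 1):
        w1, _n1, t1 = heapq.heappop(queue)  # the two trees of lowest weight
        w2, _n2, t2 = heapq.heappop(queue)
        heapq.heappush(queue, (w1 + w2, next(order), ("node", t1, t2)))

    codes = {}
    if not queue:            # empty alphabet: nothing to encode
        return codes
    stack = [(queue[0][2], "")]  # depth-first search starting at the root
    while stack:
        tree, prefix = stack.pop()
        if tree[0] == "leaf":
            codes[tree[1]] = prefix
        else:
            stack.append((tree[1], prefix + "0"))  # edge to left child
            stack.append((tree[2], prefix + "1"))  # edge to right child
    return codes
```

Because the left/right choice is arbitrary, the codewords this sketch produces need not match those of a hand-drawn tree, but the codeword lengths will. Note that a one-symbol alphabet gets the empty codeword here, a corner case the pseudocode does not address.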


Example

Suppose we have a string/file/message containing occurrences of eight different symbols (e.g., corresponding to the bit strings of length three). Call them a, b, c, ..., h. The following table shows their frequencies.

Symbol     Frequency
a (000)        50
b (001)        17
c (010)         9
d (011)        24
e (100)        60
f (101)        13
g (110)         4
h (111)         6

Huffman's Algorithm would begin by constructing a forest of eight trees, each having a single node, and a priority queue filled with those trees:

Initially
Forest (top) and priority queue (bottom line):
 *    *    *    *    *    *    *    *
50   17    9   24   60   13    4    6
 a    b    c    d    e    f    g    h
4{g} 6{h} 9{c} 13{f} 17{b} 24{d} 50{a} 60{e}

During the first loop iteration, the two "lightest" trees (g, of weight 4, and h, of weight 6) are connected to a new parent node of weight 10 and replaced in the priority queue by the tree rooted at that new parent node. That leaves this:

After 1st iteration
Forest (top) and priority queue (bottom line):
                                 10
                                 /\
                                /  \
 *    *    *    *    *    *    *    *
50   17    9   24   60   13    4    6
 a    b    c    d    e    f    g    h
9{c} 10{g,h} 13{f} 17{b} 24{d} 50{a} 60{e}

The following iterations will go like this:

After 2nd iteration
Forest (top) and priority queue (bottom line):
                               19
                               /\
                              /  \
                             /   10
                            /    /\
                           /    /  \
 *    *    *    *    *    *    *    *
50   17   24   60   13    9    4    6
 a    b    d    e    f    c    g    h
13{f} 17{b} 19{c,g,h} 24{d} 50{a} 60{e}

After 3rd iteration
Forest (top) and priority queue (bottom line):
                               19
                               /\
                              /  \
                  30         /   10
                  /\        /    /\
                 /  \      /    /  \
 *    *    *    *    *    *    *    *
50   60   24   17   13    9    4    6
 a    e    d    b    f    c    g    h
19{c,g,h} 24{d} 30{b,f} 50{a} 60{e}

After 4th iteration
Forest (top) and priority queue (bottom line):
                             43
                             /\ 
                            /  \
                          19    \
                          /\     \
                         /  \     \
             30         /   10     \
             /\        /    /\      \
            /  \      /    /  \      \
 *    *    *    *    *    *    *      *
50   60   17   13    9    4    6     24
 a    e    b    f    c    g    h      d
30{b,f} 43{c,d,g,h} 50{a} 60{e}

After 5th iteration
Forest (top) and priority queue (bottom line):
                        73
                       /  \
                      /    \
                     /      \ 
                    /        43
                   /         /\ 
                  /         /  \
                 /        19    \
                /         /\     \
               /         /  \     \
             30         /   10     \
             /\        /    /\      \
            /  \      /    /  \      \
 *    *    *    *    *    *    *      *
50   60   17   13    9    4    6     24
 a    e    b    f    c    g    h      d
50{a} 60{e} 73{b,c,d,f,g,h}

After 6th iteration
Forest (top) and priority queue (bottom line):
                        73
                       /  \
                      /    \
                     /      \ 
                    /        43
                   /         /\ 
                  /         /  \
                 /        19    \
                /         /\     \
               /         /  \     \
  110        30         /   10     \
   /\        /\        /    /\      \
  /  \      /  \      /    /  \      \
 *    *    *    *    *    *    *      *
50   60   17   13    9    4    6     24
 a    e    b    f    c    g    h      d
73{b,c,d,f,g,h} 110{a,e}

After 7th iteration
Forest (top) and priority queue (bottom line):
  
                  183
                 /   \
                /     \
               /       \
              /         73
             /         /  \
            /         /    \
           /         /      \ 
          /         /        43
         /         /         /\ 
        /         /         /  \
       /         /        19    \
      /         /         /\     \
     /         /         /  \     \
  110        30         /   10     \
   /\        /\        /    /\      \
  /  \      /  \      /    /  \      \
 *    *    *    *    *    *    *      *
50   60   17   13    9    4    6     24
 a    e    b    f    c    g    h      d
183{a,b,c,d,e,f,g,h}
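
The priority queue contents listed after each iteration above can be reproduced with a short Python sketch. The function name trace_queue and the choice to represent each tree only by its weight and the set of symbols at its leaves are assumptions made for this illustration; the tree shapes themselves are not tracked.

```python
import heapq

def trace_queue(freq):
    """Return the queue contents after each merge, one string per iteration."""
    # Each entry is (weight, sorted tuple of the symbols at the tree's leaves).
    queue = [(w, (sym,)) for sym, w in freq.items()]
    heapq.heapify(queue)
    snapshots = []
    while len(queue) > 1:
        w1, s1 = heapq.heappop(queue)   # lightest tree
        w2, s2 = heapq.heappop(queue)   # second-lightest tree
        heapq.heappush(queue, (w1 + w2, tuple(sorted(s1 + s2))))
        # Format the queue in increasing order of weight, e.g. "10{g,h}".
        snapshots.append(
            " ".join("%d{%s}" % (w, ",".join(s)) for w, s in sorted(queue))
        )
    return snapshots

FREQ = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
for i, line in enumerate(trace_queue(FREQ), start=1):
    print(i, line)
# 1 9{c} 10{g,h} 13{f} 17{b} 24{d} 50{a} 60{e}
# ...
# 7 183{a,b,c,d,e,f,g,h}
```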

Labeling edges going to left children by 0 and those going to right children by 1, we end up with the following mapping from symbols to binary codewords:

Symbol     Frequency   Codeword
a (000)        50      00
b (001)        17      100
c (010)         9      1100
d (011)        24      111
e (100)        60      01
f (101)        13      101
g (110)         4      11010
h (111)         6      11011
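
Because every symbol sits at a leaf of the tree, no codeword is a prefix of another, so an encoded bit string can be decoded by reading bits until they match a codeword. The following Python sketch illustrates this with the codewords from the table; the names CODE, encode, decode, and is_prefix_free are hypothetical helpers introduced only for this example.

```python
CODE = {"a": "00", "b": "100", "c": "1100", "d": "111",
        "e": "01", "f": "101", "g": "11010", "h": "11011"}

def encode(text, code=CODE):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[ch] for ch in text)

def decode(bits, code=CODE):
    """Invert encode(), relying on the code being prefix-free."""
    inverse = {cw: sym for sym, cw in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    # After sorting, a word that is a prefix of any other word is
    # necessarily a prefix of its immediate successor.
    words = sorted(code.values())
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(CODE))            # True
print(encode("head"))                  # 110110100111
print(decode(encode("badge")))         # badge
```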

Now we calculate how many bits would be needed to encode the original string/file/message using these codewords. To do so, we multiply each symbol's frequency by the length of its codeword. That is, letting freq(x) be the frequency of symbol x and len(x) be the length of its codeword, we compute the sum

Σ_{x ∈ {a, b, ..., h}} freq(x)·len(x)

For our example, we get

(50)(2) + (17)(3) + (9)(4) + (24)(3) + (60)(2) + (13)(3) + (4)(5) + (6)(5)

This comes out to 468 bits. The "uncompressed" version of the file has length 549 bits (183 occurrences of symbols, each encoded by a bit string of length three). So the compression ratio is 468/549, which is approximately 85%. That is, the compressed version of the file is about 85% of the size of the uncompressed version. That is not a very good compression ratio, which is typical when the character set is small, unless one or two characters account for a very large share of the total frequency. Huffman compression tends to give a much better ratio when the character set is larger and the frequencies vary widely.
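
The arithmetic above can be checked with a few lines of Python. The sketch below also uses the fact that the total encoded length equals the sum of the weights of all merged (internal) nodes, which gives the total without building any codewords. The names encoded_bits and huffman_bits are hypothetical, and SKEWED is a made-up frequency list (also totaling 183 symbols) included only to illustrate how a dominant symbol improves the ratio.

```python
import heapq

FREQ = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
CODE = {"a": "00", "b": "100", "c": "1100", "d": "111",
        "e": "01", "f": "101", "g": "11010", "h": "11011"}

def encoded_bits(freq, code):
    """Sum of freq(x) * len(x) over all symbols x."""
    return sum(freq[s] * len(code[s]) for s in freq)

def huffman_bits(weights):
    """Total encoded length under a Huffman code, without building codewords."""
    queue = list(weights)
    heapq.heapify(queue)
    total = 0
    while len(queue) > 1:
        merged = heapq.heappop(queue) + heapq.heappop(queue)
        total += merged  # each leaf under the new root becomes one bit deeper
        heapq.heappush(queue, merged)
    return total

SKEWED = [150, 10, 8, 5, 4, 3, 2, 1]   # hypothetical, one dominant symbol

fixed = 3 * sum(FREQ.values())          # three bits per symbol
print(encoded_bits(FREQ, CODE), fixed)              # 468 549
print(round(encoded_bits(FREQ, CODE) / fixed, 3))   # 0.852
print(huffman_bits(FREQ.values()))                  # 468
print(round(huffman_bits(SKEWED) / fixed, 3))       # 0.486
```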