Huffman's algorithm for assigning codewords to symbols

Input: An alphabet A of "symbols" (of which some "source text" is composed), and a mapping f : A → ℕ (ℕ = set of natural numbers) where, for each symbol a ∈ A, f(a) is the "frequency" with which symbol a occurs in the source text.

Output: a mapping g : A → BS (BS = set of all bit strings) associating a (binary) codeword to each member of alphabet A.

Algorithm:

Q := empty priority queue (of weighted binary trees; lower weight equals higher priority)
do for each a ∈ A
   t := one-node tree with label a and weight f(a)
   Q.insert(t)             // insert one-node tree t into Q
od
do |A| - 1 TIMES
   t1 := Q.deleteMin()     // extract from Q the two trees
   t2 := Q.deleteMin()     // having lowest weights
   t := binary tree whose root is a new node having as left and right
        children the roots of t1 and t2. The weight of t is taken to be
        the sum of the weights of t1 and t2.
   Q.insert(t)             // insert t into Q
od
// At this point, Q contains a single tree, the leaves of which are
// precisely the |A| nodes with which we began.
t := lone tree in Q
perform depth-first search in t so as to compute g(a) for each a in A:
   Label each edge of t by either 0 or 1, according to whether it goes to
   a left child or a right child, respectively. Then g(a) corresponds to
   the sequence of labels on the edges along the path from the root of t
   to the leaf node labeled a.
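The pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function name huffman_codes, the representation of a tree as either a symbol or a (left, right) pair, and the insertion-order tie-breaking counter are all assumptions made for illustration. It also assumes a non-empty alphabet whose symbols are not themselves tuples, and it always places the first-extracted tree on the left.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Map each symbol in freq to a codeword (a string of '0' and '1')."""
    order = count()  # assumption: ties in weight are broken by insertion order
    # Heap entries are (weight, order, tree). A tree is either a symbol
    # (a leaf) or a (left, right) pair (an internal node).
    heap = [(w, next(order), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    for _step in range(len(freq) - 1):
        w1, _, t1 = heapq.heappop(heap)  # the two trees of lowest weight
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (t1, t2)))
    root = heap[0][2]  # the lone tree left in the queue

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node
            walk(tree[0], prefix + "0")  # edge to the left child is labeled 0
            walk(tree[1], prefix + "1")  # edge to the right child is labeled 1
        else:                            # leaf: the path so far is g(symbol)
            codes[tree] = prefix

    walk(root, "")
    return codes

# Frequencies taken from the example in this document.
freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
codes = huffman_codes(freq)
for sym in sorted(codes):
    print(sym, codes[sym])
```

On the example frequencies this should give codewords of the same lengths as in the example's codeword table. Individual bits may differ, because the example's drawing does not always put the first-extracted tree on the left (for instance, f is extracted before b, yet b is drawn as the left child).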

Example

Suppose we have a string/file/message containing occurrences of eight different symbols (e.g., corresponding to the bit strings of length three). Call them a, b, c, ..., h. The table below shows their frequencies.

Symbol	Frequency
a (000)	50
b (001)	17
c (010)	9
d (011)	24
e (100)	60
f (101)	13
g (110)	4
h (111)	6

Huffman's Algorithm would begin by constructing a forest of eight trees, each having a single node, and a priority queue filled with those trees:

Initially
Forest	Priority Queue
a:50 b:17 c:9 d:24 e:60 f:13 g:4 h:6	4{g} 6{h} 9{c} 13{f} 17{b} 24{d} 50{a} 60{e}

(In the Forest column, trees are listed left to right; a leaf is written symbol:frequency and an internal node is written weight(left subtree,right subtree).)

During the first loop iteration, the two "lightest" trees will be connected to a new parent node and replaced on the priority queue with the tree rooted at that new parent node. That will leave this:

After 1st iteration
Forest	Priority Queue
a:50 b:17 c:9 d:24 e:60 f:13 10(g:4,h:6)	9{c} 10{g,h} 13{f} 17{b} 24{d} 50{a} 60{e}

The following iterations will go like this:

After 2nd iteration
Forest	Priority Queue
a:50 b:17 d:24 e:60 f:13 19(c:9,10(g:4,h:6))	13{f} 17{b} 19{c,g,h} 24{d} 50{a} 60{e}

After 3rd iteration
Forest	Priority Queue
a:50 e:60 d:24 30(b:17,f:13) 19(c:9,10(g:4,h:6))	19{c,g,h} 24{d} 30{b,f} 50{a} 60{e}

After 4th iteration
Forest	Priority Queue
a:50 e:60 30(b:17,f:13) 43(19(c:9,10(g:4,h:6)),d:24)	30{b,f} 43{c,d,g,h} 50{a} 60{e}

After 5th iteration
Forest	Priority Queue
a:50 e:60 73(30(b:17,f:13),43(19(c:9,10(g:4,h:6)),d:24))	50{a} 60{e} 73{b,c,d,f,g,h}

After 6th iteration
Forest	Priority Queue
110(a:50,e:60) 73(30(b:17,f:13),43(19(c:9,10(g:4,h:6)),d:24))	73{b,c,d,f,g,h} 110{a,e}

After 7th iteration
Forest	Priority Queue
183(110(a:50,e:60),73(30(b:17,f:13),43(19(c:9,10(g:4,h:6)),d:24)))	183{a,b,c,d,e,f,g,h}
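The Priority Queue column of the tables above can be reproduced with a short trace. The sketch below is illustrative only: the helper name show and the choice to store each tree as a (weight, leaves) pair are assumptions, and the tree shapes themselves are not tracked. Because no two trees in the queue have equal weight at any step of this example, no tie-breaking rule is needed here.

```python
import heapq

# Frequencies taken from the example in this document.
freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}

def show(queue):
    """Format the queue in the weight{leaves} style used in the tables."""
    parts = []
    for weight, leaves in sorted(queue):
        parts.append(str(weight) + "{" + ",".join(sorted(leaves)) + "}")
    return " ".join(parts)

# Each queue entry is (weight, leaves), where leaves is a tuple of symbols.
queue = [(w, (sym,)) for sym, w in freq.items()]
heapq.heapify(queue)

trace = [show(queue)]            # trace[0] is the initial queue
while len(queue) > 1:
    w1, leaves1 = heapq.heappop(queue)   # the two lightest trees
    w2, leaves2 = heapq.heappop(queue)
    heapq.heappush(queue, (w1 + w2, leaves1 + leaves2))
    trace.append(show(queue))    # trace[k] is the queue after iteration k

for step, line in enumerate(trace):
    print(step, line)
```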

Labeling edges going to left children by 0 and those going to right children by 1, we end up with the following mapping from symbols to binary codewords:

Symbol	Frequency	Codeword
a (000)	50	00
b (001)	17	100
c (010)	9	1100
d (011)	24	111
e (100)	60	01
f (101)	13	101
g (110)	4	11010
h (111)	6	11011
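As a check on the table above, the edge labeling can be sketched as a depth-first traversal of the final tree. The nested-pair encoding of the tree and the function name codewords are assumptions made for this illustration; the shape of the tree is the one obtained after the 7th iteration.

```python
# The final tree of the example as nested (left, right) pairs; leaves are
# symbol names. Root 183 has 110 (a, e) on the left and 73 on the right.
tree = (("a", "e"), (("b", "f"), (("c", ("g", "h")), "d")))

def codewords(node, prefix=""):
    """Yield (symbol, codeword) pairs by depth-first search from the root."""
    if isinstance(node, str):        # leaf: the labels seen so far form its codeword
        yield node, prefix
    else:
        left, right = node
        yield from codewords(left, prefix + "0")    # left edge is labeled 0
        yield from codewords(right, prefix + "1")   # right edge is labeled 1

table = dict(codewords(tree))
for symbol in sorted(table):
    print(symbol, table[symbol])
```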

Now we calculate how many bits would be needed to encode the original string/file/message using these codewords. To do so, we multiply each symbol's frequency by the length of its codeword. That is, letting freq(x) be the frequency of symbol x and len(x) be the length of its codeword, we compute the sum

Σ_{x ∈ {a,b,...,h}} freq(x)·len(x)

For our example, we get

(50)(2) + (17)(3) + (9)(4) + (24)(3) + (60)(2) + (13)(3) + (4)(5) + (6)(5)

This comes out to 468 bits. The "uncompressed" version of the file has length 549 bits (183 occurrences of symbols, each encoded by a bit string of length three). So the compression ratio is 468/549, which is approximately 85%. That is, the compressed version of the file is about 85% of the size of the uncompressed version. That is not a very good compression ratio, but a good ratio is not typical when the character set is small, unless one or two characters account for a very large percentage of the total frequency. Huffman compression tends to give a much better ratio when the character set is larger and the frequencies vary widely.
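The arithmetic above can be checked with a few lines of Python. The variable names are assumptions chosen for this sketch; the frequencies and codewords are copied from the example's tables.

```python
# Frequencies and codewords copied from the example's tables.
freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
code = {"a": "00", "b": "100", "c": "1100", "d": "111",
        "e": "01", "f": "101", "g": "11010", "h": "11011"}

# Sum of freq(x) * len(x) over all symbols x.
compressed_bits = sum(freq[x] * len(code[x]) for x in freq)

# Fixed-length encoding: three bits per symbol occurrence.
fixed_bits = sum(freq.values()) * 3

ratio = compressed_bits / fixed_bits
print(compressed_bits, fixed_bits, f"{ratio:.1%}")  # 468 549 85.2%
```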