Input: An alphabet A of "symbols" (of which some "source text" is composed), and a mapping f : A → ℕ (ℕ = set of natural numbers) where, for each symbol a ∈ A, f(a) is the "frequency" with which symbol a occurs in the source text.
Output: a mapping g : A → BS (BS = set of all bit strings) associating a (binary) codeword to each member of alphabet A.
Algorithm:
    Q := empty priority queue (of weighted binary trees; lower weight equals higher priority)
    do for each a ∈ A
       t := one-node tree with label a and weight f(a)
       Q.insert(t)              // insert one-node tree t into Q
    od
    do |A| - 1 TIMES
       t1 := Q.deleteMin()      // extract from Q the two trees
       t2 := Q.deleteMin()      // having lowest weights
       t := binary tree whose root is a new node having as left and right
            children the roots of t1 and t2. The weight of t is taken to be
            the sum of the weights of t1 and t2.
       Q.insert(t)              // insert t into Q
    od
    // At this point, Q contains a single tree, the leaves of which are
    // precisely the |A| nodes with which we began.
    t := lone tree in Q
    perform depth first search in t so as to compute g(a) for each a in A:
       Label each edge of t by either 0 or 1, according to whether it goes
       to a left child or a right child, respectively. Then g(a) corresponds
       to the sequence of labels on the edges along the path from the root
       of t to the leaf node labeled a.

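The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `huffman_code`, the use of the standard `heapq` module as the priority queue, and the representation of trees as nested `(left, right)` tuples are all assumptions made for the sketch. It also assumes a non-empty alphabet whose symbols are strings, and it uses a counter only to break ties between equal weights.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Map each symbol in freq (symbol -> frequency) to a codeword string."""
    tie = count()  # tie-breaker, so the heap never has to compare trees
    # A tree is either a symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(w, next(tie), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):
        w1, _, t1 = heapq.heappop(heap)   # the two trees having
        w2, _, t2 = heapq.heappop(heap)   # the lowest weights
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    root = heap[0][2]                     # the lone tree left in the queue

    codes = {}
    def walk(tree, prefix):
        # Depth-first search: 0 for a left edge, 1 for a right edge.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(root, "")
    return codes

freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
codes = huffman_code(freq)
print(sum(freq[s] * len(codes[s]) for s in freq))  # -> 468
```

Because this sketch always makes the first tree extracted the left child, the individual bits of its codewords can differ from those in the worked example in this document, whose drawings do not always follow that convention. The codeword lengths, and hence the total encoded length, are the same.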
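Example: Suppose the source text is composed of the eight symbols a through h, occurring with the frequencies shown below. The bit string in parentheses next to each symbol is its codeword under a fixed-length, three-bit encoding.

Symbol | Frequency |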
---|---|
a (000) | 50 |
b (001) | 17 |
c (010) | 9 |
d (011) | 24 |
e (100) | 60 |
f (101) | 13 |
g (110) | 4 |
h (111) | 6 |
Huffman's Algorithm would begin by constructing a forest of eight trees, each having a single node, and a priority queue filled with those trees:
Forest:

     *   *   *   *   *   *   *   *
    50  17   9  24  60  13   4   6
     a   b   c   d   e   f   g   h

Priority queue: 4{g} 6{h} 9{c} 13{f} 17{b} 24{d} 50{a} 60{e}

During the first loop iteration, the two "lightest" trees (g, of weight 4, and h, of weight 6) are removed from the priority queue and made the children of a new parent node of weight 10. The tree rooted at that new parent node is then inserted into the priority queue. That leaves this:
Forest:

                              10
                              / \
     *   *   *   *   *   *   *   *
    50  17   9  24  60  13   4   6
     a   b   c   d   e   f   g   h

Priority queue: 9{c} 10{g,h} 13{f} 17{b} 24{d} 50{a} 60{e}

The following iterations will go like this:
Forest:

                            19
                            / \
                           /  10
                          /   / \
     *   *   *   *   *   *   *   *
    50  17  24  60  13   9   4   6
     a   b   d   e   f   c   g   h

Priority queue: 13{f} 17{b} 19{c,g,h} 24{d} 50{a} 60{e}

Forest:

                            19
                            / \
                  30       /  10
                  / \     /   / \
     *   *   *   *   *   *   *   *
    50  60  24  17  13   9   4   6
     a   e   d   b   f   c   g   h

Priority queue: 19{c,g,h} 24{d} 30{b,f} 50{a} 60{e}

Forest:

                          43
                          / \
                        19   \
                        / \   \
              30       /  10   \
              / \     /   / \   \
     *   *   *   *   *   *   *   *
    50  60  17  13   9   4   6  24
     a   e   b   f   c   g   h   d

Priority queue: 30{b,f} 43{c,d,g,h} 50{a} 60{e}

Forest:

                      73
                     /  \
                    /    \
                   /      43
                  /       / \
                 /      19   \
                /       / \   \
              30       /  10   \
              / \     /   / \   \
     *   *   *   *   *   *   *   *
    50  60  17  13   9   4   6  24
     a   e   b   f   c   g   h   d

Priority queue: 50{a} 60{e} 73{b,c,d,f,g,h}

Forest:

                      73
                     /  \
                    /    \
                   /      43
                  /       / \
                 /      19   \
                /       / \   \
      110     30       /  10   \
      / \     / \     /   / \   \
     *   *   *   *   *   *   *   *
    50  60  17  13   9   4   6  24
     a   e   b   f   c   g   h   d

Priority queue: 73{b,c,d,f,g,h} 110{a,e}

Forest:

                 183
                /    \
               /      73
              /      /  \
             /      /    \
            /      /      43
           /      /       / \
          /      /      19   \
         /      /       / \   \
      110     30       /  10   \
      / \     / \     /   / \   \
     *   *   *   *   *   *   *   *
    50  60  17  13   9   4   6  24
     a   e   b   f   c   g   h   d

Priority queue: 183{a,b,c,d,e,f,g,h}

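The priority-queue contents listed at each step above can be reproduced with a short trace. The following is only a sketch: the names `queue_trace` and `show` are made up for illustration, Python's `heapq` module stands in for the priority queue, and each tree is represented just by its weight and the (single-character) symbols at its leaves, not by its shape.

```python
import heapq

def queue_trace(freq):
    """Yield the queue contents, in priority order, before the loop
    and after each iteration; entries are (weight, symbols) pairs."""
    heap = [(w, sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    yield sorted(heap)
    while len(heap) > 1:
        w1, s1 = heapq.heappop(heap)      # the two lightest trees
        w2, s2 = heapq.heappop(heap)
        merged = "".join(sorted(s1 + s2)) # symbols at the leaves of the new tree
        heapq.heappush(heap, (w1 + w2, merged))
        yield sorted(heap)

def show(state):
    """Format a state in the style used above, e.g. 10{g,h}."""
    return " ".join("%d{%s}" % (w, ",".join(syms)) for w, syms in state)

freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
for state in queue_trace(freq):
    print(show(state))
```

Running it prints eight lines, one for the initial queue and one after each of the seven iterations.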
Labeling edges going to left children by 0 and those going to right children by 1, we end up with the following mapping from symbols to binary codewords:
Symbol | Frequency | Codeword |
---|---|---|
a (000) | 50 | 00 |
b (001) | 17 | 100 |
c (010) | 9 | 1100 |
d (011) | 24 | 111 |
e (100) | 60 | 01 |
f (101) | 13 | 101 |
g (110) | 4 | 11010 |
h (111) | 6 | 11011 |
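The edge-labeling step can be illustrated directly on the final tree of the example. In this sketch the tree is written by hand as nested `(left, right)` pairs with symbols at the leaves; that representation and the names `TREE` and `codewords` are assumptions made for illustration.

```python
# Final tree of the example: root 183 has the subtree over {a, e} on the
# left and the subtree of weight 73 on the right, and so on downward.
TREE = (("a", "e"), (("b", "f"), (("c", ("g", "h")), "d")))

def codewords(tree, prefix=""):
    """Depth-first search: append 0 for a left edge and 1 for a right edge."""
    if isinstance(tree, str):          # leaf: the path so far is the codeword
        return {tree: prefix}
    left, right = tree
    result = codewords(left, prefix + "0")
    result.update(codewords(right, prefix + "1"))
    return result

for symbol, word in sorted(codewords(TREE).items()):
    print(symbol, word)
```

Since every symbol sits at a leaf, no codeword is a prefix of another, so a bit string produced with these codewords can be decoded unambiguously by walking the tree from the root.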
Now we calculate how many bits would be needed to encode the original string/file/message using these codewords. To do so, we multiply each symbol's frequency by the length of its codeword and add up the results. That is, letting freq(x) be the frequency of symbol x and len(x) be the length of its codeword, we compute the sum

$$\sum_{x \in A} \mathrm{freq}(x) \cdot \mathrm{len}(x)$$

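For our example, we get

$$50 \cdot 2 + 17 \cdot 3 + 9 \cdot 4 + 24 \cdot 3 + 60 \cdot 2 + 13 \cdot 3 + 4 \cdot 5 + 6 \cdot 5$$
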
This comes out to 468. The "uncompressed" version of the file has length 549 (183 occurrences of symbols, each encoded by a bit string of length three). So the compression ratio is 468/549, which is approximately 85%. That is, the compressed version of the file is about 85% of the size of the uncompressed version. That is not a very good compression ratio, but a poor ratio is typical when the character set is small, unless one or two characters account for a very large percentage of the total frequency. Huffman compression tends to give a much better ratio when the character set is larger and the frequencies vary widely.
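The arithmetic above is easy to check mechanically. The snippet below is a small sketch that hard-codes the frequencies and codewords from the tables in this example; the variable names are chosen only for illustration.

```python
freq = {"a": 50, "b": 17, "c": 9, "d": 24, "e": 60, "f": 13, "g": 4, "h": 6}
code = {"a": "00", "b": "100", "c": "1100", "d": "111",
        "e": "01", "f": "101", "g": "11010", "h": "11011"}

# Sum of freq(x) * len(x) over all symbols x.
compressed_bits = sum(freq[s] * len(code[s]) for s in freq)

# Fixed-length encoding: eight symbols need 3 bits each.
bits_per_symbol = (len(freq) - 1).bit_length()
uncompressed_bits = sum(freq.values()) * bits_per_symbol

ratio = compressed_bits / uncompressed_bits
print(compressed_bits, uncompressed_bits, f"{ratio:.1%}")  # -> 468 549 85.2%
```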