Huffman coding

If we have an alphabet of symbols where each symbol has an associated frequency, Huffman coding is an algorithm that creates an encoding of these symbols into bit strings. This encoding has a nice optimality property: it minimizes the sum, over all symbols, of the frequency of a symbol times the length of its encoding. In practice, this is useful for lossless data compression: if you assume that the symbols are independent and identically distributed, Huffman coding provides you with the optimal "fixed" encoding, one in which every occurrence of a symbol is encoded by the same bit string. In contrast, if we allow the encoding of a symbol to differ between occurrences, arithmetic coding is optimal.

Before defining Huffman trees and investigating some of their properties, we need some definitions: An alphabet $A$ is a finite set of symbols. A binary tree is full if every node has either zero or two children. The depth of a node is the number of edges on the path from the root to that node.

Now, a Huffman tree for an alphabet $A$ is a full binary tree with a one-to-one correspondence between its leaf nodes and the symbols in $A$.

By convention, we associate the left child of a node with a 0 and the right child of a node with a 1. A Huffman tree provides us with an encoding $E(s)$ of a symbol $s$. We can find it by starting at the root and following the unique downward path to the leaf node associated to $s$. We start with an empty bit string; if we take the left child node, we append a 0 to the end of the bit string, and if we take the right child node, we append a 1 to the end. From this, we see that the length $|E(s)|$ of the encoding of a symbol $s$ equals the depth of the leaf node associated to $s$.

Example: consider the tree whose root has the leaf for C as its left child, and whose right child is an internal node with the leaves for D and E as its left and right children. This is a Huffman tree for the alphabet $A = \{C, D, E\}$. We have $E(C) = 0$, $E(D) = 10$ and $E(E) = 11$.
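To make this concrete, here is a minimal sketch in Python of how one could walk a Huffman tree and read off the encodings; the `Node` class and the hard-coded example tree are illustrative assumptions, not fixed by anything above:

	from dataclasses import dataclass
	from typing import Optional

	@dataclass
	class Node:
		symbol: Optional[str] = None      # set for leaf nodes only
		left: Optional["Node"] = None
		right: Optional["Node"] = None

	def code_table(node, prefix=""):
		# map every symbol to the bit string of its root-to-leaf path
		if node.symbol is not None:       # leaf: the path is complete
			return {node.symbol: prefix}
		left = code_table(node.left, prefix + "0")    # left edge appends a 0
		right = code_table(node.right, prefix + "1")  # right edge appends a 1
		return {**left, **right}

	# The example tree: C below the root's left edge, D and E below its right child.
	tree = Node(left=Node(symbol="C"),
	            right=Node(left=Node(symbol="D"), right=Node(symbol="E")))
	print(code_table(tree))  # {'C': '0', 'D': '10', 'E': '11'}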

The encoding process defines a mapping between each node and the bit string describing the path that you need to follow from the root node to end up at this node. In this mapping, the bit string associated to a node $v$ is a prefix of the bit string associated to another node $w$ if and only if $w$ is a descendant of $v$. By our definition of a Huffman tree, only leaf nodes have symbols associated to them, and a leaf node has no descendants. So, if $T$ is a Huffman tree for an alphabet $A$ and $s$ and $t$ are different symbols from $A$, then $E(s)$ and $E(t)$ are never a prefix of each other.

Theorem: If $T$ is a Huffman tree for an alphabet $A$ and $s, t \in A$ are two different symbols, then $E(s)$ is not a prefix of $E(t)$.

Encodings that have this property are also called prefix-free codes, or simply prefix codes.

The prefix-free property allows us to decode a bit string created by concatenating encoded symbols. If we had two symbols $s$ and $t$ for which $E(s)$ was a prefix of $E(t)$, we would not know how to decode a bit string that starts with $E(t)$: it could either encode $t$ (followed by bits from other encoded symbols), or encode $s$ (followed by bits from other encoded symbols).

Example: with the tree from before, the string CDE is encoded as the concatenation of $E(C) = 0$, $E(D) = 10$ and $E(E) = 11$, giving the bit string 01011. To decode it, we start at the root and follow the left or right child for every 0 or 1 we read; every time we reach a leaf, we output its symbol and jump back to the root.
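Continuing the previous Python sketch (same hypothetical `Node`, `code_table` and `tree`), encoding and decoding could look like this:

	def encode(table, symbols):
		# concatenate the encodings of the individual symbols
		return "".join(table[s] for s in symbols)

	def decode(root, bits):
		# walk the tree; emit a symbol and restart at the root at every leaf
		out, node = [], root
		for bit in bits:
			node = node.left if bit == "0" else node.right
			if node.symbol is not None:
				out.append(node.symbol)
				node = root
		return "".join(out)

	table = code_table(tree)
	assert encode(table, "CDE") == "01011"
	assert decode(tree, "01011") == "CDE"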

If we want to compress a sequence of symbols, we want to make sure that symbols that occur often have short encodings.

For example, suppose an input consists of 100 Cs, one D and one E. With the tree from before, it encodes to $100 \cdot 1 + 2 + 2 = 104$ bits. If C instead got one of the two-bit encodings, we would spend $100 \cdot 2 = 200$ bits on the Cs alone.

So, how we construct our Huffman tree will depend on how often the symbols occur. To capture this idea, we define the frequency $f(s)$ of a symbol $s \in A$. You can think of the frequency of a symbol as a weight that encodes how often the symbol occurs. So if $f(s)$ is approximately twice $f(t)$, we would expect to see $s$ approximately twice as often as $t$.

Using the frequency of the symbols, we can now define the cost $C(E)$ of an encoding $E$ for an alphabet $A$:

$$C(E) = \sum_{s \in A} f(s) \cdot |E(s)|.$$

If we have an input file that is a sequence of symbols, we can set the frequency of a symbol to the number of times this symbol occurs in the input file. This way, the cost will equal the length of the encoded input file, counted in bits.
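As an illustration, here is a small Python sketch (the helper name `cost` and the example data are mine) that derives frequencies from an input string and checks that the cost equals the length of the encoded input:

	from collections import Counter

	def cost(freq, table):
		# sum of frequency times encoding length, over all symbols
		return sum(freq[s] * len(table[s]) for s in freq)

	data = "CDECCC"
	freq = Counter(data)  # frequency = number of occurrences in the input
	table = {"C": "0", "D": "10", "E": "11"}  # the code from the example tree

	encoded = "".join(table[s] for s in data)
	print(cost(freq, table), len(encoded))  # 8 8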

Definition: An encoding $E$ is optimal if it minimizes the cost $C(E)$. That is, there is no other encoding $E^*$ with $C(E^*) < C(E)$. We call a Huffman tree optimal if the encoding it produces is optimal.

As stated before, we'd like symbols that occur often to have shorter encodings. We can phrase this as a principle: when the frequency of a symbol $s$ is higher than the frequency of another symbol $t$, the encoding of $s$ should not be longer than the encoding of $t$:

$$f(s) > f(t) \implies |E(s)| \le |E(t)|.$$

Suppose that we have a Huffman tree $T$ that violates this principle: we have some $s, t \in A$ with $f(s) > f(t)$ and $|E(s)| > |E(t)|$. Now consider the tree $T'$ obtained by switching the nodes associated to $s$ and $t$, and write $E'$ for the encoding it produces. Then

$$E'(s) = E(t) \quad \text{and} \quad E'(t) = E(s),$$

and

$$E'(u) = E(u) \quad \text{for any } u \in A \text{ with } u \notin \{s, t\}.$$

From this, we see

$$C(E) - C(E') = f(s) \cdot |E(s)| + f(t) \cdot |E(t)| - f(s) \cdot |E'(s)| - f(t) \cdot |E'(t)|.$$

Now consider the expression on the right hand side. Since we switched the encodings of $s$ and $t$ in $T'$, we have

$$C(E) - C(E') = f(s) \cdot |E(s)| + f(t) \cdot |E(t)| - f(s) \cdot |E(t)| - f(t) \cdot |E(s)|.$$

Setting $\Delta f = f(s) - f(t)$ and $\Delta\ell = |E(s)| - |E(t)|$, we see that

$$C(E) - C(E') = \Delta f \cdot \Delta\ell.$$

Since we assumed that $f(s) > f(t)$ and $|E(s)| > |E(t)|$, we see that both $\Delta f$ and $\Delta\ell$ are positive. So $C(E) - C(E')$ is positive as well, and we see that $C(E') < C(E)$. In other words, a tree that violates the principle cannot be optimal. This proves the following theorem.

Theorem: If $T$ is an optimal Huffman tree for an alphabet $A$, then for any $s, t \in A$ with $f(s) > f(t)$ we have $|E(s)| \le |E(t)|$.
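As a quick sanity check with made-up numbers: if $f(s) = 3$, $f(t) = 1$, $|E(s)| = 2$ and $|E(t)| = 1$, then the swap saves $C(E) - C(E') = (3 - 1) \cdot (2 - 1) = 2$ bits.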

Unfortunately, this principle on its own is not enough to build an optimal tree.

For example, take an alphabet of four symbols that all have frequency 1. Both a balanced tree where every leaf has depth 2 and a lopsided full tree with leaf depths 1, 2, 3 and 3 satisfy the principle, since all frequencies are equal. But the balanced tree has cost $4 \cdot 2 = 8$, while the lopsided one has cost $1 + 2 + 3 + 3 = 9$.

So, we need to deepen our understanding a bit before we can see how we can come up with an optimal Huffman tree.

By the principle we discovered earlier, we can deduce that in an optimal tree, the node associated to a symbol $s$ with the lowest frequency must be a leaf with maximum depth. Now, since a Huffman tree is a full binary tree, this node must have a sibling, and this sibling must also be a leaf: any child of the sibling would be even deeper. Let's say the sibling is associated to the symbol $t$. Now, this sibling is as deep as the node associated to $s$, so surely, $t$ must also have a very low frequency if the tree is optimal.

In fact, there is an optimal tree where the sibling is associated to the symbol in $A \setminus \{s\}$ with lowest frequency. We can see this as follows: suppose $T$ is an optimal tree where the sibling of $s$'s node does not have the minimum frequency in $A \setminus \{s\}$. Then we can apply the same trick as before: let $t'$ be the symbol in $A \setminus \{s\}$ with lowest frequency. We create another tree $T'$ by switching the nodes associated to $t$ and $t'$. As before, $C(E) - C(E') = (f(t) - f(t')) \cdot (|E(t)| - |E(t')|)$, and both factors are nonnegative, since $t'$ has minimal frequency in $A \setminus \{s\}$ and the node associated to $t$ has maximum depth. We see that $C(E') \le C(E)$, so $T'$ is optimal.

Theorem: Let $A$ be an alphabet of at least two symbols and let $s, t \in A$ be the two symbols with lowest frequency. Then there exists an optimal Huffman tree where the nodes associated to $s$ and $t$ are siblings and have maximum depth.

Now, we're getting closer to an algorithm for generating an optimal Huffman tree. We're looking for a Huffman tree containing a particular subtree that has two leaves, associated to the symbols $s$ and $t$ with lowest frequency. From this, we know that for the optimal tree we're looking for, we have $|E(s)| = |E(t)|$, since the nodes associated to $s$ and $t$ are at the same depth.

Now, we take a look at the computation of the cost function. Using that $|E(s)| = |E(t)|$, we see that we can write

$$C(E) = \sum_{u \in A \setminus \{s, t\}} f(u) \cdot |E(u)| \;+\; (f(s) + f(t)) \cdot (|E(s)| - 1) \;+\; (f(s) + f(t)).$$

The first two terms are exactly the cost of the tree in which the root of the subtree is replaced by a leaf node with frequency $f(s) + f(t)$: that leaf sits at depth $|E(s)| - 1$. The last term, $f(s) + f(t)$, does not depend on the shape of the tree at all. In other words, the cost computation behaves identically to the situation where we replace the root of the subtree with a leaf node with frequency $f(s) + f(t)$. This implies the following theorem.

Theorem: Let $T$ be a Huffman tree containing a subtree $S$ whose root has exactly two leaf children, with frequencies $f(s)$ and $f(t)$, and let $T'$ be the Huffman tree obtained by replacing $S$ with a single leaf node that has frequency $f(s) + f(t)$. If $T'$ is optimal, then $T$ is optimal.
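As a quick check with the frequencies from the CDECCC example above, $f(C) = 4$ and $f(D) = f(E) = 1$: the full tree costs $4 \cdot 1 + 1 \cdot 2 + 1 \cdot 2 = 8$, while the reduced tree, in which the subtree above D and E becomes a single leaf with frequency $2$ at depth $1$, costs $4 \cdot 1 + 2 \cdot 1 = 6 = 8 - (f(D) + f(E))$.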

Algorithm: Huffman coding

Let $A$ be an alphabet, and let $f(s)$ be the frequency of $s$ for any $s \in A$.

function huffman(A, f)
	# treat every symbol s in A as a leaf node with frequency f(s)
	while |A| > 1
		let s, t be the two nodes in A with minimal frequency

		u = node(left: s, right: t)
		u.frequency = s.frequency + t.frequency

		A = A - {s, t}
		A = A + {u}

	# the single node left in A is the root of the Huffman tree
	let root be the remaining element of A
	return root
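For reference, here is a runnable sketch of the same algorithm in Python; the representation (leaves as `(symbol,)` tuples, internal nodes as `(left, right)` pairs) and the use of `heapq` to find the two minimal-frequency nodes are my own choices, not part of the pseudocode above:

	import heapq
	from itertools import count

	def huffman(freq):
		# Build a Huffman tree from a {symbol: frequency} dict.
		# Leaves are (symbol,) tuples; internal nodes are (left, right) pairs.
		tiebreak = count()  # keeps the heap from ever comparing two nodes directly
		heap = [(f, next(tiebreak), (s,)) for s, f in freq.items()]
		heapq.heapify(heap)
		while len(heap) > 1:
			# pop the two nodes with minimal frequency ...
			fs, _, s = heapq.heappop(heap)
			ft, _, t = heapq.heappop(heap)
			# ... and replace them with a merged node, as in the theorem above
			heapq.heappush(heap, (fs + ft, next(tiebreak), (s, t)))
		return heap[0][2]  # the single remaining node is the root

	print(huffman({"C": 4, "D": 1, "E": 1}))
	# ((('D',), ('E',)), ('C',)) -- a mirror image of the example tree, with the same cost

The tiebreaker counter matters: when two frequencies are equal, Python would otherwise try to compare the node tuples themselves, which can raise a TypeError; with the counter, ties are broken by insertion order and the result is deterministic.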