Huffman coding

If we have an alphabet of symbols where each symbol has an associated frequency, Huffman coding is an algorithm that creates an encoding of these symbols into bit strings. This encoding has a nice optimality property: it minimizes the sum, over all symbols, of the frequency of a symbol times the length of its encoding. In practice, this is useful for lossless data compression: if you assume that the symbols are independent and identically distributed, Huffman coding provides you with the optimal "fixed" encoding, one in which every occurrence of a symbol is encoded by the same bit string. In contrast, if we allow the encoding of a symbol to differ between occurrences, arithmetic coding is optimal.

Before defining Huffman trees and investigating some of their properties, we need some definitions: An alphabet $A$ is a finite set of symbols. A binary tree is full if every node has either zero or two children. The depth of a node is the number of edges on the path from the root to that node.

Now, a Huffman tree for an alphabet $A$ is a full binary tree with a one-to-one correspondence between its leaf nodes and the symbols in $A$.

By convention, we associate the left child of a node with a 0 and the right child of a node with a 1. A Huffman tree provides us with an encoding $E(s)$ of a symbol $s$. We can find it by starting at the root and following the unique downward path to the leaf node associated to $s$. We start with an empty bit string; if we take the left child node, we append a 0 to the end of the bit string, and if we take the right child node, we append a 1 to the end. From this, we see that the length $|E(s)|$ of the encoding of a symbol $s$ equals the depth of the leaf node associated to $s$.

Example: consider the tree whose root has the leaf for C as its left child, and whose right child is an internal node with the leaves for D and E as its left and right children. This is a Huffman tree for the alphabet $A = \{C, D, E\}$. We have $E(C) = 0$, $E(D) = 10$ and $E(E) = 11$.
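To make this concrete, here is a minimal sketch in Python of how one could walk a Huffman tree and read off the encodings; the `Node` class and the hard-coded example tree are illustrative assumptions, not fixed by anything above:

	from dataclasses import dataclass
	from typing import Optional

	@dataclass
	class Node:
		symbol: Optional[str] = None      # set for leaf nodes only
		left: Optional["Node"] = None
		right: Optional["Node"] = None

	def code_table(node, prefix=""):
		# map every symbol to the bit string of its root-to-leaf path
		if node.symbol is not None:       # leaf: the path is complete
			return {node.symbol: prefix}
		left = code_table(node.left, prefix + "0")    # left edge appends a 0
		right = code_table(node.right, prefix + "1")  # right edge appends a 1
		return {**left, **right}

	# The example tree: C below the root's left edge, D and E below its right child.
	tree = Node(left=Node(symbol="C"),
	            right=Node(left=Node(symbol="D"), right=Node(symbol="E")))
	print(code_table(tree))  # {'C': '0', 'D': '10', 'E': '11'}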

The encoding process defines a mapping between each node and the bit string describing the path that you need to follow from the root node to end up at this node. In this mapping, the bit string associated to a node $v$ is a prefix of the bit string associated to another node $w$ if and only if $w$ is a descendant of $v$. By our definition of a Huffman tree, only leaf nodes have symbols associated to them, and a leaf node has no descendants. So, if $T$ is a Huffman tree for an alphabet $A$ and $s$ and $t$ are different symbols from $A$, then $E(s)$ and $E(t)$ are never a prefix of each other.

Theorem: If $T$ is a Huffman tree for an alphabet $A$ and $s, t \in A$ are two different symbols, then $E(s)$ is not a prefix of $E(t)$.

Encodings that have this property are also called prefix-free codes, or simply prefix codes.

The prefix-free property allows us to decode a bit string created by concatenating encoded symbols. If we had two symbols $s$ and $t$ for which $E(s)$ was a prefix of $E(t)$, we would not know how to decode a bit string that starts with $E(t)$: it could either encode $t$ (followed by bits from other encoded symbols), or encode $s$ (followed by bits from other encoded symbols).

Example: with the tree from before, the string CDE is encoded as the concatenation of $E(C) = 0$, $E(D) = 10$ and $E(E) = 11$, giving the bit string 01011. To decode it, we start at the root and follow the left or right child for every 0 or 1 we read; every time we reach a leaf, we output its symbol and jump back to the root.
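Continuing the previous Python sketch (same hypothetical `Node`, `code_table` and `tree`), encoding and decoding could look like this:

	def encode(table, symbols):
		# concatenate the encodings of the individual symbols
		return "".join(table[s] for s in symbols)

	def decode(root, bits):
		# walk the tree; emit a symbol and restart at the root at every leaf
		out, node = [], root
		for bit in bits:
			node = node.left if bit == "0" else node.right
			if node.symbol is not None:
				out.append(node.symbol)
				node = root
		return "".join(out)

	table = code_table(tree)
	assert encode(table, "CDE") == "01011"
	assert decode(tree, "01011") == "CDE"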

If we want to compress a sequence of symbols, we want to make sure that symbols that occur often have short encodings.

For example, suppose an input consists of 100 Cs, one D and one E. With the tree from before, it encodes to $100 \cdot 1 + 2 + 2 = 104$ bits. If C instead got one of the two-bit encodings, we would spend $100 \cdot 2 = 200$ bits on the Cs alone.

So, how we construct our Huffman tree will depend on how often the symbols occur. To capture this idea, we define the frequency $f(s)$ of a symbol $s \in A$. You can think of the frequency of a symbol as a weight that encodes how often the symbol occurs. So if $f(s)$ is approximately twice $f(t)$, we would expect to see $s$ approximately twice as often as $t$.

Using the frequency of the symbols, we can now define the cost $C(E)$ of an encoding $E$ for an alphabet $A$:

$$C(E) = \sum_{s \in A} f(s) \cdot |E(s)|.$$

If we have an input file that is a sequence of symbols, we can set the frequency of a symbol to the number of times this symbol occurs in the input file. This way, the cost will equal the length of the encoded input file, counted in bits.
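As an illustration, here is a small Python sketch (the helper name `cost` and the example data are mine) that derives frequencies from an input string and checks that the cost equals the length of the encoded input:

	from collections import Counter

	def cost(freq, table):
		# sum of frequency times encoding length, over all symbols
		return sum(freq[s] * len(table[s]) for s in freq)

	data = "CDECCC"
	freq = Counter(data)  # frequency = number of occurrences in the input
	table = {"C": "0", "D": "10", "E": "11"}  # the code from the example tree

	encoded = "".join(table[s] for s in data)
	print(cost(freq, table), len(encoded))  # 8 8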

Definition: An encoding $E$ is optimal if it minimizes the cost $C(E)$. That is, there is no other encoding $E^*$ with $C(E^*) < C(E)$. We call a Huffman tree optimal if the encoding it produces is optimal.

As stated before, we'd like symbols that occur often to have shorter encodings. We can phrase this as a principle: when the frequency of a symbol $s$ is higher than the frequency of another symbol $t$, the encoding of $s$ should not be longer than the encoding of $t$:

$$f(s) > f(t) \implies |E(s)| \le |E(t)|.$$

Suppose that we have a Huffman tree $T$ that violates this principle: we have some $s, t \in A$ with $f(s) > f(t)$ and $|E(s)| > |E(t)|$. Now consider the tree $T'$ obtained by switching the nodes associated to $s$ and $t$, and write $E'$ for the encoding it produces. Then

$$E'(s) = E(t) \quad \text{and} \quad E'(t) = E(s),$$

and

$$E'(u) = E(u) \quad \text{for any } u \in A \text{ with } u \notin \{s, t\}.$$

From this, we see

$$C(E) - C(E') = f(s) \cdot |E(s)| + f(t) \cdot |E(t)| - f(s) \cdot |E'(s)| - f(t) \cdot |E'(t)|.$$

Now consider the expression on the right hand side. Since we switched the encodings of $s$ and $t$ in $T'$, we have

$$C(E) - C(E') = f(s) \cdot |E(s)| + f(t) \cdot |E(t)| - f(s) \cdot |E(t)| - f(t) \cdot |E(s)|.$$

Setting $\Delta f = f(s) - f(t)$ and $\Delta\ell = |E(s)| - |E(t)|$, we see that

$$C(E) - C(E') = \Delta f \cdot \Delta\ell.$$

Since we assumed that $f(s) > f(t)$ and $|E(s)| > |E(t)|$, we see that both $\Delta f$ and $\Delta\ell$ are positive. So $C(E) - C(E')$ is positive as well, and we see that $C(E') < C(E)$. In other words, a tree that violates the principle cannot be optimal. This proves the following theorem.

Theorem: If $T$ is an optimal Huffman tree for an alphabet $A$, then for any $s, t \in A$ with $f(s) > f(t)$ we have $|E(s)| \le |E(t)|$.
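As a quick sanity check with made-up numbers: if $f(s) = 3$, $f(t) = 1$, $|E(s)| = 2$ and $|E(t)| = 1$, then the swap saves $C(E) - C(E') = (3 - 1) \cdot (2 - 1) = 2$ bits.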

Unfortunately, this principle on its own is not enough to build an optimal tree.

For example, take an alphabet of four symbols that all have frequency 1. Both a balanced tree where every leaf has depth 2 and a lopsided full tree with leaf depths 1, 2, 3 and 3 satisfy the principle, since all frequencies are equal. But the balanced tree has cost $4 \cdot 2 = 8$, while the lopsided one has cost $1 + 2 + 3 + 3 = 9$.

So, we need to deepen our understanding a bit before we can see how we can come up with an optimal Huffman tree.

By the principle we discovered earlier, we can deduce that in an optimal tree, the node associated to a symbol $s$ with the lowest frequency must be a leaf with maximum depth. Now, since a Huffman tree is a full binary tree, this node must have a sibling, and this sibling must also be a leaf: any child of the sibling would be even deeper. Let's say the sibling is associated to the symbol $t$. Now, this sibling is as deep as the node associated to $s$, so surely, $t$ must also have a very low frequency if the tree is optimal.

In fact, there is an optimal tree where the sibling is associated to the symbol in $A \setminus \{s\}$ with lowest frequency. We can see this as follows: suppose $T$ is an optimal tree where the sibling of $s$'s node does not have the minimum frequency in $A \setminus \{s\}$. Then we can apply the same trick as before: let $t'$ be the symbol in $A \setminus \{s\}$ with lowest frequency. We create another tree $T'$ by switching the nodes associated to $t$ and $t'$. As before, $C(E) - C(E') = (f(t) - f(t')) \cdot (|E(t)| - |E(t')|)$, and both factors are nonnegative, since $t'$ has minimal frequency in $A \setminus \{s\}$ and the node associated to $t$ has maximum depth. We see that $C(E') \le C(E)$, so $T'$ is optimal.

Theorem: Let $A$ be an alphabet of at least two symbols and let $s, t \in A$ be the two symbols with lowest frequency. Then there exists an optimal Huffman tree where the nodes associated to $s$ and $t$ are siblings and have maximum depth.

Now, we're getting closer to an algorithm for generating an optimal Huffman tree. We're looking for a Huffman tree containing a particular subtree that has two leaves, associated to the symbols $s$ and $t$ with lowest frequency. From this, we know that for the optimal tree we're looking for, we have $|E(s)| = |E(t)|$, since the nodes associated to $s$ and $t$ are at the same depth.

Now, we take a look at the computation of the cost function. Using that $|E(s)| = |E(t)|$, we see that we can write

$$C(E) = \sum_{u \in A \setminus \{s, t\}} f(u) \cdot |E(u)| \;+\; (f(s) + f(t)) \cdot (|E(s)| - 1) \;+\; (f(s) + f(t)).$$

The first two terms are exactly the cost of the tree in which the root of the subtree is replaced by a leaf node with frequency $f(s) + f(t)$: that leaf sits at depth $|E(s)| - 1$. The last term, $f(s) + f(t)$, does not depend on the shape of the tree at all. In other words, the cost computation behaves identically to the situation where we replace the root of the subtree with a leaf node with frequency $f(s) + f(t)$. This implies the following theorem.

Theorem: Let $T$ be a Huffman tree containing a subtree $S$ whose root has exactly two leaf children, with frequencies $f(s)$ and $f(t)$, and let $T'$ be the Huffman tree obtained by replacing $S$ with a single leaf node that has frequency $f(s) + f(t)$. If $T'$ is optimal, then $T$ is optimal.
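As a quick check with the frequencies from the CDECCC example above, $f(C) = 4$ and $f(D) = f(E) = 1$: the full tree costs $4 \cdot 1 + 1 \cdot 2 + 1 \cdot 2 = 8$, while the reduced tree, in which the subtree above D and E becomes a single leaf with frequency $2$ at depth $1$, costs $4 \cdot 1 + 2 \cdot 1 = 6 = 8 - (f(D) + f(E))$.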

Algorithm: Huffman coding

Let $A$ be an alphabet, and let $f(s)$ be the frequency of $s$ for any $s \in A$.

function huffman(A, f)
	# treat every symbol s in A as a leaf node with frequency f(s)
	while |A| > 1
		let s, t be the two nodes in A with minimal frequency

		u = node(left: s, right: t)
		u.frequency = s.frequency + t.frequency

		A = A - {s, t}
		A = A + {u}

	# the single node left in A is the root of the Huffman tree
	let root be the remaining element of A
	return root
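For reference, here is a runnable sketch of the same algorithm in Python; the representation (leaves as `(symbol,)` tuples, internal nodes as `(left, right)` pairs) and the use of `heapq` to find the two minimal-frequency nodes are my own choices, not part of the pseudocode above:

	import heapq
	from itertools import count

	def huffman(freq):
		# Build a Huffman tree from a {symbol: frequency} dict.
		# Leaves are (symbol,) tuples; internal nodes are (left, right) pairs.
		tiebreak = count()  # keeps the heap from ever comparing two nodes directly
		heap = [(f, next(tiebreak), (s,)) for s, f in freq.items()]
		heapq.heapify(heap)
		while len(heap) > 1:
			# pop the two nodes with minimal frequency ...
			fs, _, s = heapq.heappop(heap)
			ft, _, t = heapq.heappop(heap)
			# ... and replace them with a merged node, as in the theorem above
			heapq.heappush(heap, (fs + ft, next(tiebreak), (s, t)))
		return heap[0][2]  # the single remaining node is the root

	print(huffman({"C": 4, "D": 1, "E": 1}))
	# ((('D',), ('E',)), ('C',)) -- a mirror image of the example tree, with the same cost

The tiebreaker counter matters: when two frequencies are equal, Python would otherwise try to compare the node tuples themselves, which can raise a TypeError; with the counter, ties are broken by insertion order and the result is deterministic.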