Huffman coding

Motivation

Compression (a short video clip on text compression): unlike ASCII or Unicode encodings, which use the same number of bits to encode every character, a Huffman code uses different numbers of bits to encode the letters: more bits for rare letters and fewer bits for common letters.
Further compression can be achieved at the word level and beyond.
Image compression/music compression/text compression/genomic sequence compression (see a paper about genome compression)

Definition

Huffman coding assigns codes to characters such that the length of the code depends on the relative frequency or weight of the corresponding character. Huffman codes are variable-length and prefix-free (no code is a prefix of any other). Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves.

A Huffman coding tree or Huffman tree is a full binary tree in which each leaf of the tree corresponds to a letter in the given alphabet.

Define the weighted path length of a leaf to be its weight times its depth.
The Huffman tree is the binary tree with minimum external path weight, i.e., the one with the minimum sum of weighted path lengths for the given set of leaves.
So the goal is to build a tree with the minimum external path weight.

See an example below:

Letter frequency table
Letter     Z   K   M   C   U   D   L    E
Frequency  2   7  24  32  37  42  42  120

Huffman code
Letter  Freq  Code    Bits
E        120  0       1
D         42  101     3
L         42  110     3
U         37  100     3
C         32  1110    4
M         24  11111   5
K          7  111101  6
Z          2  111100  6
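
Using the code lengths in this table as leaf depths, the external path weight of the Huffman tree below is 120×1 + 42×3 + 42×3 + 37×3 + 32×4 + 24×5 + 7×6 + 2×6 = 785. Dividing by the total frequency (306) gives an average of about 2.57 bits per letter, compared with 3 bits per letter for a fixed-length code over these 8 letters.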

The Huffman tree (Shaffer Fig. 5.24)

Three problems

Problem 2: Encoding

Encoding a string can be done by replacing each letter in the string with its binary code (the Huffman code).

Examples:
DEED -> 10100101 (8 bits)
MUCK -> 111111001110111101 (18 bits)
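
As a concrete sketch (not the course's HuffTree.java), the following assumes the codes from the table above have been collected into a letter-to-code map; encoding simply concatenates the code of each letter:

  import java.util.Map;

  public class HuffEncodeDemo {
      // Encode a string by replacing each letter with its Huffman code.
      static String encode(String text, Map<Character, String> codes) {
          StringBuilder bits = new StringBuilder();
          for (char c : text.toCharArray()) {
              bits.append(codes.get(c));
          }
          return bits.toString();
      }

      public static void main(String[] args) {
          // The code table from above (assumed precomputed).
          Map<Character, String> codes = Map.of(
              'E', "0", 'D', "101", 'L', "110", 'U', "100",
              'C', "1110", 'M', "11111", 'K', "111101", 'Z', "111100");
          System.out.println(encode("DEED", codes)); // 10100101
          System.out.println(encode("MUCK", codes)); // 111111001110111101
      }
  }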

Problem 3: Decoding

Decoding an encoded string can be done by scanning the bits in the coded string from left to right until a letter is decoded, then starting over with the next bit.
10100101 -> DEED
The Huffman codes are prefix-free: a set of codes is said to meet the prefix property if no code in the set is the prefix of another. The prefix property guarantees that there will be no ambiguity in how a bit string is decoded.
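
A matching decoding sketch (again an illustration, not the course code): because the codes are prefix-free, we can scan the bits left to right and emit a letter as soon as the accumulated bits match some code:

  import java.util.HashMap;
  import java.util.Map;

  public class HuffDecodeDemo {
      // Decode a bit string by scanning left to right; since the codes are
      // prefix-free, the first code matched by the accumulated bits is correct.
      static String decode(String bits, Map<Character, String> codes) {
          Map<String, Character> byCode = new HashMap<>();   // code -> letter
          for (Map.Entry<Character, String> e : codes.entrySet()) {
              byCode.put(e.getValue(), e.getKey());
          }
          StringBuilder out = new StringBuilder();
          StringBuilder current = new StringBuilder();
          for (char bit : bits.toCharArray()) {
              current.append(bit);
              Character letter = byCode.get(current.toString());
              if (letter != null) {        // a complete code was read
                  out.append(letter);
                  current.setLength(0);    // start the next code
              }
          }
          return out.toString();
      }

      public static void main(String[] args) {
          Map<Character, String> codes = Map.of(
              'E', "0", 'D', "101", 'L', "110", 'U', "100",
              'C', "1110", 'M', "11111", 'K', "111101", 'Z', "111100");
          System.out.println(decode("10100101", codes)); // DEED
      }
  }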

Problem 1: Huffman tree building

A simple algorithm (buildHuff):

  1. Prepare a collection of n initial Huffman trees, each of which is a single leaf node. Put the n trees onto a priority queue organized by weight (frequency).
  2. Remove the first two trees (the ones with lowest weight). Join these two trees to create a new tree whose root has the two trees as children, and whose weight is the sum of the weights of the two children trees.
  3. Put this new tree into the priority queue.
  4. Repeat steps 2-3 until all of the partial Huffman trees have been combined into one.

It's a greedy algorithm: at each iteration, it makes a "greedy" decision to merge the two subtrees with the least weight. Does it give the desired result?
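
A minimal sketch of the algorithm above, using Java's PriorityQueue and a simple hypothetical Node class (this is not the course's HuffTree.java):

  import java.util.PriorityQueue;

  public class BuildHuffDemo {
      static class Node {
          char letter;          // meaningful only at leaves
          int weight;
          Node left, right;
          Node(char letter, int weight) { this.letter = letter; this.weight = weight; }
          Node(Node left, Node right) { // internal node: weight = sum of children
              this.left = left;
              this.right = right;
              this.weight = left.weight + right.weight;
          }
      }

      static Node buildHuff(char[] letters, int[] freqs) {
          // Step 1: one single-leaf tree per letter, ordered by weight.
          PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.weight - b.weight);
          for (int i = 0; i < letters.length; i++) {
              pq.add(new Node(letters[i], freqs[i]));
          }
          // Steps 2-4: repeatedly merge the two lightest trees.
          while (pq.size() > 1) {
              Node first = pq.poll();
              Node second = pq.poll();
              pq.add(new Node(first, second));
          }
          return pq.poll();     // root of the Huffman tree
      }

      public static void main(String[] args) {
          char[] letters = {'Z', 'K', 'M', 'C', 'U', 'D', 'L', 'E'};
          int[] freqs    = { 2,   7,  24,  32,  37,  42,  42, 120};
          Node root = buildHuff(letters, freqs);
          System.out.println("total weight = " + root.weight); // 306
      }
  }

Note that ties (e.g., D and L both have weight 42) can be broken either way, so the resulting tree may differ in shape from the figure above while still having the minimum external path weight.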

Implementation

See an implementation on GitHub (HuffTree.java); note that the getCode function is to be implemented by students.

For the bored: compression algorithms, lossless vs. lossy (image taken from https://www.cs.cmu.edu/~guyb/realworld/compression.pdf)