Data Structures: Algorithms for Points, Strings, and Compression

Given points in a plane, we can construct a geometric graph. Its vertices correspond to the points, and each edge has a weight equal to the distance between its endpoints. (We might have all “n choose 2” possible edges, or maybe a sparser subset of edges.)

  • The EMST (Euclidean Minimum Spanning Tree): Given n points in the plane, find the lightest possible spanning tree made up of such edges.

We go over the definition of the Delaunay triangulation (saying which edges are present), and also see a demo here: share/0326/delaunay/
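
Here is a minimal sketch of the brute-force EMST approach: run Prim's algorithm directly on the complete graph, in O(n^2) time, with no explicit edge list. The class and method names are made up for illustration; running an MST algorithm on the (sparse) Delaunay edges instead would be faster.

  import java.awt.geom.Point2D;

  public class PointMST {
      // Prim's algorithm on the complete geometric graph: O(n^2) time,
      // O(n) space, no edge list needed. Returns parent[i] = index of
      // i's neighbor in the MST (parent[0] = -1 for the root).
      public static int[] emst(Point2D[] p) {
          int n = p.length;
          double[] dist = new double[n];     // cheapest known edge into the tree
          int[] parent = new int[n];
          boolean[] inTree = new boolean[n];
          java.util.Arrays.fill(dist, Double.POSITIVE_INFINITY);
          dist[0] = 0.0;
          parent[0] = -1;
          for (int step = 0; step < n; step++) {
              int u = -1;                    // cheapest vertex not yet in the tree
              for (int v = 0; v < n; v++)
                  if (!inTree[v] && (u == -1 || dist[v] < dist[u])) u = v;
              inTree[u] = true;
              for (int v = 0; v < n; v++) {  // relax the edges out of u
                  double d = p[u].distance(p[v]);
                  if (!inTree[v] && d < dist[v]) { dist[v] = d; parent[v] = u; }
              }
          }
          return parent;
      }
  }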

This argument is not final exam review material.

Two pictures supporting the program part of homework 4:

  • The ETSP (Euclidean Traveling Salesperson Problem): Given n points in the plane, find a shortest cyclic tour through all the points.

Quadtree

Basically, it is a 4-ary tree data structure, useful for searching for “keys” which represent points in the plane. Each node in the tree stores a point, and its subtree covers all points which fall in some rectangular “bounding box” of the plane. The four subtrees under a node correspond to the four “quadrants” around that node’s point.

Since we build the quadtree in a naive way (without rebalancing), we get good depth, expected O(log n) rather than worst-case O(n), if we insert the n points in a random order. Just like we observed earlier for binary search trees, random insertion order helps.
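
A minimal point-quadtree sketch, just the insert (the names here are my own, not from the course code):

  public class QuadTree {
      private Node root;

      private static class Node {
          double x, y;              // the point stored at this node
          Node nw, ne, sw, se;      // the four quadrants around (x, y)
          Node(double x, double y) { this.x = x; this.y = y; }
      }

      // No rebalancing: depth is expected O(log n) only if the
      // points are inserted in random order.
      public void insert(double x, double y) {
          root = insert(root, x, y);
      }

      private Node insert(Node h, double x, double y) {
          if (h == null) return new Node(x, y);
          if      (x <  h.x && y <  h.y) h.sw = insert(h.sw, x, y);
          else if (x <  h.x)             h.nw = insert(h.nw, x, y);
          else if (y <  h.y)             h.se = insert(h.se, x, y);
          else                           h.ne = insert(h.ne, x, y);
          return h;
      }
  }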

Similar algorithms (in other books): bucket sort, counting sort.

  • Suppose we are sorting strings lexicographically: compare two strings by their first character, and if those are equal, then compare their second character, and so on. We can reduce string sorting to a sequence of small-integer sorting problems (where the characters are the small integers).
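
For example, here is key-indexed counting (the book's name for counting sort) on one character position d; this is the stable small-integer sort used inside LSD and MSD string sort. The helper name is mine:

  // Stable sort of a[] by the character at position d, using
  // key-indexed counting over an alphabet of R = 256 characters.
  // Assumes every string has length at least d+1.
  static void sortByChar(String[] a, int d) {
      int R = 256, n = a.length;
      String[] aux = new String[n];
      int[] count = new int[R + 1];
      for (String s : a) count[s.charAt(d) + 1]++;          // frequencies
      for (int r = 0; r < R; r++) count[r + 1] += count[r]; // cumulates
      for (String s : a) aux[count[s.charAt(d)]++] = s;     // distribute
      System.arraycopy(aux, 0, a, 0, n);                    // copy back
  }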

3-Way Radix Quicksort

3-way radix quicksort (see also ../book/quick3string.java): it saves
space compared to MSD string sort, since it needs no extra array, and
it performs particularly well when the actual alphabet is small, like
in our pi10.txt example. (A sketch of the recursion appears after the
two lists below.)

Nice features (from the slides; the analysis is similar to quicksort's):

  • On random strings, expect to use ~ 2 n ln n char comparisons
  • Beats quicksort, especially for strings with long common prefixes

Drawbacks of 3-way radix quicksort:

  • Not stable, because of the quicksort-like partition method.
  • Uses O(w*r) extra space (recursion stack) in the unlikely worst case.
  • The slides say O(w + log n), which is true in expectation
    if you pick pivots randomly (or pre-shuffle the input).
  • Like quicksort, it ought to pick random pivots, to avoid
    the (unlikely but possible) bad worst-case behavior.
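
Here is a sketch of the recursion, loosely following the book's quick3string.java (the real version shuffles first and switches to insertion sort on small subarrays):

  // 3-way radix quicksort of a[lo..hi], starting at character d.
  // charAt returns -1 past the end of a string, so shorter strings
  // sort before their extensions.
  private static int charAt(String s, int d) {
      return d < s.length() ? s.charAt(d) : -1;
  }

  private static void sort(String[] a, int lo, int hi, int d) {
      if (hi <= lo) return;
      int lt = lo, gt = hi, i = lo + 1;
      int v = charAt(a[lo], d);             // pivot character
      while (i <= gt) {
          int t = charAt(a[i], d);
          if      (t < v) swap(a, lt++, i++);
          else if (t > v) swap(a, i, gt--);
          else            i++;
      }
      sort(a, lo, lt - 1, d);               // smaller at character d
      if (v >= 0) sort(a, lt, gt, d + 1);   // equal: recurse on character d+1
      sort(a, gt + 1, hi, d);               // larger at character d
  }

  private static void swap(String[] a, int i, int j) {
      String t = a[i]; a[i] = a[j]; a[j] = t;
  }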

Tries

     Deletion (just the idea)
     Iteration
        Recursive, need "prefix" string for current sub-trie
        The order is preorder (prefixes first), not inorder
        Our "stackiterator" approach could be adapted here
     Prefix matching (find all keys with a given prefix; see the
        sketch after this outline)
     Longest-prefix-of (find 1 key)
     Patricia trie:
        Compresses out 1-way branching (now a substring per node)
        Works well with a small alphabet (e.g. binary or hexadecimal)
        More space efficient, especially for long common prefixes
        Faster initial key search: skip to next "forking" character
        We will skip implementation, but know how to read such a trie
     Suffix tree:
        Patricia of all suffixes, the "bananas" example
        Note that "banana" would be trickier, since some suffixes (na)
        are prefixes of others (nana). A fix here is to add a unique
        terminator char ("banana0") then each suffix is a leaf.
         In fact O(n) space (although it looks like quadratic space):
           it has O(n) nodes, and O(1) space per node, since each
           edge substring can be stored as a pair of indices into the text.
     There is a tricky O(n) time algorithm (Ukkonen's) to construct
     the suffix tree of a given text, but we will only assert this.
     The suffix tree is slightly more general than the suffix array,
     and has similar applications (recall: "Manber's algorithm").
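
Here is a minimal R-way trie sketch for the prefix-matching operation above; it also shows the recursive preorder iteration with an explicit "prefix" StringBuilder for the current sub-trie. Names loosely follow the book's TrieST, but this stores keys only (no values):

  import java.util.ArrayList;
  import java.util.List;

  public class Trie {
      private static final int R = 256;      // extended-ASCII alphabet

      private static class Node {
          boolean isKey;                     // does a key end here?
          Node[] next = new Node[R];
      }

      private Node root = new Node();

      public void put(String key) {
          Node x = root;
          for (int d = 0; d < key.length(); d++) {
              char c = key.charAt(d);
              if (x.next[c] == null) x.next[c] = new Node();
              x = x.next[c];
          }
          x.isKey = true;
      }

      // All keys starting with the given prefix, in preorder
      // (which is sorted order for a trie).
      public List<String> keysWithPrefix(String prefix) {
          List<String> results = new ArrayList<>();
          Node x = root;
          for (int d = 0; d < prefix.length(); d++) {  // walk down to the prefix
              x = x.next[prefix.charAt(d)];
              if (x == null) return results;
          }
          collect(x, new StringBuilder(prefix), results);
          return results;
      }

      // Recursive preorder traversal; sb holds the prefix of this sub-trie.
      private void collect(Node x, StringBuilder sb, List<String> results) {
          if (x.isKey) results.add(sb.toString());     // prefixes first
          for (char c = 0; c < R; c++) {
              if (x.next[c] != null) {
                  sb.append(c);
                  collect(x.next[c], sb, results);
                  sb.deleteCharAt(sb.length() - 1);    // undo before backtracking
              }
          }
      }
  }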

Huffman and LZW Running Times

Both Huffman and LZW, for both compression and expansion, run in
near-linear time in the size n of the original text, assuming you
choose reasonable data structures (arrays, tries, etc). In particular,
Huffman compression uses a priority queue to build the binary trie on
r leaves, so its running time (with a binary heap) is O(n + r lg r).
LZW needs a code table of size 2^w, where w is the "bitsize" we choose
for its codewords (typically w ranges from 12 to 16, so 2^w is not huge).
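
A sketch of the Huffman trie-building step with java.util.PriorityQueue (a binary heap), which is where the r lg r term comes from; the Node class is modeled on the book's:

  import java.util.PriorityQueue;

  public class Huffman {
      static class Node implements Comparable<Node> {
          char ch;                 // meaningful at leaves only
          int freq;                // subtree frequency (the merge weight)
          Node left, right;
          Node(char ch, int freq, Node left, Node right) {
              this.ch = ch; this.freq = freq;
              this.left = left; this.right = right;
          }
          public int compareTo(Node that) { return this.freq - that.freq; }
      }

      // Build the Huffman trie from character frequencies:
      // O(r) inserts plus O(r) merges, each O(lg r) on a binary heap.
      static Node buildTrie(int[] freq) {
          PriorityQueue<Node> pq = new PriorityQueue<>();
          for (char c = 0; c < freq.length; c++)
              if (freq[c] > 0) pq.add(new Node(c, freq[c], null, null));
          while (pq.size() > 1) {          // repeatedly merge the two lightest
              Node a = pq.poll(), b = pq.poll();
              pq.add(new Node('\0', a.freq + b.freq, a, b));
          }
          return pq.poll();
      }
  }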

Example: Huffman Coding

4 (Huffman)

4(a)
First find frequency f_c for each character c:

c  f_c
T  3
O  5
M  1
G  2
!  1

Now do the Huffman merges: first M+! (weight 1+1=2), then (M+!)+G
(weight 2+2=4), then T+((M+!)+G) (weight 3+4=7), and finally O joins
at the root (weight 5+7=12). I cannot draw the trie here, but it has
this structure:
  (O+(T+((M+!)+G)))
Of course, you could make the left/right choices differently.
Using 0 to step left and 1 to step right, I get this codeword
table (for each character c):

c  code
T  10
O  0
M  1100
G  111
!  1101

4(b)
We can now encode S = "TOMGOTOTOGO!":

T   O  M     G    O  T   O  T   O  G    O  !
10  0  1100  111  0  10  0  10  0  111  0  1101

That is 1001100111010010011101101, i.e. 25 bits. In fact we could
have worked that out as the sum, over c, of f_c * l_c, where l_c is
the length of the code for c: 3*2 + 5*1 + 1*4 + 2*3 + 1*4 = 25.

The trie can be encoded, as described in the book,
with 1 bit per node (9 nodes, internal or leaf)
plus 8 bits per leaf (5 leaves, each holding a char).
So: 9+5*8 = 49 bits to encode the trie.

Example: LZW Compression

5 (LZW)

5(a)
Sending "ABABABAABAAB" = "A B AB ABA ABAA B":
send  41 = A
 add  81 = AB
send  42 = B
 add  82 = BA
send  81 = AB
 add  83 = ABA
send  83 = ABA
 add  84 = ABAA
send  84 = ABAA
 add  85 = ABAAB
send  42 = B
send  80 = end-of-file
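
A compression sketch reproducing the trace above. For clarity the code table here is a HashMap keyed by strings (the book uses a trie to find the longest match without building substrings), and codes are printed in hex rather than packed into w-bit output:

  import java.util.HashMap;
  import java.util.Map;

  // LZW compression over the input alphabet used above:
  // 0x41 = A, 0x42 = B, 0x80 = end-of-file, new codes from 0x81.
  static void compress(String input) {
      Map<String, Integer> table = new HashMap<>();
      table.put("A", 0x41);
      table.put("B", 0x42);
      int nextCode = 0x81;
      int i = 0;
      while (i < input.length()) {
          int j = i + 1;   // grow the longest match found in the table
          while (j < input.length() && table.containsKey(input.substring(i, j + 1)))
              j++;
          String match = input.substring(i, j);
          System.out.printf("send %x = %s%n", table.get(match), match);
          if (j < input.length())   // add match + next char to the table
              table.put(input.substring(i, j + 1), nextCode++);
          i = j;
      }
      System.out.printf("send %x = end-of-file%n", 0x80);
  }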


5(b) Decode this sequence: 42, 41, 41, 81, 83, 85, 86, 42, 80

got 42, decode B
got 41, decode A, add 81 = BA
got 41, decode A, add 82 = AA
got 81, decode BA, add 83 = AB
got 83, decode AB, add 84 = BAA
got 85, special case! add 85 = AB+A=ABA, decode ABA
got 86, special case! add 86 = ABA+A=ABAA, decode ABAA
got 42, decode B, add 87 = ABAAB
got 80, all done.

Result: BAABAABABAABAAB
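
And the matching expansion sketch; the "special case!" lines above are the else branch here, where the arriving code is the one entry we have not finished adding yet, so it must be the previous string plus that string's first character:

  import java.util.HashMap;
  import java.util.Map;

  // LZW expansion for the same setup: 0x41 = A, 0x42 = B,
  // 0x80 = end-of-file, new codes assigned from 0x81.
  static String expand(int[] codes) {
      Map<Integer, String> table = new HashMap<>();
      table.put(0x41, "A");
      table.put(0x42, "B");
      int nextCode = 0x81;
      StringBuilder out = new StringBuilder();
      String prev = null;
      for (int code : codes) {
          if (code == 0x80) break;               // end-of-file
          String cur = table.containsKey(code)
                     ? table.get(code)
                     : prev + prev.charAt(0);    // special case: not in table yet
          if (prev != null)                      // finish the pending table entry
              table.put(nextCode++, prev + cur.charAt(0));
          out.append(cur);
          prev = cur;
      }
      return out.toString();                     // "BAABAABABAABAAB" for 5(b)
  }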

LSD string sort: examines the characters right to left.

MSD string sort: examines the characters left to right.