Data Structures: Algorithms for Points, Strings, and Compression
Given points in a plane, we can construct a geometric graph. Its vertices correspond to the points, and each edge has a weight equal to the distance between its endpoints. (We might have all “n choose 2” possible edges, or maybe a sparser subset of edges.)
- The EMST (Euclidean Minimum Spanning Tree): Given n points in the plane, find the lightest possible spanning tree made up of such edges. (A code sketch appears after this list.)
We go over the definition of the Delaunay triangulation (saying which edges are present), and also see a demo here: share/0326/delaunay/
- An argument that the Delaunay triangulation is planar: no two of its edges can cross. This depends on a claim about the angles at four points around a circle (opposite angles of a cyclic quadrilateral sum to 180 degrees), see here: http://en.wikipedia.org/wiki/Cyclic_quadrilateral
This argument is not final exam review material.
Two pictures supporting the program part of homework 4:
- The ETSP (Euclidean Traveling Salesperson Problem): Given n points in the plane, find a shortest cyclic tour through all the points.
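To make the geometric graph and the EMST concrete, here is a minimal sketch (the class and method names are ours, for illustration; this is not from the book or the homework). The complete graph is implicit: an edge's weight is just the distance between its endpoints, and dense Prim's algorithm, O(n^2), is a natural fit for a complete graph.

    import java.util.Arrays;

    // Sketch only: EMST weight of a point set via dense Prim's algorithm.
    // The complete geometric graph on n points is implicit; we never
    // materialize its "n choose 2" edges.
    public class EmstSketch {
        public static double emstWeight(double[][] pts) {
            int n = pts.length;
            boolean[] inTree = new boolean[n];
            double[] dist = new double[n];   // cheapest known edge into the tree
            Arrays.fill(dist, Double.POSITIVE_INFINITY);
            dist[0] = 0.0;
            double total = 0.0;
            for (int step = 0; step < n; step++) {
                int v = -1;                  // closest point not yet in the tree
                for (int i = 0; i < n; i++)
                    if (!inTree[i] && (v == -1 || dist[i] < dist[v])) v = i;
                inTree[v] = true;
                total += dist[v];
                for (int w = 0; w < n; w++)  // relax the edges out of v
                    if (!inTree[w]) dist[w] = Math.min(dist[w], euclid(pts[v], pts[w]));
            }
            return total;
        }

        private static double euclid(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return Math.sqrt(dx * dx + dy * dy);
        }

        public static void main(String[] args) {
            double[][] pts = { {0, 0}, {1, 0}, {0, 1}, {2, 2} };
            System.out.println(emstWeight(pts));  // 2 + sqrt(5), about 4.236
        }
    }

This also hints at why the Delaunay triangulation matters here: the EMST is known to be a subgraph of the Delaunay triangulation, so we could restrict Prim or Kruskal to its O(n) edges instead of all "n choose 2".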
Quadtree
A quadtree is a 4-ary tree data structure, useful for searching for “keys” which represent points in the plane. Each node in the tree stores one point, but is also responsible for all points that fall in some rectangular “bounding box” of the plane. The four subtrees under a node correspond to the four “quadrants” around its point.
Since we build the quadtree in a naive way (without rebalancing), we get better depth, expected O(log n), if we insert the n points in a random order. Just as we observed earlier for binary search trees, a random insertion order helps.
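Here is a minimal point-quadtree sketch, assuming distinct points (the class and method names are ours, not from the book). Insert walks down the tree comparing against each stored point and choosing one of the four quadrants, just like a BST with four-way branching.

    // Sketch only: a point quadtree with insert and membership search.
    public class QuadTree {
        private Node root;

        private static class Node {
            double x, y;
            Node ne, nw, se, sw;             // the four quadrants
            Node(double x, double y) { this.x = x; this.y = y; }
        }

        public void insert(double x, double y) { root = insert(root, x, y); }

        private Node insert(Node h, double x, double y) {
            if (h == null) return new Node(x, y);
            if      (x >= h.x && y >= h.y) h.ne = insert(h.ne, x, y);
            else if (x <  h.x && y >= h.y) h.nw = insert(h.nw, x, y);
            else if (x >= h.x && y <  h.y) h.se = insert(h.se, x, y);
            else                           h.sw = insert(h.sw, x, y);
            return h;
        }

        public boolean contains(double x, double y) {
            Node h = root;
            while (h != null) {
                if (x == h.x && y == h.y) return true;
                if      (x >= h.x && y >= h.y) h = h.ne;
                else if (x <  h.x && y >= h.y) h = h.nw;
                else if (x >= h.x && y <  h.y) h = h.se;
                else                           h = h.sw;
            }
            return false;
        }

        public static void main(String[] args) {
            QuadTree t = new QuadTree();
            t.insert(0.5, 0.5);
            t.insert(0.2, 0.8);
            t.insert(0.7, 0.1);
            System.out.println(t.contains(0.2, 0.8));  // true
            System.out.println(t.contains(0.9, 0.9));  // false
        }
    }

Search follows the same quadrant comparisons as insert, so the cost of both is proportional to the depth: expected O(log n) under random insertion order.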
String Sorting
- Suppose we are sorting strings lexicographically: compare two strings by their first characters, and if those are equal, compare their second characters, and so on. We can reduce string sorting to a sequence of small-integer sorting problems (where the characters are the small integers). Similar algorithms (in other books): bucket sort, counting sort.
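For fixed-length strings, this reduction is exactly LSD string sort: one stable key-indexed-counting pass per character position, moving right to left. A compact sketch, in the style of the book's LSD sort (the class name is ours):

    // Sketch only: LSD string sort for strings that all have length w.
    public class LsdSketch {
        public static void sort(String[] a, int w) {
            int r = 256;                               // extended-ASCII alphabet
            String[] aux = new String[a.length];
            for (int d = w - 1; d >= 0; d--) {         // right to left
                int[] count = new int[r + 1];
                for (String s : a) count[s.charAt(d) + 1]++;          // frequencies
                for (int c = 0; c < r; c++) count[c + 1] += count[c]; // cumulates
                for (String s : a) aux[count[s.charAt(d)]++] = s;     // stable distribute
                System.arraycopy(aux, 0, a, 0, a.length);             // copy back
            }
        }

        public static void main(String[] args) {
            String[] a = { "dab", "cab", "ebb", "add", "bad" };
            sort(a, 3);
            System.out.println(String.join(" ", a));   // add bad cab dab ebb
        }
    }

Stability of each counting pass is what makes the right-to-left order work: after the pass on position d, ties are still in the order established by positions d+1 and beyond.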
3-Way Radix Quicksort
3-way radix quicksort (see also ../book/quick3string.java) saves space compared to MSD string sort, since it needs no extra array. It performs particularly well when the actual alphabet is small, as in our pi10.txt example. (A sketch appears below, after the drawbacks.)
Nice features (in the slides; similar to quicksort):
- On random strings, expect to use ~ 2 n ln n char comparisons
- Beats quicksort, especially for strings with long common prefixes
Drawbacks of 3-way quicksort:
- Not stable, because of the quicksort-like partition method.
- Uses O(w*r) extra space (stack) in the unlikely worst case.
- The slides say O(w + log n), which is true in expectation if you pick pivots randomly (or pre-shuffle the input).
- Like quicksort, it ought to pick random elements as pivots, to avoid the (unlikely but possible) bad worst-case behavior.
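Here is a compact sketch of 3-way radix quicksort, following the structure of ../book/quick3string.java but simplified (no cutoff to insertion sort, no pre-shuffle, and the first element's character as the pivot):

    // Sketch only: 3-way radix quicksort on an array of strings.
    public class Quick3StringSketch {
        public static void sort(String[] a) { sort(a, 0, a.length - 1, 0); }

        // charAt returns -1 past the end, so shorter strings sort first
        private static int charAt(String s, int d) {
            return d < s.length() ? s.charAt(d) : -1;
        }

        private static void sort(String[] a, int lo, int hi, int d) {
            if (hi <= lo) return;
            int lt = lo, gt = hi, i = lo + 1;
            int v = charAt(a[lo], d);                  // pivot character
            while (i <= gt) {
                int t = charAt(a[i], d);
                if      (t < v) swap(a, lt++, i++);
                else if (t > v) swap(a, i, gt--);
                else            i++;
            }
            sort(a, lo, lt - 1, d);                    // keys with a smaller char
            if (v >= 0) sort(a, lt, gt, d + 1);        // equal: move to next char
            sort(a, gt + 1, hi, d);                    // keys with a larger char
        }

        private static void swap(String[] a, int i, int j) {
            String t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

Note the middle recursive call advances to character d+1 only when the pivot character exists (v >= 0); strings that already ended at position d are fully equal and need no further work.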
Tries
Deletion (just the idea).

Iteration:
- Recursive; we need a "prefix" string for the current sub-trie.
- The order is preorder (prefixes first), not inorder.
- Our "stackiterator" approach could be adapted here.

Prefix matching (find all keys with a given prefix; see the sketch after this list).
Longest-prefix-of (find one key).

Patricia trie:
- Compresses out 1-way branching (so there is a substring per node).
- Works well with a small alphabet (e.g. binary or hexadecimal).
- More space efficient, especially for long common prefixes.
- Faster initial key search: skip ahead to the next "forking" character.
- We will skip the implementation, but know how to read such a trie.

Suffix tree:
- A Patricia trie of all suffixes; the "bananas" example.
- Note that "banana" would be trickier, since some suffixes (na) are prefixes of others (nana). A fix is to add a unique terminator character ("banana0"); then each suffix is a leaf.
- In fact O(n) space (although it looks like quadratic space): it has O(n) nodes and O(1) space per node, since we can represent each substring in O(1) space (say, a start index and a length into the text).
- There is a tricky O(n)-time algorithm to construct the suffix tree of a given text, but we will only assert this.
- The suffix tree is slightly more general than the suffix array, and has similar applications (recall: "Manber's algorithm").
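For prefix matching, here is a minimal R-way trie sketch in the style of the book's trie symbol table (the class and method names are ours). keysWithPrefix walks down to the node for the prefix, then collects every key in that sub-trie in preorder, carrying the current prefix string along, as described above.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: an R-way trie of string keys, with prefix matching.
    public class TrieSketch {
        private static final int R = 256;            // extended ASCII
        private Node root = new Node();

        private static class Node {
            boolean isKey;                           // does a key end here?
            Node[] next = new Node[R];
        }

        public void put(String key) {
            Node h = root;
            for (int d = 0; d < key.length(); d++) {
                int c = key.charAt(d);
                if (h.next[c] == null) h.next[c] = new Node();
                h = h.next[c];
            }
            h.isKey = true;
        }

        public List<String> keysWithPrefix(String prefix) {
            List<String> out = new ArrayList<>();
            Node h = root;
            for (int d = 0; d < prefix.length(); d++) {  // walk down to the prefix
                h = h.next[prefix.charAt(d)];
                if (h == null) return out;               // no key has this prefix
            }
            collect(h, new StringBuilder(prefix), out);
            return out;
        }

        private void collect(Node h, StringBuilder prefix, List<String> out) {
            if (h.isKey) out.add(prefix.toString());     // preorder: prefix first
            for (int c = 0; c < R; c++)
                if (h.next[c] != null) {
                    prefix.append((char) c);
                    collect(h.next[c], prefix, out);
                    prefix.deleteCharAt(prefix.length() - 1);
                }
        }

        public static void main(String[] args) {
            TrieSketch t = new TrieSketch();
            for (String s : new String[] { "she", "shells", "sea", "shore" }) t.put(s);
            System.out.println(t.keysWithPrefix("sh"));  // [she, shells, shore]
        }
    }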
Huffman and LZW Running Times
Both Huffman and LZW, for both compression and expansion, run in near-linear time in the size n of the original text, assuming that you choose reasonable data structures (arrays, tries, etc). In particular, Huffman compression uses a priority queue to build up the binary trie on r leaves, so its running time (with a binary heap) is O(n + r lg r). LZW needs a code table of size 2^w, where w is the "bitsize" we choose for its codewords (typically w ranges from 12 to 16, so 2^w is not huge).
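To see where the O(r lg r) term comes from, here is a minimal sketch of the trie-building loop (class and method names are ours): the r leaves enter a priority queue, and then r-1 merges, each O(lg r) with a binary heap, build the trie bottom-up.

    import java.util.PriorityQueue;

    // Sketch only: build a Huffman trie from character frequencies.
    public class HuffmanBuildSketch {
        static class Node implements Comparable<Node> {
            char ch; int freq; Node left, right;
            Node(char ch, int freq, Node left, Node right) {
                this.ch = ch; this.freq = freq; this.left = left; this.right = right;
            }
            boolean isLeaf() { return left == null && right == null; }
            public int compareTo(Node o) { return freq - o.freq; }
        }

        // freq is indexed by character code, e.g. an int[256] of counts
        static Node buildTrie(int[] freq) {
            PriorityQueue<Node> pq = new PriorityQueue<>();
            for (char c = 0; c < 256; c++)
                if (freq[c] > 0) pq.add(new Node(c, freq[c], null, null));
            while (pq.size() > 1) {            // repeatedly merge the two smallest
                Node x = pq.poll(), y = pq.poll();
                pq.add(new Node('\0', x.freq + y.freq, x, y));
            }
            return pq.poll();                  // the root of the trie
        }
    }

The O(n) term covers counting the frequencies and then emitting one codeword per input character.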
Example: Huffman Coding
4 (Huffman)

4(a) First find the frequency f_c of each character c:

c  f_c
T  3
O  5
M  1
G  2
!  1

Now do the Huffman merges. We get a binary trie with this structure: (O+(T+((M+!)+G))). Of course, you could make the left/right choices differently. Using 0 to step left and 1 to step right, I get this codeword table (for each character c):

c  code
T  10
O  0
M  1100
G  111
!  1101

4(b) We can now encode S = "TOMGOTOTOGO!":

T   O  M     G    O  T   O  T   O  G    O  !
10  0  1100  111  0  10  0  10  0  111  0  1101

Concatenated: 1001100111010010011101101. That has 25 bits. In fact we could have worked that out as the sum, over c, of f_c * l_c, where l_c is the length of the code for c: 3*2 + 5*1 + 1*4 + 2*3 + 1*4 = 25. The trie can be encoded, as described in the book, with 1 bit per node (9 nodes, internal or leaf) plus 8 bits per leaf (5 leaves, each holding a char). So: 9 + 5*8 = 49 bits to encode the trie.
Example: LZW Compression
5 (LZW)

5(a) Sending "ABABABAABAAB" = "A B AB ABA ABAA B":

send 41 = A,    add 81 = AB
send 42 = B,    add 82 = BA
send 81 = AB,   add 83 = ABA
send 83 = ABA,  add 84 = ABAA
send 84 = ABAA, add 85 = ABAAB
send 42 = B
send 80 = end-of-file

5(b) Decode this sequence: 42, 41, 41, 81, 83, 85, 86, 42, 80.

got 42, decode B
got 41, decode A,   add 81 = BA
got 41, decode A,   add 82 = AA
got 81, decode BA,  add 83 = AB
got 83, decode AB,  add 84 = BAA
got 85, special case! add 85 = AB+A = ABA, decode ABA
got 86, special case! add 86 = ABA+A = ABAA, decode ABAA
got 42, decode B,   add 87 = ABAAB
got 80, all done.

Result: BAABAABABAABAAB
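Here is a minimal sketch of the expansion side, matching the example above (hex codes: 41 = 'A', 42 = 'B', 80 = end-of-file, new table entries from 81 up; the class name and the 12-bit table size are our assumptions for illustration).

    // Sketch only: LZW expansion, including the "special case" where a
    // code arrives before its table entry has been defined.
    public class LzwDecodeSketch {
        public static String expand(int[] codes) {
            String[] table = new String[1 << 12];     // assume 12-bit codewords
            for (int c = 0; c < 0x80; c++) table[c] = "" + (char) c;
            int next = 0x81;                          // 0x80 is reserved for EOF

            StringBuilder out = new StringBuilder();
            String prev = table[codes[0]];
            out.append(prev);
            for (int i = 1; i < codes.length; i++) {
                if (codes[i] == 0x80) break;          // end-of-file code
                String cur;
                if (codes[i] < next) cur = table[codes[i]];
                else cur = prev + prev.charAt(0);     // special case: not yet in table
                table[next++] = prev + cur.charAt(0); // define the pending entry
                out.append(cur);
                prev = cur;
            }
            return out.toString();
        }

        public static void main(String[] args) {
            int[] codes = {0x42, 0x41, 0x41, 0x81, 0x83, 0x85, 0x86, 0x42, 0x80};
            System.out.println(expand(codes));        // BAABAABABAABAAB
        }
    }

The special case is the else branch: the received code can only be the entry about to be defined, so the decoded string must be prev plus its own first character (AB+A = ABA in the example).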
Recall the scan directions for the string sorts:
- LSD sort: examines characters right to left.
- MSD sort: examines characters left to right.