_ _ _ _ _ / \ | | __ _ ___ _ __(_) |_| |__ _ __ ___ ___ / _ \ | |/ _` |/ _ \| '__| | __| '_ \| '_ ` _ \/ __| / ___ \| | (_| | (_) | | | | |_| | | | | | | | \__ \ /_/ \_\_|\__, |\___/|_| |_|\__|_| |_|_| |_| |_|___/ |___/ ============================================================================== SORTING ALGORITHMS ============================================================================== Selection Sort: Always O(n^2) regardless of data Suitable on linked lists where removal and addition are cheap Generally dominated by insertion sort, which does half as many comps on avg Only advantage is when swaps are expensive; requires O(n) not O(n^2) for ins Primarily of historical interest Algorithm: Find the smallest element Swap it with the first element Repeat until done Insertion Sort: Desirable when data are already almost sorted; otherwise O(n^2) Stable, in-place Fewer comparisons than selection sort but more writes May be faster than quicksort or mergesort for smallish number of elements When comparisons are expensive, a variant with binary search can be used CLR page 8 Algorithm: Start at item 2 Move the item left until it is not smaller than the item left of it Repeat until done Shell Sort: Variant of Insertion Sort which moves in decreasing gaps. Running time order is a function of gap series. O(n^4/3) and O(n*lg(n)^2) is known. There is a proof that n*lg(n) cannot be achieved. For gap k, sorts 1st, kth, 2kth, elements. Then sorts 2nd, k+1th, etc. A favorable sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750. Merge Sort: Algorithm: If there is more than one element: Sort first half (recursive call) Sort second half (recursive call) Merge two halves Notes: Requires additional storage (not in-place). Recursive. CLR page 13 Heap Sort: O(n lg n). Generally dominated by quick sort in practice. See section on heaps for details. Quick Sort: O(n^2) worst case, but O(n lg n) average case, and constant factors low. In-place memory usage. Good performance requires balanced partitioning. Naive algorithm has worst-case time on already sorted input by using initial element as the pivot. CLR page 153; entirety of Chapter 8. Quicksort: If input array is greater than one element Partition it, getting divider as return value Run quicksort on each side Partition: Set pivot = first element (other more intelligent methods possible) Take two indices, one at the front and one at the back. Advance the front index until it points to an element >= pivot Retreat the back index until it points to an element <= pivot If the indices have crossed, return index location Otherwise, swap elements and repeat Radix Sort: Sort on least significant digit first (an O(n) operation). Then next least, etc. Of course a stable sort must be used. Not in-place. Note that time is a function of the size of the numbers and not just the number of items. CLR page 178. Bucket Sort: Assumes input is randomly distributed over an interval. Then divides this interval into n buckets and puts the n input numbers into them. Use any kind of sort within the bucket, since the number of items will be small. A linked list for buckets is common. If the input is *not* evenly distributed, the within-bucket sort can dominate. CLR page 180. ============================================================================== PARTIAL SORTING/SELECTION ALGORITHMS ============================================================================== Min and max: There is a cute trick to reduce the number of comparisons when finding both min and max. Compare each two elements pairwise, and then compare only the larger with the max and the smaller with the min. This gets you 3/2n comparisons instead of 2n. CLR page 186. Selection Algorithm for Ith Largest Element: Proceed like a (randomized) quicksort, but each time recurse only on the half which will contain the desired answer. CLR page 187. There is further discussion on ways of guaranteeing a good pivot. ============================================================================== MISCELLANEOUS DATA STRUCTURES ============================================================================== STACKS, QUEUES ~~~~~~~~~~~~~~ CLR chapter 11. Queues generally implemened in a ring buffer. Also consider deques, not in CLR. LINKED LISTS ~~~~~~~~~~~~ Remember that you can make a ring buffer with a sentinel between the beginning and end, which is sometimes convenient. Consider the merits of singly vs. doubly linked lists. Insertion and deletion are cheap; indexing is expensive. Skip lists: The bottom layer is an ordinary linked list. Each higher layer contains an element with some probability, often 1/2 or 1/4. The expected search time is O(log N); P can be varied to trade storage space against search time. They have properties comparable to trees with random insertions, but without requiring insertions to be random. The best place to start a search is the level where we expect 1/p nodes, or log_1/p(n). However, starting at the highest level is also generally OK and adds only a small constant factor. It is also suggested that the level of a new node could be constrained to be at most 1 higher than the existing maximum level, which works well in practice but is difficult to analyze. The absolute fastest search is with p=1/e, but it is only 94% of the 1/2 and 1/4 cases. 1/4 is just as fast on average as 1/2, with less storage, but the variability is higher. See ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf DISJOINT SETS ~~~~~~~~~~~~~ CLR page 440. Used in Kruskal's algorithm for minimum spanning trees. The set has a representative item, which may be arbitrary or may have a specific algorithm for its selection. The operations make a set from a singleton, unite sets, and find the set containing an item. Linked List Representation: Representative is first object. Each object contains a pointer back to the representative. Thus find-set is trivial because it gets the representative from the object (note that you must have the set object, not a copy of the key). Union then just appends one linked list to the other, but it must update the representative pointers, taking O(m) time. As a heuristic for a minor improvement, always add the smaller set to the larger. For m operations, of which n are make-set, this gives O(m+n lg n); CLR 445. Forest Representation: Each tree is a set. Each member points only to parent, not to root of tree. Find-set then requires walking up the tree to the root. Union points the root of one tree at the root of the other tree (they're not binary trees). This by itself is not necessarily an improvement, as no assertions about favorable height of trees can be made; they could be degenerate (linked lists). Union by Rank: Make the root of the tree with fewer nodes point at the root of the tree with more nodes instead of the other way around (CLR 447). To do this efficiently, we must maintain the rank as we go instead of trying to compute it when doing the union. The rank is an upper bound on the height, but may conceivably overestimate it. If the ranks of the trees being joined are exactly equal, the rank of the new root will go up by 1. Path Compression: When performing a find, update all nodes on the path to point directly to the root. Ranks records are unchanged, and thus could conceivably become an overestimate. The compression is a two-pass procedure. Running Time: With both heuristics, O(m alpha(m,n)) for m operations on n elements. Alpha is the Ackermann function inverse, <= 4. Proof is difficult. CLR p 453 presents a weaker proof. HASH TABLES ~~~~~~~~~~~ Overview: O(1) if you get it right, but of course O(n) if your hash sucks and everything goes in one bin. In Chaining, singly linked list is used to store multiple elements in one bin. A doubly-linked list makes deletion cheap if you already have a pointer to the element. Consider a table with n elements and m slots; the hash factor alpha is n/m. An unsuccessful search is then theta(1+alpha) on average. Alternatively, you can store everything in the table itself, by moving onward when you have a collision (linearly, quadratic, double hashing, etc). Note that special behavior is required on deletion so that subsequent values are not hidden by the empty slot stopping the search. CLR prefers double-hashing, which generates the most sequences, and describes it on page 235. Hash Functions: CLR has a very short discussion of hash functions on page 226. It is common to interpret the key as a natural number, such as by taking a string as a number in some radix. One simple approach is dividing this number by a prime which is not too close to a power of 2 (CLR 228) and taking the remainder. The second is to multipley by a floating point number A and then multiply by an integer m, often a power of 2. Without explanation, (sqrt(5)-1)/2 is suggested for A, with a reference to Knuth. For double hash functions to support open addressing, an example is: m is prime, m' is slightly less than m h1(k) = k mod m h2(k) = 1 + (k mod m') h(k,i) = (h1(k) + ih2(k)) mod m Birthday Problem: As a very rough approximation, there is ~50% probability of a collision once there are 2^(N/2) things in an N bit hash. A usable approximation is p(n) = 1 - e^(-n^2/(2*size)) For n=2^32 and size=2^64, this gives p=1-e^(-.5) = 0.4 Working this the other way around, 0.5 = e^(-0.6931), or n^2 = 1.39 * size so for a 64-bit hash, 2^(32.24) things result in 50% collision probability. http://en.wikipedia.org/wiki/Birthday_problem has several other formulae Consistent Hashing: From the paper "Consistent Hashing and Random Trees". Assume that each person can see at least a constant fraction of the buckets which exist. The desired properties are balance (even assignment of items to visible buckets), monotonicity (when new buckets become visible, things may move to the new ones, but not one of the old ones), spread (the number of different buckets to which something gets assigned should be small), and load (the buckets should all have the same number of items). Suppose rb and rx are random functions. rb maps buckets to the unit interval and rx maps items. fv(i) is the bucket b in V which minimizes the distance between rb(b) and rx(i). If there are at most C buckets across all views, each bucket is actually replicated k*log(C) times, which are then randomly mapped with rb(). A simple implementation uses a binary search tree to store the correspondence between unit interval segments and buckets. There are k*C*log(C) intervals, so the depth is log(C); inserting or removing k points takes log^2(C). To get the hash computation time down to O(1), divide the interval into k*C*log(C) equal length segments, and keep a separate tree for each, with expected size O(1). Since these must shrink as buckets are added, use intervals of length 1/2^x <= 1/(k*C*log(C), choosing the largest x. Then divide them in half. It is also important that the expected length of a run of empty intervals is small, since they must all be updated when a bucket is added. ============================================================================== HEAPS ============================================================================== CLR page 401: Procedure Binary Heap Binomial Heap Fibonacci Heap (worst) (worst) (amortized) Make-Heap 1 1 1 Insert lg n lg n 1 Minimum 1 lg n 1 Extract-Min lg n lg n lg n Union n lg n 1 Decrease-Key lg n lg n 1 Delete lg n lg n lg n BINARY HEAPS ~~~~~~~~~~~~ The heap is organized such that each child is less than its parent. It can be stored in an array, with the first row being element 1, the second row being elements 2 to 3, etc. Thus, the left side must be populated first. Generally, the heap sort is dominated by the quick sort, but heaps are more useful as priority queues. CLR chapter 7. Heapify: Input is a location and heap. Assumes children are proper heaps, but item itself might be too small. O(n lg(n)) (from height of heap) Algorithm: If there is a left child: Set largest= the larger of self and left child top If there is a right child: If top of right child > largest largest=top of right child If largest is not self Exchange Recursively call on new largest Build-Heap: Repeatedly calls heapify to make a heap from an unordered array. Start at the element which is half the length, and call heapify on everything from there decreasing to the first element. Running time is O(n), proven by showing that the running time of heapify is lower at nodes close to the bottom. CLR page 145 Heap Sort: O (n lg n) Build the heap For i in last-element to element 2: exchange first and ith element (puts largest element last) Decrease global heap size by 1 heapify starting at first element (puts new largest element first) Heap Insert: O(lg n) Put new element in at end. Bubble it up until it is larger than its children. More formally: Increase length by 1 i=heap size while i>1 and parent of element i < new value set element i = element i's parent i = parent(i) element i = new value Heap Extract: O(lg n) Take off first element Move last element to front Rerun heapfify() BINOMIAL HEAPS ~~~~~~~~~~~~~~ Binomial heaps are interesting when we need to take the union of them, which can be done in lg(n) instead of n. Otherwise, just use ordinary binary heaps. CLR chapter 20, page 400. A binomial heap is made up of binomial trees. There is at most one tree of each degree. The trees are heap-ordered - the key of a node is greater than or equal to its parent. (isn't this the opposite convention from that used in describing binary heaps?) See also section on binomial trees. Representation: Keeps track of parent, left child, and right sibling, and notes own degree. Roots of trees are in singly linked list in increasing degree, giving an alternative meaning to the sibling field. Make-binomial-heap: Creates an empty heap with one null pointer. Minimum: The minimum will be at the root of a tree, so just search the initial linked list, which will be of length order lg(n). Binomial-Link (child y,parent z): Set y'z parent to z Set y's sibling to z's child Set z's child to y Increment z's degree Union (a, b): We will first merge the root lists. This could leave as many as two trees of each degree, and we need at most one. Thus, for each two that are the same degree, we merge them into one of the next higher degree, working our way upward. It gets a little complicated because after joining two trees, we might temporarily have THREE of the same degree, and we have to leave the first as-is and merge the last two. e.g. 1,1,2,2,3 -> 2,2,2,3 -> 2,3,3 -> 2,4 Time proportional to number of trees, or lg(n) Details: Create a new heap H. Merge a and b into H, joining root lists in monotonically inc. degree order if H is empty, return Set prevx to null set x to head of H set nextx to x's sibling while nextx: if degree(x) != degree(nextx) ;1 of this degree OR (sibling(nextx) exists AND degree(sibling(nextx))==degree(x)) ;3 ; Don't need to merge. Advance prevx = x x = nextx else if key(x) <= key(nextx) ; make nextx a child of x sibling(x)=sibling(nextx) binomial-link(nextx, x) else if not prevx: head = nextx else sibling(prevx)=nextx binomial-link(x, nextx) x=nextx nextx = sibling(x) Insert: Make a new heap to contain the element. Set its parent, child, and sibling to null Set its degree to 0 Union(existing heap, new one element heap) Extract-min: Find the tree with the smallest root Make a new heap out of the remainder of this tree: The linked list will be the root's children in reverse order Unite the old heap (less the removed tree) and the new heap. (CLR page 413, with illustrative figure) All O(lg n), proportional to number of trees Decrease-Key: Error if key has gone up Otherwise, bubble node up until it is not smaller than parent Delete: Decrease key to -infinity Extract minimum (cute eh?) FIBONACCI HEAPS ~~~~~~~~~~~~~~~ Insert, minimum, and union are O(1). Extract-min and delete are O(lg n). Decrease-key is O(1) amortized, significant as a basis in graph algorithms (see for example Dijkstra's algorithm). CLR chapter 21, p420. ... ============================================================================== TREES ============================================================================== Most operations are O(h), where h is height. If the tree is balanced, this is lg(n), but it can be n if the tree is degenerate (think of inserting in sorted order). BINARY TREES ~~~~~~~~~~~~ Consider how to store equal values - CLR lets them be the left or right child. If you do not store pointers to parent, successor (and thus deletion) may be problematic. Recursive search is trivial; so are min and max. Search: If equal, return If empty, return not found If value less than current node, search left child Otherwise search right child Successor: If right child is not empty, return its minimum. Otherwise, move to parent until you find a parent to whom you are the left child instead of the right. O(h). Requires ability to move up to parent. Insertion: Like a search, but when you get to the bottom, you insert a new node. Deletion: One of the more complex operations. CLR page 251. Algorithm: Node z is to be removed; p is its parent. If z has no children: Modify p to replace z with a null If z has one child: Make this child a direct child of p If z has two children: Take z's successor y (which has no left child - this is always true) Remove y Replace z with y RED-BLACK TREES ~~~~~~~~~~~~~~~ Stores one bit of color on each node. Assurance is that no path is more than twice as long as any other. Coloring Invariants: Every node is either red or black Every leaf is black and is NIL If a node is red, both its children are black Every simple path to a leaf contains the same number of black nodes (some definitions also give the root as always black. CLR does not.) Max height 2*lg(n+1); proof CLR page 264. General idea is that you cannot have two red nodes in a row, so the shortest is all black and the longest alternates. Other variants exist in which leaves are not always NIL. This changes some of the fundamentals and is considered more complex. There are also variants in which the edges are spoken of as colored instead of notes. Analgous to B-trees of order 4 (discussed on wp article) Basic operation for restoring red-black property after insertions and deletions is "rotation": y x x/ \g <------> a/ \y a/ \b b/ \g Left-Rotate: y = right(x) right(x) = left(y) if left(y) parent(left(y))=x parent(y)=parent(x) if x has no parent: root=y else if x is its parent's left child left(parent(x))=y else right(parent(x))=y left(y)=x parent(x)=y The basic idea is to insert and delete as in a normal binary tree, but then fix up the tree afterwards. Insert: Insert new node x as in binomial tree Color x red While x is not the root and x's parent is red: If x's parent is the left child of its parent: y = x's parent's right child if y is red: Color x's parent black Color y black Color x's grandparent red Move x up two levels (to its current grandparent) else if x is the right child of its parent: x = x's parent left-rotate at x color x's parent black color x's grandparent red right-rotate at x's grandparent Else: (exactly the same thing, mirrored) Color root black. O(lg(n)) CLR 268 Delete: Keep in mind that on the binary tree, leaves have null pointers to their children, but in the red-black tree, the leaves are special nil sentinels colored black. The transcription of the algorithm here is omitted for reasons of time and space. It is easy to look up online or in standard references. Augmentations for Order: Augmentations to find ith largest element, and the rank of an element you have, are presented in CLR chatper 15 (page 281). Augment the data by including the number of nodes in the subtree. Finding the ith largest is obvious. To find the rank of an element you have, move up, adding the value of the left child each time you are a right child of your parent. The information can be maintained without increasing the order of the updating algorithms. Add 1 if you want to count starting at 1. INTERVAL TREES ~~~~~~~~~~~~~~ CLR describes augmenting an RB tree (or presumably a normal BST) for intervals: Store both endpoints of interval in node; order by low endpoint. Augment by storing maximum high endpoint of any entry in subtree. When searching, go left if there is a left child and the maximum of all things in the left subtree is >= the left end of the search interval. CLR 290 This finds ONE interval which overlaps, not ALL, and is thus not the most useful. For a completely different take on interval trees, see http://www.dgp.toronto.edu/people/JamesStewart/378notes/22intervals/ The approach here is to break up the space into "elementary" intervals, which have no interval endpoints within them (thus each new interval creates two new breaks, if the endpoints don't match any other). Notionally, each leaf contains one elementary interval and is annotated with the intervals that contain it, but this is inefficient. Instead, think of internal nodes as "spanning" the union of all their leaves. Store an interval in the highest internal node which is completely inside it; each interval can be stored at most twice at each level. To query all intervals which overlap a number k, simply walk down to the leaf containing k, accumulating all the intervals encountered along the way. B-TREES ~~~~~~~ Designed for use on disk or other slow-access data storage media. B-trees can have a branching factor greater than 2; they are not restricted to be binary. Each node will store one fewer value than the number of its children, and are commonly sized to match a disk page. CLR chapter 19, page 381. Properties: Every node has the number of keys k, the keys, and a flag for being a leaf. Internal nodes contain k+1 pointers to children The keys separate the groups of children Every leaf has the same depth Let t be the minimum degree. Nodes must have t-1 to 2t-1 keys inclusive. The root is exempt from the minimum. Search: Straightforward generalization of tree search. CLR presents search of keys as linear, not binary. Page 388. Create: Create a node. Set it as leaf. Set its number of keys to 0. Write. Split-Child: Inputs: A nonfull internal node x, and index i, and a full node y which is the ith child of x. If y is the root, the tree grows in depth. Move median key from full node to parent. Move second half of keys (and children if applicable) to new node. Insert old median key and new node as child after it into parent. CLR page 390. Insert: This one-pass algorithm splits all full nodes when descending through them, so that we do not have to backtrack later. CLR page 392. Algorithm: If root is full, allocate new root and make existing root its only child. Then call split-child on it. If we are at a leaf: Insert the new key, shifting the others over. Return. If not: Descend to the appropriate child If it is full, split it immediately and move to appropriate new node Recursively call insert at new location Delete: This one-pass algorithm makes sure all nodes have degree at least t (one more than the min) when going through them, to reduce backtracking (but it cannot be completely eliminated). CLR page 395. Arguments: key k to delete, node x (initially the root) Case 1: If key k is in x, and x is a leaf, delete it. Done. Case 2: If key k is in x, and x is an internal node: 2a: If x's child y, preceding key k, has at least t keys, find the predecessor k' of k in tree at y. Recursively delete it, and replace k with k'. 2b: If 2a is not met because there aren't enough keys in y, look at the child on the other side (call it z )instead. 2c: If they both have only t-1 keys, merge k and all of z into y. Delete k and pointer to z from x. Free z and recursively delete k from y. Case 3: If key k is not in x: Find the child which must contain k, assuming k exists. 3a: If this child has only t-1 keys, but a sibling has t or more, shift a key downward from x into the child and replace it with one from its sibling. 3b: If this child and all its siblings have only t-1 keys, merge the child with a sibling, moving a key down from x to be the new median. Then recursively delete on the new child. BINOMIAL TREES ~~~~~~~~~~~~~~ For a tree of order 0, there is one node. For a tree of order k>0, its k-1 children are trees of orders 0 through k-1. The number of nodes must be a power of 2. CLR page 401, but I found its explanation somewhat opaque. HIGHER DIMENSIONAL TREES ~~~~~~~~~~~~~~~~~~~~~~~~ For quadtrees, octrees, and R-trees, see the separate section on geometric algorithms. ============================================================================== GRAPHS ============================================================================== TERMINOLOGY AND REPRESENTATION ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let the graph G have set V of vertices and set E of edges. Standard representations are adjacency lists or adjacency matrix. The list is generally preferred for sparse graphs. For undirected graphs, you can store only the upper or lower diagonal of the matrix, and for unweighted graphs, only one bit per entry is required. BREADTH FIRST SEARCH ~~~~~~~~~~~~~~~~~~~~ Works on directed and undirected graphs which are unweighted. Produces a distance to each vertex and a tree with the shortest path. Uses a FIFO to manage vertices. In this algorithm, the tree is implicitly created from the list of parents. Running time is O(V+E). CLR page 470. Algorithm: Color all vertices white. Set distance to all vertices to infinity. Set parent of each vertex to nil. Put source vertex s at root of tree. Insert s into queue. While queue is not empty: Take vertex u from head of queue For each vertex v adjaced to u: if v is white: color v grey distance[v] = distance[u]+1 parent[v] = u Pop u from the queue Color u black DEPTH FIRST SEARCH ~~~~~~~~~~~~~~~~~~ Multiple trees may be found (a forest). Each node gets two timestamps; when it is greyed and when it is blacked. Time is a global variable. Running time O(V+E). CLR page 478. DFS: Color all vertices white; set predecessor to nil For each vertex: If vertex is white: DFS-visit(vertex) DFS-visit (vertex u): Color u grey set timestamp on disovered_time(u) Increment global time For each v adjacent to u: If v is white: set parent(v)=u DFS-visit(v) color u black set timestamp on finished_time(u) Classification of Edges: Tree edges are edges in the forst (white when found) Back edges are those connecting to an ancestor (gray when found) Forward edges are nontree edges to a descendent (black when found) Cross edges are all others (black when found) For undirected graph, convention is to pick first from list which applies. CLR page 482 Topological Sort: Sorts a DAG such that if there is an edge (u,v), u appears before v. Of course this does not work if it is cyclic. CLR 486. Algorithm: Perform a depth-first search to get finishing times As each vertex is finished, insert it into the front of a linked list Strongly Connected Components: A strongly connected component is one in which each vertex is reachable from the others. Thus this is applicable only to directed, cyclic graphs. We will need the transpose of the graph, which is obtained by reversing the direction of all edges. Proof is nonobvious; CLR page 489. Algorithm: Call DFS on original graph, getting finishing times Compute transpose of graph Call variant DFS on transpose, visiting nodes in order of decreasing finishing time from original DFS Output vertices of each tree in this search as strongly con. component MINIMUM SPANNING TREES ~~~~~~~~~~~~~~~~~~~~~~ The minimum spanning tree is the set of edges which reach all nodes at minimum cost. It is always acyclic. Greedy algorithms are optimal here. CLR 499; HL 363. Trivial "generic" algorithm framework: Set A to null set While A is not a minimum spanning tree: Find an edge (u,v) that is "safe" (in tree) and add it to A Return A Selecting an edge: Cut the graph in a way that does not cut an edge in A ("respects A"). The lowest-cost edge crossing the cut is safe to add to A. Proof CLR page 501, not entirely intuitive to me. Kruskal's Algorithm: A is a forest. Algorithm requires selection of a disjoint-set data structure. CLR 504. O(E lg E) with correct set representation. Algorithm: Set A to null set Make a one-item set of each vertex Sort edges by weight For each edge (u,v) in order of nondecreasing weight: If u and v are not in the same set: Join the set with u and the set with v Add (u,v) to A Prim's Algorithm: A is maintained as a single tree. A priority queue is used to store vertices to add, and gives an example of why we care about a decrease-key operation in heaps. O (E lg V), improvable to O(E + V lg V) with Fibonacci heaps. CLR 509; HL 363. The HL treatment is graphically intuitive but does not address the details of the required data structures and the performance of various choices. Algorithm: Arbitrary starting node argument r Insert all vertices into queue, with infinite key Set r's key in queue to 0 Set r's parent to nil While queue is not empty: Extract minimum value from queue, call it u For each v adjacent to u: if v is in the queue and weight (u,v) is less than v's key: Decrease v's key in the queue to weight(u,v) set parent of v to u SINGLE SOURCE SHORTEST PATH ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some similarities to breadth-first search. Also, maintain similar list of parents/predecessors. Shortest paths may not be unique. If one cares about only a single destination, one can stop the algorithm when it is reached, but it is not possible to do asymptotically better by looking for only one destination than for enumerating all the nodes. Some algorithms allow negative weights as long as there are no negative-weight cycles reachable from the starting node. HL 359 (cursory), CLR chapter 25, p514. We will maintain a bound on the shortest-path weight to each node. We will then relax it (reduce it) if we can reach it with lower cost from one of the adjacent nodes based on *that* node's estimate. Dijkstra's Algorithm: Requires all weights nonnegative. Will maintain set S of vertices whose final weights are determined and a priority queue Q that contains all vertices NOT in S. CLR 527. Queue in linear array: O(V^2+E) = O(V^2) Queue in binary heap: O(E lg V) Queue in Fibonacci heap: O(V lg V + E) The historical development of the Fibonacci heap was motivated by this algorithm, with far more decrease-key than extract-min calls. Algorithm: Set all distances infinite; all parents nil. Set S to empty set. Put all vertices in Q, prioritized on distance While Q is not empty: Remove minimum from Q and store it in u Add u to set S For each vertex v adjacent to u: if distance(v) > distance(u) + weight(u,v): parent(v)=u distance(v) = distance(u) + weight(u,v) Bellman-Ford Algorithm: Allows negative weights; can detect negative-weight cycle. CLR 532; note their loop variable i is unused. Generally slower O(VE) Algorithm: Set all distances infinite; all parents nil. Repeat |V|-1 times: (relaxation loop) For each edge (u,v): if distance(v) > distance(u) + weight(u,v): parent(v)=u distance(v) = distance(u) + weight(u,v) For each edge (u,v): (cycle detection loop) if distance(v) > distance(u) + weight(u,v): throw negative cycle exception Shortest Path in DAGs: Can be significantly optimized by knowing it is acyclic. CLR 536. Algorithm: Topologically sort vertices Set all distances infinite; all parents nil. For each vertex u in topological order: For each vertex v adjacent to u: if distance(v) > distance(u) + weight(u,v): parent(v)=u distance(v) = distance(u) + weight(u,v) SINGLE PAIR SHORTEST PATH ~~~~~~~~~~~~~~~~~~~~~~~~~ These algorithms take advantage of the fact that there is a single destination to try to search the most promising routes first. They are not addressed in CLR or HL, but see also dynamic programming techniques. Best-First Search (A*, A-Star): We need an admissible (does not overestimate) heuristic. In geometric problems, this can be the great-circle distance. The algorithm can be seen as a version of Dijkstra's algorithm with an added heuristic to prune dominated possibilities. It can also be seen as the most general search, with breadth-first, depth-first, and Dijkstra's algorithm all being special cases with certain heuristics. Most significant drawback can be memory usage. Use of "closed list" is optional if a solution is guaranteed to exist or if there is another way of rejecting cycles. If it is used, the graph must be monotonic (e.g. A->B->C costs no less than A->C) - this is obviously satisfied for map routes but might not be the case for arbitrary graphs. G cost: Actual shortest distance from source to node H cost: heuristic (low estimate) cost from node to destination F cost: G+H Algorithm (http://en.wikipedia.org/wiki/A*_search_algorithm): Set closed list to empty set Set open list to source node g(source)=0 while open list is not empty: set x to node with lowest F score if x is dest: return remove x from open list add x to closed list for each y which is a neighbor of x and is not in closed list: temp_g_score = g(x) + weight(x,y) temp_better = 0 if y is not in open list: insert y into open list calculate heuristic h(y) temp_better = 1 else if temp_g_score < g(y): temp_better = 1 if temp_better: parent(y) = x g(y) = temp_g f(y) = g(y) + h(y) Algorithm (http://www.autonlab.org/tutorials/astar.html): Set priority queue empty Set visited list empty Put source into queue with priority F While priority queue is not empty: Remove node with lowest F. Call it n If n is the destination, return success For each m that is a successor of (adjacent to) n: f(m)=g(m)+h(m) = g(n)+weight(n,m)+h(m) if m has not been seen OR m has lower F now than from before OR m is in the priority queue with heigher F: Put m in priority queue with f(m) Add/update visited list to include m,f(m),parent(m)=n else take no action on m If queue is empty and we have not returned already, fail! ALL PAIRS SHORTEST PATH ~~~~~~~~~~~~~~~~~~~~~~~ It is possible to do fundamentally better than running a single-source algorithm for every vertex. CLR chapter 26, page 550. Matrix Algorithm Concept: First construct all shortest paths of length 1. Then recursively add 1 to the maximum number of edges and recalculate until reaching |V|-1. Extend-shortest-path: Input matrix D distances, W weights n = rows(D) D' = new n by n matrix for i in 1 to n: for j in 1 to n: d'(i,j) = infinity for k = 1 to n d'(i,j) = min(d'(i,j), d(i,k)+w(k,j)) Naive-all-pairs-shortest-path: Start with D=W Call extend on D,W n-2 times. To do better, instead of processing by W each time, we can square W, then square that, etc, until reaching a power of 2 which is equal to or greater than |V|-1. We thus get a runtime of O(V^3 lg V) Faster-all-pairs-shortest-path: Start with D=W for(m=1; m*=2; m < n-1): Dnew = extend-shortest-path(D,D) ; note D,D *not* D,W Floyd-Warshall: Achives runtime of O(V^3). Basic concept is to construct shortest paths using vertices {1..k}, and then extend to k+1. d(0,i,j) = w (i,j) d(k,i,j) = min(d(k-1,i,j), d(k-1, i, j) + d(k-1, k, j) It remains to actually construct the paths in addition to calculating the distances. One way is to calculate a matrix of predecessor vertices on the shortest path. It is updated by replacing a value with k when the corresponding min distance changes. CLR page 558. A variant for calculating transitive closures only (reachable nodes) is given on 562. Johnson's: Algorithm for sparse graphs o(V^2 lg V + VE), so faster if E is smaller than V^2. Uses both Bellman-Ford and Dijkstra as components; see CLR 569. MAXIMUM FLOW ~~~~~~~~~~~~ ... ============================================================================== GEOMETRIC ALGORITHMS ============================================================================== LINE SEGMENTS ~~~~~~~~~~~~~ Cross Product: Remember it can be found as the determinant of the matrix: [x1 x2] [y1 y2] = x1y2-x2y1 Direction of Turn: Take a line segment p0-p1 and p0-p2. Translate p0 to the origin. Then calculate the cross product of p1 and p2. If it is positive, p0-p1 is clockwise from p0-p2. Two Line Segment Intersection: First, check whether bounding boxes intersect: max(x1,x2) >= min(x3,x4) AND max(x3,x4) >= min(x1,x2) AND max(y1,y2) >= min(y3,y4) AND max(y3,y4) >= min(y1,y2) Then, if this is true, check line segments for straddling each other with cross products. They intersect if p1-p3 and p1-p4 have different orientations relative to p1-p2: sign((p3-p1) X (p2-p1)) != sign((p4-p1) X (p2-p1)) If either of the cross products are exactly zero, the endpoint of one is on the other. If they're both zero, they are colinear. CLR 888 Many Line Segment Intersection: Do any to segments in an arbitrarily large set intersect? O(n lg n) Simplified algorithm assumes no vertical segments and no three segments intersecting at one point. CLR 892. The idea is to move a "sweep line" left to right (or any other direction). Each time an endpoint is encountered, check to see whether any of the segments which are intersected by the sweep linehave changed order relative to each other. If so, they crossed. CONVEX HULLS ~~~~~~~~~~~~ Graham's Scan: O(n lg n). CLR 900. Uses a stack to keep track of verticies which are tentatively on the convex hull. Start with p0 as the bottommost point (or leftmost, etc) Sort points p1 through pn by angle to p0 in clockwise order Push p0, p1, p2 onto stack for i in 3 to n: while the angle from 2nd-from-top, top-in-stack, and pi is nonleft: pop from stack push pi onto stack return points in stack Jarvis' March: O(nh) where h is the number of points on the hull. Start with lowest point. Calculate the polar angle of all the other points to this one. Pick the one with the lowest value. Then, recalculate with regard to that point. Repeat until getting back to the original one. Envision it as sweeping an scan arm in until it hits a point each time, then moving to that point and repeating. CLR 906 advocates an approach of keeping a separate left and right side for simplicity in calculating angles. 2D POINTS ~~~~~~~~~ Closest Pair of Points: CLR 908 LOCALITY SENSITIVE HASHING ~~~~~~~~~~~~~~~~~~~~~~~~~~ Probabilistic assignment of similar items to the same buckets. It is primarily used in high-dimensional data; it is asserted that by 10-20 dimensions, k-d trees and friends are searching over a significant fraction of the space (Gionis, Indyk, Motwani 1999) http://www.mit.edu/~andoni/LSH/ BSP (BINARY SPACE PARTITION TREE) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Overview: Interior nodes represent splitting lines (or plans, hyperplanes, etc in higher dimensions), and also contain objects which are entirely within the splitting plane. Leaf nodes are convex and contain 0 or 1 objects or object fragments (since objects can be split). A BSP which is built only using splitting lines which are extensions of input segments (or planes which are extensions of input polygons) is called "auto-partition (BCKO 262). This can be used with the painter's algorithm to select which things to paint (BCKO 263). Other canonical uses include shadow generation, set operations on polyhedra, cell decomposition methods for motion planning, and as an index in GIS (BCKO 278 has references for each). Both K-D trees and quad/octrees are a special case of BSP trees. In Practice: The size of the tree is O(n+2n*ln(n)) in 2D and O(n^2) in 3D space on disjoint triangles. In practice, however, they perform well on low density data. Thus, define the idea of density - the smallest lambda such that any ball B intersectis at most lambda objects who have a diameter greater than or equal to that of B (i.e. small objects are not counted). (BCKO 271). However, the randomized auto-partition algorithm does not perform well in the low-density case. Define a "guard" as the corner of an axis-aligned square containing an object. Split initially as if in a quad tree, but then if one quadrant contains too many guards, shrink towards a corner using a heuristic k. The details are given in BCKO. For disjoint line segments in the plane, there is a tree of size O( log(lambda)), where lambda is the density. For triangles in 3D, this is O(n lambda), which varies from n to n^2 as lambda varies from 1 to n. K-D Trees ~~~~~~~~~ K-D trees are a special case of BSP trees. Each node is a point in K-dimensional space, and each cutting hyperplane is in one of the K dimensions. Basic Case: Assume that no two points exactly match in x or y. Split first on x, then split each of these on y, then on x, etc. Split on the median point in the relevant dimension. Finding the median is the most expensive part; presort the points in both x and y to facilitate this. To query for points within a rectangle, recurse down the tree, following only children which intersect the rectangle, checking the points in regions which are not entirely within the rectangle, and taking the union of their results. For m answers, the time to do the query is O(sqrt(n) + m) (BCKO 104). QUADTREES ~~~~~~~~~ Quadtrees are also a special case of BSP trees. Basic Version for Points: Split until each square contains at most one point. Make an arbitrary decision about <= vs. >=. It is possible to store the size of each square at the nodes, or recompute it while walking the tree. (BCKO 309) The tree can be quite unbalanced, and the height is not a trivial function of the number of points. However, it is at most log(s/c) + 3/2, where c is the smallest distance between any two points and s is the side length of the initial square (BCKO 311) Finding Neighbors: Let's say we want to find the north neighbor. As long as we are NOT one of the south neighbors of our parent, move up. Then report the northern neighbor from us. If we want a leaf, we have to them walk back down, always choosing the souther neighbor. If we hit the root while going up, we are at an edge and there is no neighbor in the desired direction. (BCKO 313) Balanced Quadtree: Take the quadtree as balanced if any two neighboring squares differ by at most 2x in size. To balance a tree, put all the leaves into a list. For each leaf, find neighbors in all four directions and see if they are to small. If so, the leaf has to be split. This might cause its neighbors that now need to be split, so insert them into the list if so. The balanced version does not have asymptotically more nodes, and can be constructed in O((d+1)m) time (BCKO 315). More Information: BCKO 315 talks about mesh generation from quadtrees. S. Aluru, "Quadtrees and Octrees" discusses compression. Eppstein et al "The skip quadtree" combines with skiplists for efficient insertion and removal. R-TREES ~~~~~~~ Source for this section is original paper: http://www.cs.fiu.edu/~vagelis/classes/COP6727/publications/gutman-rtree.pdf "R-Trees: A Dynamic Index Structure for Sptial Searching" by Antonin Guttman Overview: The R stands for Rectangle, which is used as a bounding box. R-trees are designed for paged storage, and resemble B-trees in this regard. Each node has a maximum number of entries. Each entry in a non-leaf node stores pointers to children and the bounding box of each child. Individual entries can have nonzero size - they are rectangles, not points. Each node has a maximum of M entries and a minimum of M/2 or less. The root is exempt from the minimum. Search: The input to search is a query box. Start at the root, and move down into those children which overlap with the query box. Because more than one subtree may need to be visited, it is not possible to guarantee good worst-case performance. Insertion: Start at the root. Choose the rectangle which requires the least enlargement to hold the new entry. At the leaf, insert if the leaf is not full, or split otherwise. Adjust: Start at a leaf N. If the leaf was just split, save the second one. Move up to the parent and adjust the coverage rectangle to reflect If there is a second node NN saved from the split, try to add it to the parent as well. If the parent is now overfull, split the parent as well and repeat. Splitting: The M+1 items must be divided into 2 new nodes. It is desirable to divide them such that the overall area is lowest, but it is prohibitive to exhaustively check all combinations. Several algorithms are described in the paper, both linear and quadratic with respect to M; the conclusions say the linear algorithm is good enough, and the slightly worse quality of the splits is not significant. For this reason, I will describe only the linear algorithm. Along each dimension, find the rectangles with the highest low side and the lowest high side; record the separation. Normalize by the size of the entire set in that dimension. Choose the most extreme pair in any dimension. These are the two "seeds". Then, choose any of the remaining entries, and add it to the set which is enlarged the least to include it. If you get to a point where all entries are done, or where one set must have the remainder of the available entries assigned to it to hit the minimum, do so and stop. ============================================================================== DYNAMIC PROGRAMMING ============================================================================== Overview: Dynamic programming is a recursive technique in which a subproblem can be solved and then used to solve the larger problem. It is applicable to problems like single pair shortest path (HL 425) or ordering matrix multiplication (CLR 302). Traditionally, the dynamic programming problem is computed bottom-up instead of actually recursing. In the matrix case, compute all the subsequences of length 2, then use those to compute all the length 3, and so on. HL suggests that dynamic programming is applicable where the problem has multiple stages with a state and a policy decision, the costs of each stage of the problem is independent of the others, and there is a recursive relationship to find the cost of stage n+1 using the cost of stage n (HL 430). CLR puts the same thing a bit differently, referring to optimal substructure and overlapping subproblems (CLR 309). The latter corresponds to a recursive solution needing to solve the same subproblems repeatedly. "Memoization" - basically caching the results of the subproblems - provides much of the benefit while allowing for the traditional top-down approach instead of bottom-up. Other Examples: CLR 314: longest common subsequence. Define the subproblems in terms of sequence prefixes, of which there are only m*n for strings of length m and n. CLR 320: optimal polygon triangulation. HL 453: probabilistic dynamic programming with expected values ============================================================================== LINEAR PROGRAMS ============================================================================== CLR 539 (short). Strang chapter 8. HL in depth. SIMPLEX METHOD ~~~~~~~~~~~~~~ Geometric Interpretation (HL 82): Start at origin Move to a better adject corner point feasible solution (select the direction based on greatest slope) When no adjacent CPF solutions are better, stop Slack Variables (HL 87): Introduced in order to make all constraints an equality New problem is called the augmented form Negative slack variable indicates infeasible solution Basic solution: Augmented corner point solution Basic feasible solution: Augmented corner point feasible (CPF) solution Algorithmic Interpretation: Number of basic variables is number of equations Non-basic variables are the others Set nonbasic variables to 0 Solve for basic variables To look at next solution, change on basic variable Algorithm HL92; tabular form HL95. Also Strang ============================================================================== NONLINEAR PROGRAMS ============================================================================== ============================================================================== INTEGER PROGRAMS ============================================================================== ============================================================================== MATRIX ALGORITHMS ============================================================================== LU DECOMPOSITION ~~~~~~~~~~~~~~~~ Use Gaussian elimination to turn original matrix A into an upper diagonal matrix U. The coefficients of the rows which were subtracted go into L. This gives A=LU. Symmetric matrices can be decomposed further into A=LDL' (' is transpose here). Create D by factoring out a diagonal matrix from U such that the pivots are all 1. What's left will be L'. POSITIVE DEFINITE MATRICES ~~~~~~~~~~~~~~~~~~~~~~~~~~ Positive definite matrices indicate the presence of a minimum. 1. A matrix with positive pivots - which require a positive diagonal and positive determinant - is a positive definite matrix. 2. A is positive definite if x'Ax=0 for all nonzero x. Strang 18 ============================================================================== AUCTION ALGORITHMS ============================================================================== ... section pending ... ============================================================================== Well, that was my file of notes on algorithms. Parts of it may not be finished yet. It may not be very useful to you, because it is so abbreviated, but it is useful to *me* because it is abbreviated in a way that corresponds well with the way things are already stored in my memory. Notation: O() is used for an upper bound, Omega for a lower bound, and Theta for both. lg is a base-2 logarithm. CLR: Cormen, Leiserson, Rivest. Introduction to Algorithms. 1st edition HL: Hillier and Lieberman. Introduction to Operations Research. 6th ed BCKO: de Berg, Cheong, van Kreveld, Overmars, Computational Geometry, 3rd Ed Strang: Gilbert Strang, Introduction to Applied Mathematics To print this file nicely: enscript -B -o alg.ps -1 algorithms.ref psnup -c -4 -b-.25in -m.25in alg.ps alg2.ps lpr alg2.ps vim: ai