_ _ _ _ _
/ \ | | __ _ ___ _ __(_) |_| |__ _ __ ___ ___
/ _ \ | |/ _` |/ _ \| '__| | __| '_ \| '_ ` _ \/ __|
/ ___ \| | (_| | (_) | | | | |_| | | | | | | | \__ \
/_/ \_\_|\__, |\___/|_| |_|\__|_| |_|_| |_| |_|___/
|___/
==============================================================================
SORTING ALGORITHMS
==============================================================================
Selection Sort:
Always O(n^2) regardless of data
Suitable for linked lists, where removal and addition are cheap
Generally dominated by insertion sort, which does half as many comps on avg
Only advantage is when writes are expensive: it does O(n) swaps where
insertion sort may do O(n^2) writes
Primarily of historical interest
Algorithm:
Find the smallest element
Swap it with the first element
Repeat until done
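The steps above, as a runnable Python sketch (illustrative, not CLR's code):

```python
def selection_sort(a):
    """In-place selection sort: O(n^2) comparisons, only O(n) swaps."""
    for i in range(len(a) - 1):
        # Find the smallest element in the unsorted suffix a[i:]
        smallest = i
        for j in range(i + 1, len(a)):
            if a[j] < a[smallest]:
                smallest = j
        # Swap it with the first unsorted element
        a[i], a[smallest] = a[smallest], a[i]
    return a
```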
Insertion Sort:
Desirable when data are already almost sorted; otherwise O(n^2)
Stable, in-place
Fewer comparisons than selection sort but more writes
May be faster than quicksort or mergesort for smallish number of elements
When comparisons are expensive, a variant with binary search can be used
CLR page 8
Algorithm:
Start at item 2
Move the item left until it is not smaller than the item left of it
Repeat until done
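A Python sketch of the above; the inner while loop does the leftward move by shifting larger elements right:

```python
def insertion_sort(a):
    """Stable, in-place insertion sort; O(n) when nearly sorted."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements right until key's slot is found
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```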
Shell Sort:
Variant of Insertion Sort which moves in decreasing gaps.
Running time order is a function of the gap sequence.
Sequences achieving O(n^(4/3)) and O(n*lg(n)^2) are known.
There is a proof that O(n*lg(n)) cannot be achieved.
For gap k, sorts the 1st, (1+k)th, (1+2k)th, ... elements, then the
2nd, (2+k)th, etc.
A favorable sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750.
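A Python sketch using the gap sequence above, applied largest gap first (each pass is a gapped insertion sort):

```python
# Gap sequence from the notes, applied in decreasing order
GAPS = [1750, 701, 301, 132, 57, 23, 10, 4, 1]

def shell_sort(a):
    for gap in GAPS:
        if gap >= len(a):
            continue
        # Gapped insertion sort: every gap-th subsequence becomes sorted
        for i in range(gap, len(a)):
            key = a[i]
            j = i - gap
            while j >= 0 and a[j] > key:
                a[j + gap] = a[j]
                j -= gap
            a[j + gap] = key
    return a
```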
Merge Sort:
Algorithm:
If there is more than one element:
Sort first half (recursive call)
Sort second half (recursive call)
Merge two halves
Notes:
Requires additional storage (not in-place). Recursive.
CLR page 13
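A Python sketch of the recursive algorithm; the merge prefers the left half on ties, which keeps the sort stable:

```python
def merge_sort(a):
    """Recursive merge sort; O(n lg n), uses O(n) auxiliary storage."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Merge the two sorted halves
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:      # <= keeps the sort stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```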
Heap Sort:
O(n lg n). Generally dominated by quick sort in practice.
See section on heaps for details.
Quick Sort:
O(n^2) worst case, but O(n lg n) average case, and constant factors low.
In-place memory usage. Good performance requires balanced partitioning.
The naive algorithm, which uses the initial element as the pivot, hits
its worst case on already sorted input.
CLR page 153; entirety of Chapter 8.
Quicksort:
If input array is greater than one element
Partition it, getting divider as return value
Run quicksort on each side
Partition:
Set pivot = first element (other more intelligent methods possible)
Take two indices, one at the front and one at the back.
Advance the front index until it points to an element >= pivot
Retreat the back index until it points to an element <= pivot
If the indices have crossed, return index location
Otherwise, swap elements and repeat
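A Python sketch of the two-index (Hoare-style) partition described above; note the recursion is on (lo..p) and (p+1..hi), not (lo..p-1), since the returned divider may still hold an element from either side:

```python
def partition(a, lo, hi):
    """Hoare partition: pivot = first element, indices advance/retreat."""
    pivot = a[lo]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:          # advance until element >= pivot
            i += 1
        j -= 1
        while a[j] > pivot:          # retreat until element <= pivot
            j -= 1
        if i >= j:                   # indices have crossed
            return j
        a[i], a[j] = a[j], a[i]      # otherwise swap and repeat

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        quicksort(a, lo, p)
        quicksort(a, p + 1, hi)
    return a
```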
Radix Sort:
Sort on least significant digit first (an O(n) operation). Then
next least, etc. Of course a stable sort must be used. Not in-place.
Note that time is a function of the size of the numbers and not just
the number of items.
CLR page 178.
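A Python sketch of LSD radix sort on nonnegative integers; the per-digit pass here is a stable bucket distribution rather than CLR's counting sort, but the idea is the same:

```python
def radix_sort(a, base=10):
    """LSD radix sort for nonnegative integers. Not in-place; time
    depends on the number of digits, not just the number of items."""
    if not a:
        return a
    exp = 1
    while max(a) // exp > 0:         # one pass per digit position
        buckets = [[] for _ in range(base)]
        for x in a:                  # stable: preserves order within a bucket
            buckets[(x // exp) % base].append(x)
        a = [x for b in buckets for x in b]
        exp *= base
    return a
```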
Bucket Sort:
Assumes input is randomly distributed over an interval. Then divides
this interval into n buckets and puts the n input numbers into them.
Use any kind of sort within the bucket, since the number of items will
be small. A linked list for buckets is common.
If the input is *not* evenly distributed, the within-bucket sort
can dominate.
CLR page 180.
==============================================================================
PARTIAL SORTING/SELECTION ALGORITHMS
==============================================================================
Min and max:
There is a cute trick to reduce the number of comparisons when finding
both min and max. Compare each two elements pairwise, and then compare
only the larger with the max and the smaller with the min. This gets
you 3n/2 comparisons instead of 2n. CLR page 186.
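The pairwise trick, sketched in Python:

```python
def min_and_max(a):
    """Find both min and max with ~3n/2 comparisons instead of 2n."""
    n = len(a)
    if n % 2:                        # odd length: seed with first element
        lo = hi = a[0]
        start = 1
    else:                            # even length: seed with first pair
        lo, hi = min(a[0], a[1]), max(a[0], a[1])
        start = 2
    for i in range(start, n - 1, 2):
        x, y = a[i], a[i + 1]
        if x > y:
            x, y = y, x              # 1 comparison orders the pair
        if x < lo:                   # only the smaller vs. running min
            lo = x
        if y > hi:                   # only the larger vs. running max
            hi = y
    return lo, hi
```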
Selection Algorithm for Ith Largest Element:
Proceed like a (randomized) quicksort, but each time recurse only on the
half which will contain the desired answer. CLR page 187.
There is further discussion on ways of guaranteeing a good pivot.
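A Python sketch of randomized selection; this version uses a three-way split on the pivot for clarity instead of CLR's in-place partition:

```python
import random

def select(a, i):
    """Return the i-th smallest element (0-indexed). Expected O(n)."""
    pivot = random.choice(a)
    lt = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    gt = [x for x in a if x > pivot]
    if i < len(lt):
        return select(lt, i)              # answer lies on the low side
    if i < len(lt) + len(eq):
        return pivot                      # answer is the pivot itself
    return select(gt, i - len(lt) - len(eq))
```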
==============================================================================
MISCELLANEOUS DATA STRUCTURES
==============================================================================
STACKS, QUEUES
~~~~~~~~~~~~~~
CLR chapter 11. Queues generally implemented in a ring buffer.
Also consider deques, not in CLR.
LINKED LISTS
~~~~~~~~~~~~
Remember that you can make a ring buffer with a sentinel between
the beginning and end, which is sometimes convenient. Consider the merits
of singly vs. doubly linked lists. Insertion and deletion are cheap;
indexing is expensive.
Skip lists:
The bottom layer is an ordinary linked list. Each higher layer
contains an element with some probability, often 1/2 or 1/4.
The expected search time is O(log N); P can be varied to trade
storage space against search time.
They have properties comparable to trees with random insertions, but
without requiring insertions to be random.
The best place to start a search is the level where we expect 1/p
nodes, or log_1/p(n). However, starting at the highest level is
also generally OK and adds only a small constant factor.
It is also suggested that the level of a new node could be constrained
to be at most 1 higher than the existing maximum level, which works
well in practice but is difficult to analyze.
The absolute fastest search is with p=1/e, but its search time is only
about 94% of that of the 1/2 and 1/4 cases. 1/4 is just as fast on
average as 1/2, with less storage, but the variability is higher.
See ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf
DISJOINT SETS
~~~~~~~~~~~~~
CLR page 440. Used in Kruskal's algorithm for minimum spanning trees.
The set has a representative item, which may be arbitrary or may have a
specific algorithm for its selection. The operations make a set from a
singleton, unite sets, and find the set containing an item.
Linked List Representation:
Representative is first object. Each object contains a pointer
back to the representative. Thus find-set is trivial because it
gets the representative from the object (note that you must have
the set object, not a copy of the key). Union then just appends
one linked list to the other, but it must update the representative
pointers, taking O(m) time. As a heuristic for a minor improvement,
always add the smaller set to the larger. For m operations, of
which n are make-set, this gives O(m+n lg n); CLR 445.
Forest Representation:
Each tree is a set. Each member points only to parent, not to root
of tree. Find-set then requires walking up the tree to the root.
Union points the root of one tree at the root of the other tree (they're
not binary trees). This by itself is not necessarily an improvement,
as no assertions about favorable height of trees can be made; they
could be degenerate (linked lists).
Union by Rank:
Make the root of the tree with fewer nodes point at the root of the
tree with more nodes instead of the other way around (CLR 447). To do
this efficiently, we must maintain the rank as we go instead of trying
to compute it when doing the union.
The rank is an upper bound on the height, but may conceivably
overestimate it. If the ranks of the trees being joined are exactly
equal, the rank of the new root will go up by 1.
Path Compression:
When performing a find, update all nodes on the path to point directly
to the root. Rank records are unchanged, and thus could conceivably
become an overestimate. The compression is a two-pass procedure.
Running Time:
With both heuristics, O(m alpha(m,n)) for m operations on n elements.
Alpha is the Ackermann function inverse, <= 4. Proof is difficult.
CLR p 453 presents a weaker proof.
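Both heuristics, sketched in Python:

```python
class DisjointSet:
    """Forest representation with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n          # upper bound on height, maintained as we go

    def find(self, x):
        # Path compression: point nodes on the path directly at the root
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:    # union by rank: lower under higher
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:   # equal ranks: new root grows by 1
            self.rank[rx] += 1
```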
HASH TABLES
~~~~~~~~~~~
Overview:
O(1) if you get it right, but of course O(n) if your hash sucks and
everything goes in one bin. In Chaining, singly linked list is used to
store multiple elements in one bin. A doubly-linked list makes deletion
cheap if you already have a pointer to the element.
Consider a table with n elements and m slots; the load factor alpha
is n/m. An unsuccessful search is then theta(1+alpha) on average.
Alternatively, you can store everything in the table itself, by moving
onward when you have a collision (linear probing, quadratic probing,
double hashing, etc.). Note that special behavior is required on deletion
so that subsequent values are not hidden by the empty slot stopping the
search. CLR prefers double-hashing, which generates the most sequences,
and describes it on page 235.
Hash Functions:
CLR has a very short discussion of hash functions on page 226.
It is common to interpret the key as a natural number, such as by taking
a string as a number in some radix. One simple approach is dividing this
number by a prime which is not too close to a power of 2 (CLR 228) and
taking the remainder. The second is to multiply the key by a floating
point number A, take the fractional part, and multiply by an integer m,
often a power of 2. Without explanation, (sqrt(5)-1)/2 is suggested for
A, with a reference to Knuth.
For double hash functions to support open addressing, an example is:
m is prime, m' is slightly less than m
h1(k) = k mod m
h2(k) = 1 + (k mod m')
h(k,i) = (h1(k) + ih2(k)) mod m
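The probe sequence above, sketched in Python (since h2 is never 0 and m is prime, every slot is eventually probed):

```python
def make_double_hash(m, m_prime):
    """Double-hashing probe sequence. m should be prime; m_prime is an
    integer slightly less than m, as in the h1/h2 scheme above."""
    def probe(k, i):
        h1 = k % m
        h2 = 1 + (k % m_prime)       # in 1..m_prime, never 0
        return (h1 + i * h2) % m
    return probe
```

For example, with m=13 and m'=11 (arbitrary small illustrative values), the probes for a fixed key visit all 13 slots as i runs from 0 to 12.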
Birthday Problem:
As a very rough approximation, there is ~50% probability of a collision
once there are 2^(N/2) things in an N bit hash.
A usable approximation is
p(n) = 1 - e^(-n^2/(2*size))
For n=2^32 and size=2^64, this gives p=1-e^(-.5) = 0.4
Working this the other way around, 0.5 = e^(-0.6931), or
n^2 = 1.39 * size
so for a 64-bit hash, 2^(32.24) things result in 50% collision probability.
http://en.wikipedia.org/wiki/Birthday_problem has several other formulae
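The approximation above, in Python, with size = 2^bits:

```python
import math

def collision_probability(n, bits):
    """p(n) ~= 1 - e^(-n^2 / (2 * 2^bits)) for n items in a bits-bit hash."""
    return 1.0 - math.exp(-(float(n) * n) / (2.0 * 2.0 ** bits))
```

For n = 2^32 items in a 64-bit hash this gives 1 - e^(-0.5), about 0.39, matching the worked example above.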
Consistent Hashing:
From the paper "Consistent Hashing and Random Trees". Assume that each
person can see at least a constant fraction of the buckets which exist.
The desired properties are balance (even assignment of items to visible
buckets), monotonicity (when new buckets become visible, things may move
to the new ones, but not one of the old ones), spread (the number of
different buckets to which something gets assigned should be small),
and load (the buckets should all have the same number of items).
Suppose rb and rx are random functions. rb maps buckets to the unit
interval and rx maps items. fv(i) is the bucket b in V which minimizes
the distance between rb(b) and rx(i). If there are at most C buckets
across all views, each bucket is actually replicated k*log(C) times, which
are then randomly mapped with rb().
A simple implementation uses a binary search tree to store the
correspondence between unit interval segments and buckets. There are
k*C*log(C) intervals, so the depth is log(C); inserting or removing k
points takes log^2(C).
To get the hash computation time down to O(1), divide the interval into
k*C*log(C) equal length segments, and keep a separate tree for each,
with expected size O(1). Since these must shrink as buckets are added,
use intervals of length 1/2^x <= 1/(k*C*log(C)), choosing the largest x.
Then divide them in half. It is also important that the expected length
of a run of empty intervals is small, since they must all be updated when
a bucket is added.
==============================================================================
HEAPS
==============================================================================
CLR page 401:
Procedure Binary Heap Binomial Heap Fibonacci Heap
(worst) (worst) (amortized)
Make-Heap 1 1 1
Insert lg n lg n 1
Minimum 1 lg n 1
Extract-Min lg n lg n lg n
Union n lg n 1
Decrease-Key lg n lg n 1
Delete lg n lg n lg n
BINARY HEAPS
~~~~~~~~~~~~
The heap is organized such that each child is less than its parent.
It can be stored in an array, with the first row being element 1,
the second row being elements 2 to 3, etc. Thus, the left side must
be populated first.
Generally, the heap sort is dominated by the quick sort, but heaps
are more useful as priority queues.
CLR chapter 7.
Heapify:
Input is a location and heap. Assumes children are proper heaps, but
item itself might be too small.
O(lg(n)) (from height of heap)
Algorithm:
If there is a left child:
Set largest= the larger of self and left child top
If there is a right child:
If top of right child > largest
largest=top of right child
If largest is not self
Exchange
Recursively call on new largest
Build-Heap:
Repeatedly calls heapify to make a heap from an unordered array.
Start at the element which is half the length, and call heapify
on everything from there decreasing to the first element.
Running time is O(n), proven by showing that the running time
of heapify is lower at nodes close to the bottom.
CLR page 145
Heap Sort:
O (n lg n)
Build the heap
For i in last-element to element 2:
exchange first and ith element (puts largest element last)
Decrease global heap size by 1
heapify starting at first element (puts new largest element first)
Heap Insert:
O(lg n)
Put new element in at end. Bubble it up until it is larger than its
children.
More formally:
Increase length by 1
i=heap size
while i>1 and parent of element i < new value
set element i = element i's parent
i = parent(i)
element i = new value
Heap Extract:
O(lg n)
Take off first element
Move last element to front
Rerun heapify()
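The binary-heap operations above, sketched in Python with a 0-indexed array (CLR's presentation is 1-indexed, so the child formulas differ by one):

```python
def heapify(a, i, size):
    """Sift a[i] down; assumes both subtrees are valid max-heaps. O(lg n)."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2   # 0-indexed children
        largest = i
        if left < size and a[left] > a[largest]:
            largest = left
        if right < size and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]  # exchange and continue below
        i = largest

def build_heap(a):
    """O(n): heapify from the last internal node down to the root."""
    for i in range(len(a) // 2 - 1, -1, -1):
        heapify(a, i, len(a))

def heap_sort(a):
    """O(n lg n): repeatedly move the max to the end and re-heapify."""
    build_heap(a)
    for end in range(len(a) - 1, 0, -1):
        a[0], a[end] = a[end], a[0]          # largest element goes last
        heapify(a, 0, end)                   # heap size shrinks by 1
    return a

def heap_push(a, value):
    """O(lg n): append at the end, then bubble up."""
    a.append(value)
    i = len(a) - 1
    while i > 0 and a[(i - 1) // 2] < a[i]:
        a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
        i = (i - 1) // 2

def heap_extract_max(a):
    """O(lg n): take the root, move the last element up, re-heapify."""
    top = a[0]
    a[0] = a[-1]
    a.pop()
    heapify(a, 0, len(a))
    return top
```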
BINOMIAL HEAPS
~~~~~~~~~~~~~~
Binomial heaps are interesting when we need to take the union of them, which
can be done in lg(n) instead of n. Otherwise, just use ordinary binary
heaps. CLR chapter 20, page 400.
A binomial heap is made up of binomial trees. There is at most one tree
of each degree. The trees are heap-ordered - the key of a node is greater
than or equal to its parent. (Note this is the opposite convention from
the max-heaps used above in describing binary heaps.) See also section
on binomial trees.
Representation: Keeps track of parent, left child, and right sibling, and notes
own degree. Roots of trees are in singly linked list in increasing degree,
giving an alternative meaning to the sibling field.
Make-binomial-heap:
Creates an empty heap with one null pointer.
Minimum:
The minimum will be at the root of a tree, so just search the initial
linked list, which will be of length order lg(n).
Binomial-Link (child y,parent z):
Set y's parent to z
Set y's sibling to z's child
Set z's child to y
Increment z's degree
Union (a, b):
We will first merge the root lists. This could leave as many as two
trees of each degree, and we need at most one. Thus, for each two that
are the same degree, we merge them into one of the next higher degree,
working our way upward. It gets a little complicated because after
joining two trees, we might temporarily have THREE of the same degree,
and we have to leave the first as-is and merge the last two.
e.g. 1,1,2,2,3 -> 2,2,2,3 -> 2,3,3 -> 2,4
Time proportional to number of trees, or lg(n)
Details:
Create a new heap H.
Merge a and b into H, joining root lists in monotonically inc. degree order
if H is empty, return
Set prevx to null
set x to head of H
set nextx to x's sibling
while nextx:
if degree(x) != degree(nextx) ;1 of this degree
OR (sibling(nextx) exists AND degree(sibling(nextx))==degree(x)) ;3
; Don't need to merge. Advance
prevx = x
x = nextx
else if key(x) <= key(nextx)
; make nextx a child of x
sibling(x)=sibling(nextx)
binomial-link(nextx, x)
else
if not prevx:
head = nextx
else
sibling(prevx)=nextx
binomial-link(x, nextx)
x=nextx
nextx = sibling(x)
Insert:
Make a new heap to contain the element.
Set its parent, child, and sibling to null
Set its degree to 0
Union(existing heap, new one element heap)
Extract-min:
Find the tree with the smallest root
Make a new heap out of the remainder of this tree:
The linked list will be the root's children in reverse order
Unite the old heap (less the removed tree) and the new heap.
(CLR page 413, with illustrative figure)
All O(lg n), proportional to number of trees
Decrease-Key:
Error if key has gone up
Otherwise, bubble node up until it is not smaller than parent
Delete:
Decrease key to -infinity
Extract minimum (cute eh?)
FIBONACCI HEAPS
~~~~~~~~~~~~~~~
Insert, minimum, and union are O(1). Extract-min and delete are O(lg n).
Decrease-key is O(1) amortized, significant as a basis in graph algorithms
(see for example Dijkstra's algorithm). CLR chapter 21, p420.
...
==============================================================================
TREES
==============================================================================
Most operations are O(h), where h is height. If the tree is balanced,
this is lg(n), but it can be n if the tree is degenerate (think of
inserting in sorted order).
BINARY TREES
~~~~~~~~~~~~
Consider how to store equal values - CLR lets them be the left or right child.
If you do not store pointers to parent, successor (and thus deletion) may be
problematic.
Recursive search is trivial; so are min and max.
Search:
If equal, return
If empty, return not found
If value less than current node, search left child
Otherwise search right child
Successor:
If right child is not empty, return its minimum.
Otherwise, move to parent until you find a parent to whom you are the
left child instead of the right.
O(h). Requires ability to move up to parent.
Insertion:
Like a search, but when you get to the bottom, you insert a new node.
Deletion:
One of the more complex operations. CLR page 251.
Algorithm:
Node z is to be removed; p is its parent.
If z has no children:
Modify p to replace z with a null
If z has one child:
Make this child a direct child of p
If z has two children:
Take z's successor y (which has no left child - this is always true)
Remove y
Replace z with y
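A Python sketch of insertion and deletion; for the two-child case this version copies the successor's key into z rather than splicing nodes, a common simplification of CLR's pointer surgery:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Walk down as in a search; attach a new leaf at the bottom."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    elif root.left is None:
        return root.right            # zero or one child: splice out
    elif root.right is None:
        return root.left
    else:
        succ = root.right            # successor = min of right subtree,
        while succ.left:             # which never has a left child
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root

def inorder(root):
    return inorder(root.left) + [root.key] + inorder(root.right) if root else []
```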
RED-BLACK TREES
~~~~~~~~~~~~~~~
Stores one bit of color on each node. Assurance is that no path is more
than twice as long as any other.
Coloring Invariants:
Every node is either red or black
Every leaf is black and is NIL
If a node is red, both its children are black
Every simple path to a leaf contains the same number of black nodes
(some definitions also give the root as always black. CLR does not.)
Max height 2*lg(n+1); proof CLR page 264. General idea is that you
cannot have two red nodes in a row, so the shortest is all black and
the longest alternates.
Other variants exist in which leaves are not always NIL. This changes some
of the fundamentals and is considered more complex. There are also variants
in which the edges are spoken of as colored instead of the nodes. Analogous
to B-trees of order 4 (discussed in the Wikipedia article).
Basic operation for restoring red-black property after insertions and
deletions is "rotation":
y x
x/ \g <------> a/ \y
a/ \b b/ \g
Left-Rotate:
y = right(x)
right(x) = left(y)
if left(y)
parent(left(y))=x
parent(y)=parent(x)
if x has no parent:
root=y
else
if x is its parent's left child
left(parent(x))=y
else
right(parent(x))=y
left(y)=x
parent(x)=y
The basic idea is to insert and delete as in a normal binary tree, but then
fix up the tree afterwards.
Insert:
Insert new node x as in an ordinary binary search tree
Color x red
While x is not the root and x's parent is red:
If x's parent is the left child of its parent:
y = x's parent's right child
if y is red:
Color x's parent black
Color y black
Color x's grandparent red
Move x up two levels (to its current grandparent)
else
if x is the right child of its parent:
x = x's parent
left-rotate at x
color x's parent black
color x's grandparent red
right-rotate at x's grandparent
Else:
(exactly the same thing, mirrored)
Color root black.
O(lg(n)) CLR 268
Delete:
Keep in mind that on the binary tree, leaves have null pointers to their
children, but in the red-black tree, the leaves are special nil
sentinels colored black.
The transcription of the algorithm here is omitted for reasons of time
and space. It is easy to look up online or in standard references.
Augmentations for Order:
Augmentations to find ith largest element, and the rank of an element you
have, are presented in CLR chapter 15 (page 281). Augment the data by
including the number of nodes in the subtree. Finding the ith largest is
obvious. To find the rank of an element you have, move up, adding the
value of the left child each time you are a right child of your parent.
The information can be maintained without increasing the order of the
updating algorithms.
Add 1 if you want to count starting at 1.
INTERVAL TREES
~~~~~~~~~~~~~~
CLR describes augmenting an RB tree (or presumably a normal BST) for
intervals:
Store both endpoints of interval in node; order by low endpoint. Augment
by storing maximum high endpoint of any entry in subtree. When
searching, go left if there is a left child and the maximum of all
things in the left subtree is >= the left end of the search interval.
CLR 290
This finds ONE interval which overlaps, not ALL, and is thus not the
most useful.
For a completely different take on interval trees, see
http://www.dgp.toronto.edu/people/JamesStewart/378notes/22intervals/
The approach here is to break up the space into "elementary" intervals,
which have no interval endpoints within them (thus each new interval
creates two new breaks, if the endpoints don't match any other).
Notionally, each leaf contains one elementary interval and is annotated
with the intervals that contain it, but this is inefficient. Instead,
think of internal nodes as "spanning" the union of all their leaves.
Store an interval in the highest internal node which is completely
inside it; each interval can be stored at most twice at each level.
To query all intervals which overlap a number k, simply walk down to the
leaf containing k, accumulating all the intervals encountered along
the way.
B-TREES
~~~~~~~
Designed for use on disk or other slow-access data storage media. B-trees
can have a branching factor greater than 2; they are not restricted to be
binary. Each node stores one fewer value than its number of children;
nodes are commonly sized to match a disk page. CLR chapter 19, page
381.
Properties:
Every node has the number of keys k, the keys, and a flag for being a leaf.
Internal nodes contain k+1 pointers to children
The keys separate the groups of children
Every leaf has the same depth
Let t be the minimum degree. Nodes must have t-1 to 2t-1 keys inclusive.
The root is exempt from the minimum.
Search:
Straightforward generalization of tree search. CLR presents search
of keys as linear, not binary. Page 388.
Create:
Create a node. Set it as leaf. Set its number of keys to 0. Write.
Split-Child:
Inputs: A nonfull internal node x, and index i, and a full node y which
is the ith child of x. If y is the root, the tree grows in depth.
Move median key from full node to parent. Move second half of keys
(and children if applicable) to new node. Insert old median key
and new node as child after it into parent. CLR page 390.
Insert:
This one-pass algorithm splits all full nodes when descending through them,
so that we do not have to backtrack later. CLR page 392.
Algorithm:
If root is full, allocate new root and make existing root its only child.
Then call split-child on it.
If we are at a leaf:
Insert the new key, shifting the others over. Return.
If not:
Descend to the appropriate child
If it is full, split it immediately and move to appropriate new node
Recursively call insert at new location
Delete:
This one-pass algorithm makes sure all nodes it descends through have
at least t keys (one more than the minimum), to reduce backtracking
(but it cannot be completely eliminated). CLR page 395.
Arguments: key k to delete, node x (initially the root)
Case 1:
If key k is in x, and x is a leaf, delete it. Done.
Case 2:
If key k is in x, and x is an internal node:
2a: If x's child y, preceding key k, has at least t keys, find the
predecessor k' of k in tree at y. Recursively delete it, and
replace k with k'.
2b: If 2a is not met because there aren't enough keys in y, look
at the child on the other side (call it z )instead.
2c: If they both have only t-1 keys, merge k and all of z into y.
Delete k and pointer to z from x. Free z and recursively
delete k from y.
Case 3:
If key k is not in x:
Find the child which must contain k, assuming k exists.
3a: If this child has only t-1 keys, but a sibling has t or more,
shift a key downward from x into the child and replace it with
one from its sibling.
3b: If this child and all its siblings have only t-1 keys, merge the
child with a sibling, moving a key down from x to be the new median.
Then recursively delete on the new child.
BINOMIAL TREES
~~~~~~~~~~~~~~
For a tree of order 0, there is one node.
For a tree of order k>0, its k children are trees of orders 0 through k-1.
The number of nodes must be a power of 2.
CLR page 401, but I found its explanation somewhat opaque.
HIGHER DIMENSIONAL TREES
~~~~~~~~~~~~~~~~~~~~~~~~
For quadtrees, octrees, and R-trees, see the separate section on geometric
algorithms.
==============================================================================
GRAPHS
==============================================================================
TERMINOLOGY AND REPRESENTATION
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let the graph G have set V of vertices and set E of edges.
Standard representations are adjacency lists or adjacency matrix. The
list is generally preferred for sparse graphs. For undirected graphs,
you can store only the upper or lower triangle of the matrix, and for
unweighted graphs, only one bit per entry is required.
BREADTH FIRST SEARCH
~~~~~~~~~~~~~~~~~~~~
Works on directed and undirected graphs which are unweighted. Produces a
distance to each vertex and a tree with the shortest path. Uses a FIFO to
manage vertices. In this algorithm, the tree is implicitly created from the
list of parents. Running time is O(V+E). CLR page 470.
Algorithm:
Color all vertices white.
Set distance to all vertices to infinity.
Set parent of each vertex to nil.
Put source vertex s at root of tree.
Insert s into queue.
While queue is not empty:
Take vertex u from head of queue
For each vertex v adjacent to u:
if v is white:
color v grey
distance[v] = distance[u]+1
parent[v] = u
insert v into the queue
Pop u from the queue
Color u black
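The algorithm above in Python; membership in the dist dictionary stands in for the white/grey/black coloring:

```python
from collections import deque

def bfs(adj, s):
    """Shortest unweighted distances and BFS-tree parents from s. O(V+E).
    adj maps each vertex to a list of its neighbors."""
    dist = {s: 0}
    parent = {s: None}
    queue = deque([s])               # FIFO of discovered, unprocessed vertices
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:        # "white" vertex: first discovery
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent
```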
DEPTH FIRST SEARCH
~~~~~~~~~~~~~~~~~~
Multiple trees may be found (a forest). Each node gets two
timestamps; when it is greyed and when it is blacked. Time is
a global variable. Running time O(V+E). CLR page 478.
DFS:
Color all vertices white; set predecessor to nil
For each vertex:
If vertex is white:
DFS-visit(vertex)
DFS-visit (vertex u):
Color u grey
set timestamp on discovered_time(u)
Increment global time
For each v adjacent to u:
If v is white:
set parent(v)=u
DFS-visit(v)
color u black
set timestamp on finished_time(u)
Classification of Edges:
Tree edges are edges in the forest (white when found)
Back edges are those connecting to an ancestor (gray when found)
Forward edges are nontree edges to a descendant (black when found)
Cross edges are all others (black when found)
For undirected graph, convention is to pick first from list which applies.
CLR page 482
Topological Sort:
Sorts a DAG such that if there is an edge (u,v), u appears before v.
Of course this does not work if it is cyclic. CLR 486.
Algorithm:
Perform a depth-first search to get finishing times
As each vertex is finished, insert it into the front of a linked list
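The two steps above, sketched in Python (assumes the input is a DAG, per the note):

```python
def topological_sort(adj):
    """DFS-based topological sort: prepend each vertex as it finishes."""
    order, visited = [], set()

    def visit(u):
        visited.add(u)
        for v in adj.get(u, ()):
            if v not in visited:
                visit(v)
        order.insert(0, u)           # finished: goes to the front of the list

    for u in adj:
        if u not in visited:
            visit(u)
    return order
```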
Strongly Connected Components:
A strongly connected component is one in which each vertex is
reachable from the others. Thus this is applicable only to
directed, cyclic graphs. We will need the transpose of the graph,
which is obtained by reversing the direction of all edges. Proof
is nonobvious; CLR page 489.
Algorithm:
Call DFS on original graph, getting finishing times
Compute transpose of graph
Call variant DFS on transpose, visiting nodes in order of
decreasing finishing time from original DFS
Output vertices of each tree in this search as a strongly connected component
MINIMUM SPANNING TREES
~~~~~~~~~~~~~~~~~~~~~~
The minimum spanning tree is the set of edges which reach all nodes at minimum
cost. It is always acyclic. Greedy algorithms are optimal here. CLR 499; HL
363.
Trivial "generic" algorithm framework:
Set A to null set
While A is not a minimum spanning tree:
Find an edge (u,v) that is "safe" (A plus the edge is still a subset
of some minimum spanning tree) and add it to A
Return A
Selecting an edge:
Cut the graph in a way that does not cut an edge in A ("respects A"). The
lowest-cost edge crossing the cut is safe to add to A.
Proof CLR page 501, not entirely intuitive to me.
Kruskal's Algorithm:
A is a forest. Algorithm requires selection of a disjoint-set data
structure. CLR 504. O(E lg E) with correct set representation.
Algorithm:
Set A to null set
Make a one-item set of each vertex
Sort edges by weight
For each edge (u,v) in order of nondecreasing weight:
If u and v are not in the same set:
Join the set with u and the set with v
Add (u,v) to A
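The algorithm above in Python, with an inline disjoint-set forest (path compression only, for brevity; union by rank is described in the disjoint-sets section):

```python
def kruskal(n, edges):
    """edges: (weight, u, v) triples over vertices 0..n-1. O(E lg E)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):            # nondecreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                         # different sets: edge is safe
            parent[ru] = rv                  # join the two sets
            mst.append((u, v, w))
    return mst
```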
Prim's Algorithm:
A is maintained as a single tree. A priority queue is used to store
vertices to add, and gives an example of why we care about a
decrease-key operation in heaps. O (E lg V), improvable to O(E + V lg
V) with Fibonacci heaps. CLR 509; HL 363. The HL treatment is
graphically intuitive but does not address the details of the required
data structures and the performance of various choices.
Algorithm:
Arbitrary starting node argument r
Insert all vertices into queue, with infinite key
Set r's key in queue to 0
Set r's parent to nil
While queue is not empty:
Extract minimum value from queue, call it u
For each v adjacent to u:
if v is in the queue and weight (u,v) is less than v's key:
Decrease v's key in the queue to weight(u,v)
set parent of v to u
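A Python sketch of the above; since Python's heapq has no decrease-key, this version uses the common lazy-deletion substitute (push a new, cheaper entry and skip stale ones on pop):

```python
import heapq

def prim(adj, r):
    """adj: {u: [(v, weight), ...]} for an undirected graph.
    Returns the MST parent map and total cost, starting from r."""
    parent = {r: None}
    best = {r: 0}                    # cheapest known edge into each vertex
    in_tree = set()
    pq = [(0, r)]
    cost = 0
    while pq:
        w, u = heapq.heappop(pq)
        if u in in_tree:
            continue                 # stale entry: vertex already added
        in_tree.add(u)
        cost += w
        for v, wv in adj[u]:
            if v not in in_tree and wv < best.get(v, float('inf')):
                best[v] = wv         # cheaper edge into v: "decrease key"
                parent[v] = u
                heapq.heappush(pq, (wv, v))
    return parent, cost
```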
SINGLE SOURCE SHORTEST PATH
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some similarities to breadth-first search. Also, maintain similar
list of parents/predecessors. Shortest paths may not be unique.
If one cares about only a single destination, one can stop the algorithm
when it is reached, but it is not possible to do asymptotically better
by looking for only one destination than for enumerating all the nodes.
Some algorithms allow negative weights as long as there are no negative-weight
cycles reachable from the starting node.
HL 359 (cursory), CLR chapter 25, p514.
We will maintain a bound on the shortest-path weight to each node. We will
then relax it (reduce it) if we can reach it with lower cost from one of the
adjacent nodes based on *that* node's estimate.
Dijkstra's Algorithm:
Requires all weights nonnegative.
Will maintain set S of vertices whose final weights are determined and
a priority queue Q that contains all vertices NOT in S. CLR 527.
Queue in linear array: O(V^2+E) = O(V^2)
Queue in binary heap: O(E lg V)
Queue in Fibonacci heap: O(V lg V + E)
The historical development of the Fibonacci heap was motivated by this
algorithm, with far more decrease-key than extract-min calls.
Algorithm:
Set all distances infinite; all parents nil.
Set S to empty set.
Put all vertices in Q, prioritized on distance
While Q is not empty:
Remove minimum from Q and store it in u
Add u to set S
For each vertex v adjacent to u:
if distance(v) > distance(u) + weight(u,v):
parent(v)=u
distance(v) = distance(u) + weight(u,v)
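The algorithm above in Python, again substituting lazy deletion for decrease-key (heapq has no decrease-key operation):

```python
import heapq

def dijkstra(adj, s):
    """adj: {u: [(v, weight), ...]}, all weights nonnegative.
    Returns shortest distances and shortest-path-tree parents from s."""
    dist = {s: 0}
    parent = {s: None}
    pq = [(0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                 # stale entry: u already finalized
        for v, w in adj[u]:
            if dist.get(v, float('inf')) > d + w:
                dist[v] = d + w      # relax the edge (u, v)
                parent[v] = u
                heapq.heappush(pq, (d + w, v))
    return dist, parent
```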
Bellman-Ford Algorithm:
Allows negative weights; can detect negative-weight cycles. CLR 532;
note their loop variable i is unused. Generally slower: O(VE).
Algorithm:
Set all distances infinite; all parents nil.
Repeat |V|-1 times: (relaxation loop)
For each edge (u,v):
if distance(v) > distance(u) + weight(u,v):
parent(v)=u
distance(v) = distance(u) + weight(u,v)
For each edge (u,v): (cycle detection loop)
if distance(v) > distance(u) + weight(u,v):
throw negative cycle exception
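The relaxation and cycle-detection loops above, in Python:

```python
def bellman_ford(vertices, edges, s):
    """edges: (u, v, weight) triples; negative weights allowed. O(VE).
    Raises if a negative-weight cycle is reachable."""
    dist = {v: float('inf') for v in vertices}
    parent = {v: None for v in vertices}
    dist[s] = 0
    for _ in range(len(vertices) - 1):       # relaxation loop
        for u, v, w in edges:
            if dist[v] > dist[u] + w:
                dist[v] = dist[u] + w
                parent[v] = u
    for u, v, w in edges:                    # cycle detection loop
        if dist[v] > dist[u] + w:
            raise ValueError("negative-weight cycle")
    return dist, parent
```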
Shortest Path in DAGs:
Can be significantly optimized by knowing it is acyclic. CLR 536.
Algorithm:
Topologically sort vertices
Set all distances infinite; all parents nil.
For each vertex u in topological order:
For each vertex v adjacent to u:
if distance(v) > distance(u) + weight(u,v):
parent(v)=u
distance(v) = distance(u) + weight(u,v)
SINGLE PAIR SHORTEST PATH
~~~~~~~~~~~~~~~~~~~~~~~~~
These algorithms take advantage of the fact that there is a single destination
to try to search the most promising routes first. They are not addressed
in CLR or HL, but see also dynamic programming techniques.
Best-First Search (A*, A-Star):
We need an admissible (does not overestimate) heuristic. In geometric
problems, this can be the great-circle distance. The algorithm can be
seen as a version of Dijkstra's algorithm with an added heuristic to
prune dominated possibilities. It can also be seen as the most general
search, with breadth-first, depth-first, and Dijkstra's algorithm all
being special cases with certain heuristics.
Most significant drawback can be memory usage.
Use of "closed list" is optional if a solution is guaranteed to exist or
if there is another way of rejecting cycles. If it is used, the
heuristic must be consistent (monotone: h(A) never exceeds
weight(A,B) + h(B) for any edge A->B) - this is obviously satisfied by
straight-line distance for map routes but might not be the case for an
arbitrary heuristic.
G cost: Actual shortest distance from source to node
H cost: heuristic (low estimate) cost from node to destination
F cost: G+H
Algorithm (http://en.wikipedia.org/wiki/A*_search_algorithm):
Set closed list to empty set
Set open list to source node
g(source)=0
while open list is not empty:
set x to node with lowest F score
if x is dest:
return
remove x from open list
add x to closed list
for each y which is a neighbor of x and is not in closed list:
temp_g_score = g(x) + weight(x,y)
temp_better = 0
if y is not in open list:
insert y into open list
calculate heuristic h(y)
temp_better = 1
else if temp_g_score < g(y):
temp_better = 1
if temp_better:
parent(y) = x
g(y) = temp_g_score
f(y) = g(y) + h(y)
Algorithm (http://www.autonlab.org/tutorials/astar.html):
Set priority queue empty
Set visited list empty
Put source into queue with priority F
While priority queue is not empty:
Remove node with lowest F. Call it n
If n is the destination, return success
For each m that is a successor of (adjacent to) n:
f(m)=g(m)+h(m) = g(n)+weight(n,m)+h(m)
if m has not been seen
OR m now has a lower F than before
OR m is in the priority queue with a higher F:
Put m in priority queue with f(m)
Add/update visited list to include m,f(m),parent(m)=n
else take no action on m
If queue is empty and we have not returned already, fail!
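The two listings above can be condensed into one Python sketch. This is my
own variant: it mirrors the first listing, uses lazy deletion instead of
priority updates, and assumes a consistent heuristic so a node can be
finalized the first time it is popped.

```python
import heapq

def a_star(adj, h, source, dest):
    """adj: {u: [(v, w), ...]}; h: admissible (and here assumed
    consistent) heuristic. Returns the path source->dest, or None."""
    g = {source: 0}
    parent = {source: None}
    open_q = [(h(source), source)]        # priority = f = g + h
    closed = set()
    while open_q:
        f, x = heapq.heappop(open_q)
        if x == dest:                     # reconstruct path via parents
            path = []
            while x is not None:
                path.append(x)
                x = parent[x]
            return path[::-1]
        if x in closed:
            continue                      # stale queue entry
        closed.add(x)
        for y, w in adj[x]:
            if y in closed:
                continue
            tentative = g[x] + w
            if y not in g or tentative < g[y]:
                g[y] = tentative
                parent[y] = x
                heapq.heappush(open_q, (tentative + h(y), y))
    return None                           # queue empty: fail
```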
ALL PAIRS SHORTEST PATH
~~~~~~~~~~~~~~~~~~~~~~~
It is possible to do fundamentally better than running a single-source
algorithm for every vertex. CLR chapter 26, page 550.
Matrix Algorithm Concept:
First construct all shortest paths of length 1. Then recursively add 1
to the maximum number of edges and recalculate until reaching |V|-1.
Extend-shortest-path:
Input matrix D distances, W weights
n = rows(D)
D' = new n by n matrix
for i in 1 to n:
for j in 1 to n:
d'(i,j) = infinity
for k = 1 to n:
d'(i,j) = min(d'(i,j), d(i,k)+w(k,j))
Naive-all-pairs-shortest-path:
Start with D=W
Call extend on D,W n-2 times.
To do better, instead of processing by W each time, we can square W,
then square that, etc, until reaching a power of 2 which is equal
to or greater than |V|-1. We thus get a runtime of O(V^3 lg V)
Faster-all-pairs-shortest-path:
Start with D=W
for(m=1; m < n-1; m*=2):
Dnew = extend-shortest-path(D,D) ; note D,D *not* D,W
Floyd-Warshall:
Achieves runtime of O(V^3). Basic concept is to construct
shortest paths using intermediate vertices {1..k}, and then extend to k+1.
d(0,i,j) = w(i,j)
d(k,i,j) = min(d(k-1,i,j), d(k-1,i,k) + d(k-1,k,j))
It remains to actually construct the paths in addition to calculating
the distances. One way is to calculate a matrix of predecessor vertices
on the shortest path. It is updated by replacing a value with k when
the corresponding min distance changes.
CLR page 558. A variant for calculating transitive closures only
(reachable nodes) is given on 562.
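A minimal Floyd-Warshall sketch with the predecessor-matrix update
described above. Representation choices (lists of lists, float('inf') for
no edge) are mine; collapsing the k index into a single dist matrix is the
usual space optimization.

```python
def floyd_warshall(w):
    """w: n x n weight matrix, 0 on the diagonal, float('inf') where
    there is no edge. Returns (dist, pred); pred[i][j] is the
    predecessor of j on a shortest i-to-j path."""
    n = len(w)
    dist = [row[:] for row in w]
    pred = [[i if w[i][j] < float('inf') and i != j else None
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    pred[i][j] = pred[k][j]   # path now routes through k
    return dist, pred
```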
Johnson's:
Algorithm for sparse graphs, O(V^2 lg V + VE), so faster if E is
smaller than V^2. Uses both Bellman-Ford and Dijkstra as components;
see CLR 569.
MAXIMUM FLOW
~~~~~~~~~~~~
...
==============================================================================
GEOMETRIC ALGORITHMS
==============================================================================
LINE SEGMENTS
~~~~~~~~~~~~~
Cross Product:
Remember it can be found as the determinant of the matrix:
[x1 x2]
[y1 y2] = x1y2-x2y1
Direction of Turn:
Take a line segment p0-p1 and p0-p2. Translate p0 to the origin.
Then calculate the cross product of p1 and p2. If it is positive,
p0-p1 is clockwise from p0-p2.
Two Line Segment Intersection:
First, check whether bounding boxes intersect:
max(x1,x2) >= min(x3,x4) AND
max(x3,x4) >= min(x1,x2) AND
max(y1,y2) >= min(y3,y4) AND
max(y3,y4) >= min(y1,y2)
Then, if this is true, check whether the segments straddle each other
with cross products. p3 and p4 must have different orientations
relative to p1-p2:
sign((p3-p1) X (p2-p1)) != sign((p4-p1) X (p2-p1))
and, symmetrically, p1 and p2 must have different orientations relative
to p3-p4. If any cross product is exactly zero, an endpoint of one
segment lies on the line through the other. If both are zero, the
segments are collinear.
CLR 888
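A sketch of the straddle test above, including the zero-cross-product
boundary cases (helper names cross and on_segment are my own; the
structure follows the CLR-style test):

```python
def cross(o, a, b):
    """z-component of (a-o) x (b-o); its sign gives the turn direction."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def on_segment(p, q, r):
    """True if r lies in the bounding box of segment p-q (only called
    when p, q, r are already known to be collinear)."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0]) and
            min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))

def segments_intersect(p1, p2, p3, p4):
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    # proper straddling on both sides
    if ((d1 > 0 and d2 < 0) or (d1 < 0 and d2 > 0)) and \
       ((d3 > 0 and d4 < 0) or (d3 < 0 and d4 > 0)):
        return True
    # boundary cases: an endpoint lies on the other segment
    if d1 == 0 and on_segment(p3, p4, p1): return True
    if d2 == 0 and on_segment(p3, p4, p2): return True
    if d3 == 0 and on_segment(p1, p2, p3): return True
    if d4 == 0 and on_segment(p1, p2, p4): return True
    return False
```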
Many Line Segment Intersection:
Do any two segments in an arbitrarily large set intersect? O(n lg n)
Simplified algorithm assumes no vertical segments and no three
segments intersecting at one point. CLR 892.
The idea is to move a "sweep line" left to right (or any other
direction). Each time an endpoint is encountered, check to see whether
any of the segments which are intersected by the sweep line have changed
order relative to each other. If so, they crossed.
CONVEX HULLS
~~~~~~~~~~~~
Graham's Scan:
O(n lg n). CLR 900. Uses a stack to keep track of vertices which
are tentatively on the convex hull.
Start with p0 as the bottommost point (or leftmost, etc)
Sort points p1 through pn by polar angle around p0, counterclockwise
Push p0, p1, p2 onto stack
for i in 3 to n:
while the angle from 2nd-from-top, top-in-stack, and pi is nonleft:
pop from stack
push pi onto stack
return points in stack
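A sketch of the scan (my own code, not CLR's: atan2 is used for the angle
sort with distance as a tie-break, duplicates of p0 are assumed absent,
and collinear hull points are dropped by popping on cross <= 0):

```python
from math import atan2

def cross(o, a, b):
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def graham_scan(points):
    """Returns hull vertices in counterclockwise order."""
    p0 = min(points, key=lambda p: (p[1], p[0]))  # bottommost, then leftmost
    rest = sorted((p for p in points if p != p0),
                  key=lambda p: (atan2(p[1]-p0[1], p[0]-p0[0]),
                                 (p[0]-p0[0])**2 + (p[1]-p0[1])**2))
    stack = [p0]
    for p in rest:
        # pop while the last two stack points and p make a non-left turn
        while len(stack) > 1 and cross(stack[-2], stack[-1], p) <= 0:
            stack.pop()
        stack.append(p)
    return stack
```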
Jarvis' March:
O(nh) where h is the number of points on the hull.
Start with lowest point. Calculate the polar angle of all the other
points to this one. Pick the one with the lowest value. Then,
recalculate with regard to that point. Repeat until getting back
to the original one.
Envision it as sweeping a scan arm in until it hits a point each time,
then moving to that point and repeating. CLR 906 advocates an
approach of keeping a separate left and right side for simplicity in
calculating angles.
2D POINTS
~~~~~~~~~
Closest Pair of Points:
CLR 908
LOCALITY SENSITIVE HASHING
~~~~~~~~~~~~~~~~~~~~~~~~~~
Probabilistic assignment of similar items to the same buckets. It is
primarily used in high-dimensional data; it is asserted that by 10-20
dimensions, k-d trees and friends are searching over a significant fraction
of the space (Gionis, Indyk, Motwani 1999)
http://www.mit.edu/~andoni/LSH/
BSP (BINARY SPACE PARTITION TREE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Overview:
Interior nodes represent splitting lines (or planes, hyperplanes, etc in
higher dimensions), and also contain objects which are entirely within
the splitting plane. Leaf nodes are convex and contain 0 or 1 objects or
object fragments (since objects can be split). A BSP which is built only
using splitting lines which are extensions of input segments (or planes
which are extensions of input polygons) is called an "auto-partition"
(BCKO 262).
This can be used with the painter's algorithm to select which things to
paint (BCKO 263). Other canonical uses include shadow generation,
set operations on polyhedra, cell decomposition methods for motion
planning, and as an index in GIS (BCKO 278 has references for each).
Both K-D trees and quad/octrees are a special case of BSP trees.
In Practice:
The size of the tree is O(n+2n*ln(n)) in 2D and O(n^2) in 3D space on
disjoint triangles. In practice, however, they perform well on low
density data. Thus, define the idea of density - the smallest lambda
such that any ball B intersects at most lambda objects whose diameter
is greater than or equal to that of B (i.e. small objects are not counted).
(BCKO 271). However, the randomized auto-partition algorithm does not
perform well in the low-density case.
Define a "guard" as the corner of an axis-aligned square containing an
object. Split initially as if in a quad tree, but then if one quadrant
contains too many guards, shrink towards a corner using a heuristic k.
The details are given in BCKO. For disjoint line segments in the plane,
there is a tree of size O(n log(lambda)), where lambda is the density.
For triangles in 3D, this is O(n lambda), which varies from n to n^2
as lambda varies from 1 to n.
K-D Trees
~~~~~~~~~
K-D trees are a special case of BSP trees. Each node is a point in
K-dimensional space, and each cutting hyperplane is in one of the K dimensions.
Basic Case:
Assume that no two points exactly match in x or y. Split first on x,
then split each of these on y, then on x, etc. Split on the median
point in the relevant dimension. Finding the median is the most
expensive part; presort the points in both x and y to facilitate this.
To query for points within a rectangle, recurse down the tree,
following only children which intersect the rectangle, checking the
points in regions which are not entirely within the rectangle, and taking
the union of their results. For m answers, the time to do the
query is O(sqrt(n) + m) (BCKO 104).
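A sketch of the build and rectangle query above (my own code; for
simplicity it re-sorts at every level instead of presorting in x and y as
the text suggests, which costs extra log factors at build time):

```python
def build_kdtree(points, depth=0):
    """Node = (point, left, right); the split axis alternates x/y
    with depth, splitting on the median point."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid+1:], depth + 1))

def range_query(node, rect, depth=0, out=None):
    """rect = (xmin, xmax, ymin, ymax); collects points inside it,
    recursing only into children whose region can meet the rect."""
    if out is None:
        out = []
    if node is None:
        return out
    (x, y), left, right = node
    xmin, xmax, ymin, ymax = rect
    if xmin <= x <= xmax and ymin <= y <= ymax:
        out.append((x, y))
    coord, lo, hi = (x, xmin, xmax) if depth % 2 == 0 else (y, ymin, ymax)
    if lo <= coord:                       # rect reaches the low side
        range_query(left, rect, depth + 1, out)
    if coord <= hi:                       # rect reaches the high side
        range_query(right, rect, depth + 1, out)
    return out
```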
QUADTREES
~~~~~~~~~
Quadtrees are also a special case of BSP trees.
Basic Version for Points:
Split until each square contains at most one point. Make an arbitrary
decision about <= vs. >=. It is possible to store the size of each
square at the nodes, or recompute it while walking the tree. (BCKO 309)
The tree can be quite unbalanced, and the height is not a trivial function
of the number of points. However, it is at most log(s/c) + 3/2, where c
is the smallest distance between any two points and s is the side length
of the initial square (BCKO 311)
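The basic split-until-one-point construction can be sketched as follows
(my own code and dict representation; assumes all points are distinct, or
the recursion would not terminate; the "<= goes west/south" rule is the
arbitrary decision mentioned above):

```python
def build_quadtree(points, x, y, size):
    """Square has lower-left corner (x, y) and side length size.
    A leaf holds at most one point; otherwise split into quadrants."""
    if len(points) <= 1:
        return {'square': (x, y, size), 'points': points}
    half = size / 2
    cx, cy = x + half, y + half
    buckets = [[], [], [], []]            # SW, SE, NW, NE
    for px, py in points:
        i = (1 if px > cx else 0) + (2 if py > cy else 0)
        buckets[i].append((px, py))
    corners = [(x, y), (cx, y), (x, cy), (cx, cy)]
    return {'square': (x, y, size),
            'children': [build_quadtree(b, qx, qy, half)
                         for b, (qx, qy) in zip(buckets, corners)]}
```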
Finding Neighbors:
Let's say we want to find the north neighbor. As long as we are
NOT one of the south children of our parent, move up. Then move over
to the north sibling. If we want a leaf, we then have to walk back
down, always choosing the southern child. If we hit the root while
going up, we are at an edge and there is no neighbor in the desired
direction. (BCKO 313)
Balanced Quadtree:
Take the quadtree as balanced if any two neighboring squares differ by
at most 2x in size. To balance a tree, put all the leaves into a list.
For each leaf, find neighbors in all four directions and see if they
are too small. If so, the leaf has to be split. This might cause
its neighbors to need splitting in turn, so insert them into the list
when it does. The balanced version does not have asymptotically more nodes, and
can be constructed in O((d+1)m) time (BCKO 315).
More Information:
BCKO 315 talks about mesh generation from quadtrees.
S. Aluru, "Quadtrees and Octrees" discusses compression.
Eppstein et al "The skip quadtree" combines with skiplists for
efficient insertion and removal.
R-TREES
~~~~~~~
Source for this section is original paper:
http://www.cs.fiu.edu/~vagelis/classes/COP6727/publications/gutman-rtree.pdf
"R-Trees: A Dynamic Index Structure for Spatial Searching" by Antonin Guttman
Overview:
The R stands for Rectangle, which is used as a bounding box. R-trees are
designed for paged storage, and resemble B-trees in this regard. Each node
has a maximum number of entries. Each entry in a non-leaf node stores
pointers to children and the bounding box of each child. Individual
entries can have nonzero size - they are rectangles, not points.
Each node has a maximum of M entries and a minimum of m <= M/2 entries.
The root is exempt from the minimum.
Search:
The input to search is a query box. Start at the root, and move down
into those children which overlap with the query box.
Because more than one subtree may need to be visited, it is not possible
to guarantee good worst-case performance.
Insertion:
Start at the root. Choose the rectangle which requires the least
enlargement to hold the new entry. At the leaf, insert if the leaf
is not full, or split otherwise.
Adjust:
Start at a leaf N. If the leaf was just split, save the second one. Move
up to the parent and adjust the coverage rectangle to reflect the change.
If there
is a second node NN saved from the split, try to add it to the parent as
well. If the parent is now overfull, split the parent as well and repeat.
Splitting:
The M+1 items must be divided into 2 new nodes. It is desirable to
divide them such that the overall area is lowest, but it is prohibitive
to exhaustively check all combinations.
Several algorithms are described in the paper, both linear and quadratic
with respect to M; the conclusions say the linear algorithm is good
enough, and the slightly worse quality of the splits is not significant.
For this reason, I will describe only the linear algorithm.
Along each dimension, find the rectangles with the highest low side and
the lowest high side; record the separation. Normalize by the size of
the entire set in that dimension. Choose the most extreme pair in any
dimension. These are the two "seeds". Then, choose any of the remaining
entries, and add it to the set which is enlarged the least to include it.
If you get to a point where all entries are done, or where one set must
have the remainder of the available entries assigned to it to hit the
minimum, do so and stop.
==============================================================================
DYNAMIC PROGRAMMING
==============================================================================
Overview:
Dynamic programming is a recursive technique in which a subproblem can be
solved and then used to solve the larger problem. It is applicable to
problems like single pair shortest path (HL 425) or ordering matrix
multiplication (CLR 302). Traditionally, the dynamic programming problem
is computed bottom-up instead of actually recursing. In the matrix case,
compute all the subsequences of length 2, then use those to compute all the
length 3, and so on.
HL suggests that dynamic programming is applicable where the problem has
multiple stages with a state and a policy decision, the costs of each stage
of the problem is independent of the others, and there is a recursive
relationship to find the cost of stage n+1 using the cost of stage n (HL
430). CLR puts the same thing a bit differently, referring to optimal
substructure and overlapping subproblems (CLR 309). The latter corresponds
to a recursive solution needing to solve the same subproblems repeatedly.
"Memoization" - basically caching the results of the subproblems - provides
much of the benefit while allowing for the traditional top-down approach
instead of bottom-up.
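The memoized top-down approach can be sketched with the
longest-common-subsequence problem (the CLR 314 example below); the code
is my own, using functools.lru_cache as the subproblem cache:

```python
from functools import lru_cache

def lcs_length(a, b):
    """Longest common subsequence via memoized recursion on prefix
    lengths -- only (len(a)+1)*(len(b)+1) distinct subproblems, so
    each is solved once despite the top-down recursion."""
    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == 0 or j == 0:
            return 0
        if a[i-1] == b[j-1]:
            return solve(i-1, j-1) + 1    # last characters match
        return max(solve(i-1, j), solve(i, j-1))
    return solve(len(a), len(b))
```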
Other Examples:
CLR 314: longest common subsequence. Define the subproblems in terms
of sequence prefixes, of which there are only m*n for strings of length
m and n.
CLR 320: optimal polygon triangulation.
HL 453: probabilistic dynamic programming with expected values
==============================================================================
LINEAR PROGRAMS
==============================================================================
CLR 539 (short). Strang chapter 8. HL in depth.
SIMPLEX METHOD
~~~~~~~~~~~~~~
Geometric Interpretation (HL 82):
Start at origin
Move to a better adjacent corner point feasible solution
(select the direction based on greatest slope)
When no adjacent CPF solutions are better, stop
Slack Variables (HL 87):
Introduced in order to make all constraints an equality
New problem is called the augmented form
Negative slack variable indicates infeasible solution
Basic solution: Augmented corner point solution
Basic feasible solution: Augmented corner point feasible (CPF) solution
Algorithmic Interpretation:
Number of basic variables is number of equations
Non-basic variables are the others
Set nonbasic variables to 0
Solve for basic variables
To look at the next solution, exchange one basic variable for a nonbasic one
Algorithm HL92; tabular form HL95. Also Strang
==============================================================================
NONLINEAR PROGRAMS
==============================================================================
==============================================================================
INTEGER PROGRAMS
==============================================================================
==============================================================================
MATRIX ALGORITHMS
==============================================================================
LU DECOMPOSITION
~~~~~~~~~~~~~~~~
Use Gaussian elimination to turn original matrix A into an upper triangular
matrix U. The coefficients of the rows which were subtracted go into
L. This gives A=LU.
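A minimal sketch of this elimination (my own code: no pivoting, so it
assumes no zero pivot is encountered; the multipliers from each row
subtraction are recorded in L):

```python
def lu_decompose(a):
    """Gaussian elimination without pivoting on a square matrix a
    (list of lists of floats). Returns (L, U) with A = L U, L unit
    lower triangular, U upper triangular."""
    n = len(a)
    u = [row[:] for row in a]
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = u[i][k] / u[k][k]
            l[i][k] = factor              # subtraction coefficient goes to L
            for j in range(k, n):
                u[i][j] -= factor * u[k][j]
    return l, u
```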
Symmetric matrices can be decomposed further into A=LDL' (' is transpose here).
Create D by factoring out a diagonal matrix from U such that the pivots are all
1. What's left will be L'.
POSITIVE DEFINITE MATRICES
~~~~~~~~~~~~~~~~~~~~~~~~~~
Positive definite matrices indicate the presence of a minimum.
1. A matrix with all positive pivots - which requires a positive diagonal
and a positive determinant - is a positive definite matrix.
2. A is positive definite if x'Ax > 0 for all nonzero x. Strang 18
==============================================================================
AUCTION ALGORITHMS
==============================================================================
... section pending ...
==============================================================================
Well, that was my file of notes on algorithms. Parts of it may not be finished
yet. It may not be very useful to you, because it is so abbreviated, but it is
useful to *me* because it is abbreviated in a way that corresponds well with
the way things are already stored in my memory.
Notation: O() is used for an upper bound, Omega for a lower bound,
and Theta for both. lg is a base-2 logarithm.
CLR: Cormen, Leiserson, Rivest. Introduction to Algorithms. 1st edition
HL: Hillier and Lieberman. Introduction to Operations Research. 6th ed
BCKO: de Berg, Cheong, van Kreveld, Overmars, Computational Geometry, 3rd Ed
Strang: Gilbert Strang, Introduction to Applied Mathematics
To print this file nicely:
enscript -B -o alg.ps -1 algorithms.ref
psnup -c -4 -b-.25in -m.25in alg.ps alg2.ps
lpr alg2.ps
vim: ai