CS421 - Advanced Data Structures and Algorithm Development - Chapter 3: Selecting

median - smallest value bigger than half of the numbers
mode - most frequent number
percentiles - n% of all numbers are smaller than the n^th percentile

Rankings
- asymmetry - if a precedes b then b cannot precede a
- transitivity - if a precedes b and precedes c then a precedes c
- partial order - transitive asymmetric relation
- PAUSE: Is "sister of" a partial order? "cousin of"?
- relation on a set is a set of ordered pairs of elements of the set
- If every pair of elements in a set S is relatable by the p.o. relation R, then R is a linear order and S is orderable
- PAUSE: Is "funnier than" a linear order
- poset - partially ordered set
- One poset is a subposet of another if all its relations are contained within the other
- The poset containing n unrelated elements, called singletons, is a subposet of every poset with n elements
- voter's paradox (from game theory - decision making in uncertain, risky, or competitive situations
Finding the Best
```
FindMax (List, upper, lower)
	{ Find the index of the largest of List[lower, upper];
	  upper >= lower > 0	}
	max = lower
	for index = lower+1 to upper
		if List[index] > List[max] then
			max = index
	return max
```
n-1 comparisons
max poset - any poset solving the problem of finding the best
must connect every node to every other node; connecting n nodes requires at least n - 1 connections ==> FindMax has optimal worst cost of n-1
- The Average Cost - (n-1)/2????
  consider the n! equally likely permutations of the n elements
  probability that L[i] is better than all previous elements is 1/i
  The average number of assignments is 1 + sum(P(L[i] > L[j] for all j < i) x 1 = 1 + sum(1/i) {i from 2 to n} [n^th harmonic number
  H_2k = 1 + 1/2 + 1/3 + 1/4 + ... + 1/2^k
  = 1 + 1/2 + 2/4 + 4/8 + ... + 2^k-1/2^k
  = 1 + k/2
  BUT
  H_2k = 1 + 1/2 + 1/3 + 1/4 + ... + 1/2^k
  = 1 + 2/2 + 4/4 + 8/8 + ... + 2^k-1/2^k + 1/2^k
  = k + 1/2^k
  k + 1/2^k >= H_2^k >= k/2 + 1
  Since 2^{ceil(lg n)} >= n >= 2^{floor(lg n)}
  ceil(lg n) + 1/2^{ceil(lg n)} >= H_n >= floor(lg n) /2 + 1
  Hence H_n = Q(lg n)
- Statistics - know the average number of assignments, but not the likely variation
  - statistics - area of mathematics that analyzes, summarizes, and interprets numerical data
  - variance - average of the square of the differences of the numbers of assignments from the average
  - standard deviation - square root of the variance
  - random variable - function associated with an experiment that can take on numerical values of some property of the experiment
  - average of a random variable X = m and the standard deviation is s
  - 75% of a distribution lies within two standard deviations of the distribution's average (can be shown using Chebyshev's inequality)
  - With the assumption that each permutation of L is equally likely, the average # of assignments FindMax does is approx. ln n with a std. dev. of sqrt(ln n).
Finding the Second Best
2n-3 algorithm (find best, discard and find second best)
Is this optimal?
What if we break the problem into two halves and use FindMax on both halves , then find the second best (ceil[3n/2]-2)
```
FindSecondBest(List, lower, upper)
	{Find the index of the second largest of List[lower..upper]
	 Break the list into 2 and find the largest of each half, then
	 find the largest of the candidates for second largest.}
	 
	 mid = (lower+upper)/2
	 index1 = FindMax(List, lower, mid)
	 index2 = FindMax(List, mid+1, upper)
	 if List[index1] > List[index2] then
	 	List[index1] = List[index2]
		return FindMax(List, lower, mid)
	else
		List[index2] = List[index1]
		return FindMax(List, mid+1, upper)
```
Is ceil[3n/2]-1 optimal? ---- break the first part into quarters: ceil[5n/4]-1...
A height m binomial tree is
- a single node if m = 0
- two height m-1 binomial trees connected at their roots for m > 0
A height m binomial tree has 2^m nodes, so an n-node binomial tree has height lg n
there are (^{lg n}_l) level l nodes
(ⁱ_j) = (^i-1_j) + (^i-1_j-1)
By building a binomial tree, we can find the second best of n things using at most n + ceil[lg n] - 2 comparisons
```
BuildBinomialTree(List, lower, upper)
	{Build a binomial tree on List[lower..upper], upper - lower + 1 is
	 a power of two. upper >= lower > 0 }
	
	if upper > lower then
		mid = (lower + upper) / 2
		BuildBinomialTree(List, lower, mid)
		BuildBinomialTree(List, mid+1, upper)
		if List[mid] > List[upper] then
			List[lower..mid] >==< List[mid+1..upper]
```
BuildBinomialTree has worst number of comparisons = 2^{lg n} - 1 = n-1
Could build binomial tree using pointers but requires more space (two pointers for each node)
```
FindSecondMax(List, lower, upper)
	{Find the index of the second largest of List[lower..upper].
	 Rearrange List to form a binomial tree, then find the largest of the
	 candidates. upper-lower+1 is power of 2. upper > lower > 0 }
	 
	 BuildBinomialTree(List, lower, upper)
	 
	 2max = upper - 1
	 for i = 1 to lg(upper-lower+1) - 1
	 	if List[upper-2ⁱ] > List[2max] then
			2max = upper -2ⁱ
	return 2max
```
FindSecondMax does up to (n lg n) / 2 swaps
first phase does n - 1 comparisons
second phase does lg n - 1 comparisons
Is n + lg n - 2 optimal?
- Adversaries: fiends who try to make all algorithms solving a problem work as hard as possible (selects inputs that make solving the problem the hardest) -- Hangman example
- The Lower Bound
  PAUSE: To find the third best, do we need to know the best?
  second best algorithms: must be n-1 losers to best, and k-1 of thise must lose twice, so there must be at least n+k-2 comparisons
  let Alice's strength=a and Bob's strength=b, winner will have strength=a+b
  if a>=b then 2a >= a + b, so in the worst case the strength doubles and each original strength starts at 1
  strength must equal or exceed the number of contestants n, where 2^k >= n > 2^k-1 where k is the number of "fights" and k < ceil[lg n]
  therefore, finding the second best requires at least n + ceil[lg n] - 2 comparisons in the worst case and the FindSecondMax has the optimal worst cost
Finding the Best and the Worst
Use FindMax twice with different orders: 2n-3 comparisons
```
FindMaxMin(List, lower, upper)
	{Find the indicies of the largest and smallest of List[lower..upper]
	 upper >= lower > 0 }
	
	case upper - lower + 1
		= 1 : return lower, lower
		= 2 : if List[lower] > List[upper] then
				return lower, upper
			  else
			  	return upper, lower
		> 2 : 
				mid = (lower+upper)/2
				max1, min1 = FindMaxMin(List, lower, mid)
				max2, min2 = FindMaxMin(List, mid+1, upper)
				if List[max1] > List[max2] then
					max = max1
				else
					max = max2
				if List[min1] > List[min2] then
					min = min2
				else
					min = min1
				return max, min	
```
f(n) = 0 if n=1; 1 if n=2; f(floor[n/2])+f(ceil[n/2])+2 if n > 0
f(n) = 3n/2 - 2 + (n-2^{floor[lg n]})/2 if n <= 3x2^{floor[lg n]-1};
= 3n/2 -2 + (2^{floor[lg n]+1} - n)/2 otherwise
PAUSE: Check that f(n) = 3n/2 -2 when n = 2ⁱ or n = 2ⁱ+/- 1
PAUSE: Since the algorithm is good when n = 2ⁱ, what should we do if n = 2ⁱ + 2ⁱ? What does that decomposition suggest?
Split six things into two parts (3x3 vs. 4x2)
- Designing Algorithms Using Recurrence
  f(n) = 0 if n=1; 1 if n=2; min(1<=k<=n-1) {f(k) + f(n-k) + 2 if n > 2
  e.g. f(n) = 0 if n=1; 1 if n=2; f(2) + f(n-2) + 2 if n > 2
  f(n) = ceil[3n/2]-2
```
FindMaxMin(List, lower, upper)
	{Find the indicies of the largest and smallest of List[lower..upper]
	 upper >= lower > 0 }
	
	case upper - lower + 1
		= 1 : return lower, lower
		= 2 : if List[lower] > List[upper] then
				return lower, upper
			  else
			  	return upper, lower
		> 2 : 
				if List[lower] > List[lower+1] then
					max1 = lower; min1 = lower+1
				else
					max1 = lower+1; min1 = lower
				max2, min2 = FindMaxMin(List, lower+2, upper)
				if List[max1] > List[max2] then
					max = max1
				else
					max = max2
				if List[min1] > List[min2] then
					min = min2
				else
					min = min1
				return max, min	
```
  f(n) = 0 if n=1; 1 if n=2; f(n-2) + 3 if n>2
- The Lower Bound
  Show that ceil[3n/2]-2 is optimal by examining all possible states any algorithm solving the problem must go through. [state space lower bound]
  - Novices: not been compared yet
  - Winners: won at least once, but not lost
  - Losers: lost at least one, but not yet won
  - Moderates: both won and lost at least once
  starting state: (n,0,0,0) ==> ending state: (0,1,1,n-2) [wnat to show that it requires (3n/2)-2 comparisons to go from start to end
  six types of useful comparisons (N:N, W:W, L:L, N:W, N:L, W:L) [Table 3.4]
  use an adversary to force every transition to be the least efficient: winners always win except against other winners and losers always lose except against another loser [therefore we ignore W:L comparisons] [Table 3.5]
  Compare novices with novices requires floor[n/2] comparisons; then compare W:W and L:L for n-2 comparisons; overall comparisons is floor[3n/2]-2 (optimal worst cost)

Finding the ith Best

Split(List, lower, upper, pivot-loc)
	{Split List[lower..upper] into two parts, those less than List[pivot-loc]
	 on the left and those greater than it on the right. Return the pivot's
	 new position. upper >= pivot-loc >= lower > 0}
	 
	 pivot = List[pivot-loc]
	 List[lower] <=> List[pivot-loc]
	 pivot-loc = lower
	 for index = lower+1 to upper
	 	if pivot > List[index] then
			pivot-loc++
			List[index] <=> List[pivot-loc]
	List[lower] <=> List[pivot-loc]
	return pivot-loc

Finding the ith Best By Randomizing

Find(List, lower, upper, i)
	{Find the index of the ith largest of List[lower..upper]
	 upper >= lower > 0; upper - lower + 1 >= i >= 1 }
	 
	 index = uniform(lower, upper)
	 pivot-loc = Split(List, lower, upper, index)
	 case upper - pivot-loc
	 	< i-1: Find(List, lower, pivot-loc - 1, i-upper+ pivot-loc - 1)
		= i-1: return pivot-loc
		> i-1: Find(List, pivot-loc + 1, upper, i)

Finding the ith Best Without Randomizing

Select(List, lower, upper, i)
	{Find the index of the ith largest of List[lower..upper].
	 Use the list Medians[1..ceil[n/5]] as extra storage.
	 upper >= lower > 0; upper - lower + 1 >= i >= 1 }
	 
	 n = upper - lower + 1	 
	 for j = 0 to ceil[n/5] - 1
	 	Medians[j+1] = median of List[(lower+5j)..(lower+5j+4)]
	 if n is not a multiple of 5 then
		Medians[ceil[n/5]] = median of the remaining elements of List
	 index = Select(Medians, 1, ceil[n/5], ceil[n/10])
	 Make index the index of the corresponding element of List
	 pivot-loc = Split(List, lower, upper, index)
	 case upper - pivot-loc
	 	< i-1: Select(List, lower, pivot-loc - 1, i - upper + pivot-loc -1_
		= i-1: return pivot-loc
		> i-1: Select(List, pivot-loc + 1, upper, i)

The Partition Problem
What is the worst cost for find the i₁ best elements, the i₂ next best elements, ..., the i_k worst elements?
selection is half a serach problem (finding the i^th) and half a structured problem (order the input to five the i^th).
searching is a partition problem where our input is a sorted list plus one more element and we have to produce a sorted list.
sorting is a partition problem where each i_j is one element
Changing the Model
Want to find the "approximate best":
Let X = "approximate best" produced by algorithm and Y = "actual best":
- X = Y
- X is close to Y:
  rank(X) <= rank(Y) + "small" [approximation algorithm)
- X is usually Y:
  P(X = Y) >= large [probablistic algorithm]
- X is "usually" "close" to Y:
  P(rank(X) <= rank(Y) + "small") >= "large" [heuristic]
last three algorithms are relaxed algorithms
Coda - Artists and Artisans