
CS421 - Advanced Data Structures and Algorithm Development - Chapter 2: SEARCHING

Good search algorithms depend on the search problem. Assume that: we are searching for a well-defined and fixed target; we have unbounded time; the domain is static and finite; all probes cost the same; we never make a mistake; we remember information gathered during the search; and we use a key to represent each element.
  1. Linear Search
    • The Problem: given a list, L, of n elements and one more element, X, find X in L. If X is in L, identify at least one element of L that it is equal to.
    • The Model
      • Algorithm is run on an errorless, sequential, digital computer, and will be translated into a program in an imperative language.
      • All elements of L are unique
      • We can compare X or any element of L with any element of L and determine if the two elements are equal, or which one is less.
      • We can't find ordering information about X and the elements of L except by comparing elements.
      • GOAL: minimize the number of "3-way" element-by-element comparisons.
    • The Algorithm - reduce the problem to something simpler
      LINEAR_SEARCH (List, lower, upper, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. upper >= lower > 0.}
      	 
      	 if List[lower] = X then
      	 	return lower
      	 else if lower = upper then
      	 	return 0
      	 else
      	 	return LINEAR_SEARCH (List, lower+1, upper, X)	 
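
      A minimal C rendering of this pseudocode, keeping the notes' 1-based bounds; the int element type and the name linear_search are illustrative assumptions, not part of the notes:

        /* Recursive linear search over list[lower..upper] (1-based).
           Returns the position of x, or 0 if x is absent. */
        int linear_search(const int list[], int lower, int upper, int x)
        {
            if (list[lower] == x)
                return lower;
            else if (lower == upper)
                return 0;
            else
                return linear_search(list, lower + 1, upper, x);
        }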
      
    • Analysis
      1. find upper bound on the worst cost of the algorithm
      2. find lower bound on the worst cost of the problem
      3. find corresponding bounds on the average cost
    • The Worst Cost
      number of comparisons = f(n) = 1 if n=1; f(n-1)+1 if n > 1
      LINEAR_SEARCH (List, lower, upper, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. upper >= lower > 0.}
      	 
      	 index = lower
      	 while (upper >= index and List[index] != X)
      	 	index++
      	if index > upper then
      		return 0
      	else
      		return index
      
      Prove that: f(n) = 1 if n=1; f(n-1)+1 if n > 1 ==> f(n) = n
      NOTE: Proof by induction only works if we know the answer
      Can use the substitute-and-guess method to try to find a possible answer
      f(n) = f(n-1) + 1			[= f(n-1) + 1]
      	 = {f(n-2) + 1} + 1		[= f(n-2) + 2]
      	 :
      	 = f(n-i) + i			( this is a guess, NOT a proof )
      
      Let i = n - 1: f(n) = f(n - (n-1)) + (n - 1) = f(1) + n - 1 = 1 + n - 1 = n
      Another method is guess and test:
      let f(n) = rn + s
      r + s = 1
      rn + s = r(n-1) + s + 1 ==> r = 1 and s = 0 ==> f(n) = n
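      With the candidate f(n) = n in hand, induction closes the argument; a minimal sketch (LaTeX):

        % Claim: f(1) = 1 and f(n) = f(n-1) + 1 for n > 1 imply f(n) = n.
        \textbf{Base:} $f(1) = 1$.
        \textbf{Step:} assume $f(n-1) = n - 1$ for some $n > 1$; then
        $f(n) = f(n-1) + 1 = (n - 1) + 1 = n$. \qed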
    • The Lower Bound on the Worst Cost
      NOTE: Let g(n) be an upper bound on the worst cost of an algorithm A, and h(n) a lower bound on the worst cost of the problem P within the model M. If beyond some point g(n) - h(n) <= c, for some constant c, then there is no point looking for a better algorithm; up to a constant, A is as good a solution as is possible within M. A is worst case optimal for problem P.
      If g(n) - h(n) grows beyond every constant c, but beyond some point g(n)/h(n) <= c, for some non-zero constant c, then within M, A is worst case asymptotically optimal for P.
      To prove that LINEAR SEARCH has optimal worst cost, show that any algorithm that does fewer comparisons can be forced to give a wrong answer (i.e. no correct algorithm is "better")
      Proof by contradiction
      PAUSE: Couldn't an algorithm do something clever in the preprocessing stage so that it wouldn't have to compare X to a large number of elements?
    • Probability Theory (expressing the uncertainty of complicated events in terms of the assumed probability of simpler events)
      • sample space of an experiment is the set of all outcomes of the experiment
      • event - subset of the sample space
      • disjoint - if two events share no outcomes
      • probability of an event is the ratio of the number of favourable outcomes to the total number of possible outcomes (assuming all outcomes are equally likely)
      • Two events are independent if the probability of each event does not depend on the other
    • The Average Cost [How many comparisons does LINEAR SEARCH do on average?]
      • Assume that X is equally likely to be any of the elements of L
      • Assume that the only thing varying is X's position in L, if it is in L.
      • If X must be in L, then LINEAR SEARCH examines roughly half of L.
      • If X is equally likely to not be in L as to be any one of the elements of L, then LINEAR SEARCH examines roughly half of L.
      • If X is equally likely to be in L as to not be in L, then LINEAR SEARCH examines roughly three-quarters of L.
      • If X cannot be in L, then LINEAR SEARCH examines all of L.
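      A worked check of the first and third bullets, assuming each of the n positions of L is equally likely (LaTeX):

        % X known to be in L: finding it at position i costs i comparisons.
        E[\mathrm{cost} \mid X \in L] = \sum_{i=1}^{n} \frac{i}{n} = \frac{n+1}{2}
        % X in L with probability 1/2; a failed scan costs n comparisons.
        E[\mathrm{cost}] = \frac{1}{2} \cdot \frac{n+1}{2} + \frac{1}{2} \cdot n = \frac{3n+1}{4} \approx \frac{3n}{4}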
    • The Lower Bound on the Average Cost
      Given a problem, P, a model, M, the set of algorithms, AM, solving P within M, together with the set of all size n instances of P, In, and a cost function fA,
      the lower bound on the average cost of P = min over A in AM of { sum over I in In of P(I)fA(I) }
      If all comparisons must involve X, then LINEAR SEARCH has optimal average cost
      PAUSE: Why does every arrangement give the same average?
      Every average case optimal algorithm can only compare X to elements of L. Since the algorithm must also work in the worst case, it must compare X to every element of L. All algorithms cost the same as LINEAR SEARCH, so LINEAR SEARCH is optimal in the average case.
    • Programming
      • iterative solution is usually superior to the recursive for most languages and machines
      • could place X in L[n+1] or L[0] as a sentinel, removing the bounds test from the scanning loop (see the sketch after this list)
      • could search "near" the previous search
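
      A C sketch of the sentinel idea; the 1-based layout, the spare slot at list[n+1], and the name sentinel_search are assumptions for illustration:

        /* Sentinel linear search: storing x in list[n+1] guarantees the
           scan stops, so the loop needs one test instead of two.
           Assumes data in list[1..n] with a free slot at list[n+1]. */
        int sentinel_search(int list[], int n, int x)
        {
            list[n + 1] = x;             /* the sentinel */
            int i = 1;
            while (list[i] != x)
                i++;
            return (i <= n) ? i : 0;     /* 0 = only the sentinel matched */
        }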
  2. Jump Search - assume that L is sorted in increasing order
    • What is the best element to query so that in the worst case we gain as much information as possible?
    • Should we start with L[1]?
    • look at L[k], L[2k], L[3k], ... {JUMP SEARCH}
      JUMP_SEARCH (List, lower, upper, jump, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. List is sorted in increasing order;
      	 upper >= lower > 0; upper - lower + 1 >= jump >= 1.}
      	 
      	 index = lower + jump - 1;
      	 while (upper > index and X > List[index])
      	 	index = index + jump;
      	if index < upper then
      		upper = index
      	return LINEAR_SEARCH (List, index - jump + 1, upper, X)
      
    • How many comparisons does JUMPSEARCH do in the worst case?
    • Let f(n,k) = m + k - 1 = [n/k] + k - 1 where m = # of probes, k = jump size, and [] is the floor function {see table 2.2}
    • Square Root Search
      • Want to find the best jump size, i.e. min over 1 <= k <= n of {[n/k] + k - 1}
      • let f(x) = n/x + x ==> f'(x) = -n/x^2 + 1
      • f'(x) = 0 ==> x^2 = n ==> x = +/-sqrt(n)
      • f''(x) = 2n/x^3 > 0 at x = sqrt(n)
      • Thus x = sqrt(n) is the minimum and we should probe every [sqrt(n)]th element
      • the square root search takes no more than >sqrt(n)< + [sqrt(n)] - 1 comparisons where >x< is the nearest integer function (cut the number of comparisons from n to 2sqrt(n) - 1 in the worst case)
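
      A C sketch of square root search following these conclusions; sqrt comes from math.h (link with -lm), and the sorted 1-based array and function name are assumptions:

        #include <math.h>

        /* Jump search with jump size floor(sqrt(n)) over sorted list[1..n].
           Probes every jump-th element, then scans the one block that can
           hold x. Returns the position of x, or 0 if x is absent. */
        int jump_search(const int list[], int n, int x)
        {
            int jump = (int)sqrt((double)n);
            if (jump < 1)
                jump = 1;
            int index = jump;
            while (index < n && x > list[index])
                index += jump;
            if (index > n)
                index = n;               /* clamp the final overshoot */
            for (int i = index - jump + 1; i <= index; i++)
                if (list[i] == x)
                    return i;
            return 0;
        }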
  3. Binary Search - divide the sublist into almost equal halves
    BINARY_SEARCH (List, lower, upper, X)
    	{Look for X in List[lower, upper]. Report its position if found,
    	 else report 0. upper >= lower > 0.}
    	 
    	if lower = upper then
    		if X = List[lower] then
    			return lower
    		else 
    			return 0
    	else
    		mid = [(lower+upper) / 2]
    		if X > List[mid] then
    			return BINARY_SEARCH(List, mid+1, upper, X)
    		else
    			return BINARY_SEARCH(List, lower, mid, X)
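
    A direct C transcription of BINARY_SEARCH; the 1-based bounds carry over from the pseudocode, and the name is an assumption:

      /* Binary search over sorted list[lower..upper] (1-based). mid uses
         lower + (upper - lower) / 2 to avoid the classic (lower + upper)
         integer-overflow bug; the floor is identical. */
      int binary_search(const int list[], int lower, int upper, int x)
      {
          if (lower == upper)
              return (list[lower] == x) ? lower : 0;
          int mid = lower + (upper - lower) / 2;
          if (x > list[mid])
              return binary_search(list, mid + 1, upper, x);
          else
              return binary_search(list, lower, mid, x);
      }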
    
    • Worst Cost
      • f(n) = 1 if n = 1; f([n/2]) + 1 if n > 1 where [] is floor
      • f is non-decreasing
      • f(n) = f([n/2]) + 1 solves to f(n) = [lg n] + 1 (equivalently, ceil(lg(n+1)))
        Proof:
        f(n) = f(n/2) + 1
        	 = {f(n/4) + 1} + 1
        	 = {{f(n/8) + 1} + 1} + 1
        	 :
        	 = f(n / 2^i) + i
        Continue until n = 2^i and therefore i = lg n (taking n to be a power of 2);
        f(n) = f(1) + lg n = 1 + lg n
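
        The closed form also holds when n is not a power of 2; a brute-force C check of the recurrence against [lg n] + 1 (helper names assumed):

          #include <assert.h>

          /* f follows the worst-cost recurrence; floor_lg computes [lg n]. */
          int f(int n) { return (n == 1) ? 1 : f(n / 2) + 1; }

          int floor_lg(int n)
          {
              int lg = 0;
              while (n > 1) { n /= 2; lg++; }
              return lg;
          }

          int main(void)
          {
              for (int n = 1; n <= 1000; n++)
                  assert(f(n) == floor_lg(n) + 1);   /* closed form matches */
              return 0;
          }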
        
      • Can we improve our search time still more? Is sqrt(lg n) possible? Is constant time possible?
    • Decision Trees
      Predictable (Deterministic) Algorithm: sequence of comparisons is the same for a fixed input.
      • Let A be any predictable search algorithm; it must have a first comparison between X and some L[i] ==> decision tree [node, children, tree, binary tree, ordered tree, root, leaf]
      • level - # of edges on the path between the root and the node (so the root is at level 0)
      • height - level of deepest leaf
      • Tree's height plus one is an upper bound on how many comparisons the algorithm must do to find X in L.
      • at least n nodes in the tree
      • every height m rooted binary tree has at most 2^(m+1) - 1 nodes [proof by induction]
      • Thus 2^(m+1) - 1 >= n ==> m + 1 >= lg(n + 1)
        m + 1 >= ceil(lg(n+1)) = [lg n] + 1
        Thus BINARY SEARCH has optimal worst cost within the comparison-based model
    • The Average Cost (assume that X is in L and equally likely to be anywhere in L)
      • average ~ (1·1 + 2·2 + 3·4 + 4·8 + ... + (lg n)·2^(lg n - 1)) / n = (1/n) · (sum of i·2^(i-1) for i from 1 to lg n)
      • this is roughly n(lg n - 1) / n = lg n - 1
      • f(n) = 0 if n = 0; 1 if n = 1; (([n/2]-1)/n)·f([n/2]-1) + (1/n)·0 + (([n/2])/n)·f([n/2]) + 1 if n > 1
      • multiply by n and replace n·f(n) with g(n) ==>
        g(n) = 0 if n = 0; 1 if n = 1; g([n/2]-1) + g([n/2]) + n if n > 1
      • subtract g(n) - g(n-1) and call the difference h(n) ==>
        h(n) = 1 if n = 1; h([n/2]) + 1 if n > 1, which is the same recurrence as the worst cost of BINARYSEARCH and solves to ceil(lg(n+1)) ==> f(n) is no worse than ceil(lg(n+1)), and the average cost when X is in L is at most two comparisons less than the worst cost
    • The Lower Bound on the Average Cost
      • model every search algorithm in the comparison-based model by a decision tree (must have at least n nodes & every node is a possible outcome)
      • the average # of comparisons = (path length / n) + 1
      • Build a binary search tree (greedy strategy) which minimizes the path length.
      • minimum average height of an n-node binary tree is
        ceil(lg(n+1)) - (2^ceil(lg(n+1)) - ceil(lg(n+1)) - 1)/n - 1, which is the same as the average cost of a successful BINARYSEARCH ==> optimal
    • Programming - rewrite to test equality at the end (see the sketch below); a tricky algorithm to get exactly right
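
    The pseudocode above already defers the equality test to the end; a C sketch of the iterative rewrite (names assumed) that also removes the tail recursion:

      /* Iterative binary search with the equality test done once, at the
         end: one 2-way comparison per halving step. */
      int binary_search_iter(const int list[], int lower, int upper, int x)
      {
          while (lower < upper) {
              int mid = lower + (upper - lower) / 2;
              if (x > list[mid])
                  lower = mid + 1;
              else
                  upper = mid;
          }
          return (list[lower] == x) ? lower : 0;
      }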
  4. Changing the Model
    Binary search is optimal, but shouldn't be used if:
    1. data is not or cannot be sorted
    2. data is sorted but is structured so that the probe cost is not the same everywhere
    3. data is sorted and probe costs are uniform, but we can exploit knowledge about the distribution of the data
    4. data is static so we know all possible search requests in advance
    • Randomized Linear Search
      CoinFlipSearch (List, lower, upper, X)
      	{look for X in List[lower...upper]. Report its position if found,
      	 else report 0. upper >= lower > 0 }
      	 
      	 flip a coin
      	 if heads then
      	 	index = lower
      		while (upper >= index) and List[index] != X
      			index++
      		if index > upper then
      			return 0
      		else
      			return index
      	else
      		index = upper
      		while (index >= lower) and List[index] != X
      			index--
      		if lower > index then
      			return 0
      		else
      			return index
      

      randomized algorithm - an algorithm that makes decisions using random numbers
      • worst cost(A) = max over all instances I of fA(I)
      • average cost(A) = sum(P(I)fA(I)) where P(I) is the probability that I occurs
      • expand to flipping a coin three times (8 choices) ==> the worst cost of each leaf algorithm can occur for a different input
      • use "expected" when it is for many algorithms and a fixed input, and "average" when it is for many inputs and a fixed algorithm
      • Randomizing makes the average cost easier to predict & randomizing decouples the algorithm's behaviour from its input
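      A worked instance of that decoupling for CoinFlipSearch: fix X at position i of L; the forward scan costs i comparisons and the backward scan costs n - i + 1, so the expectation over the coin alone is (LaTeX)

        E[\mathrm{cost} \mid X = L[i]] = \frac{1}{2}\, i + \frac{1}{2}\, (n - i + 1) = \frac{n+1}{2},

      independent of i.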
    • Searching Linked Lists [worst case requires n comparisons]
      LinkedListSearch (List, n, head, next, guesses, X)
      	{Look for X in the n element singly linked implicit list List.
      	 Report its position if found, else report 0. List is sorted in
      	 increasing order through next, List[head] is the smallest element of
      	 List, and the largest element of List has a next value of nil.
      	 n >= 1 and guesses is the number of samples of List examined.}
      	 
      	 largest = head
      	 for i = 1 to guesses
      	 	j = uniform(1, n) // returns random integer chosen with uniform probability
      		if X > List[j] and List[j] > List[largest]
      			largest = j
      	index = largest
      	while (next[index] != nil) and X > List[index]
      		index = next[index]
      	if List[index] = X then
      		return index
      	else
      		return 0
      

      worst expected cost = 2 sqrt(n) + o(sqrt(n))
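
      A C sketch of LinkedListSearch with the list stored implicitly in arrays (list[i] holds an element, next[i] the index of its successor, 0 standing in for nil); rand() as a stand-in for uniform(1, n), guesses ~ sqrt(n) to match the bound above, and all names are assumptions:

        #include <stdlib.h>

        /* Randomized search of a sorted singly linked implicit list.
           Returns the position of x, or 0 if x is absent. */
        int linked_list_search(const int list[], const int next[],
                               int n, int head, int guesses, int x)
        {
            int largest = head;
            for (int i = 1; i <= guesses; i++) {
                int j = 1 + rand() % n;        /* roughly uniform(1, n) */
                if (x > list[j] && list[j] > list[largest])
                    largest = j;               /* best starting point so far */
            }
            int index = largest;
            while (next[index] != 0 && x > list[index])
                index = next[index];
            return (list[index] == x) ? index : 0;
        }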
    • Interpolation Search - reduce the average cost IF we know the probability distribution of the universe of which L is a sample
      IF L is a sorted list of numbers which uniformly partitions the set of numbers, then X should be near L[ceil(pn)] where p = (X - L[1]) / (L[n] - L[1])
      the average cost is O(lg lg n), but the worst case can take n comparisons;
      an interleaved algorithm alternates the binary search with an interpolation search ==> worst cost O(lg n) and average cost O(lg lg n)
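
      A C sketch of plain interpolation search (no interleaving) on a sorted 1-based array of doubles; the names and the double element type are assumptions:

        /* Probe near the position a uniform distribution predicts for x.
           Returns the position of x, or 0 if x is absent. */
        int interpolation_search(const double list[], int n, double x)
        {
            int lower = 1, upper = n;
            while (lower <= upper && x >= list[lower] && x <= list[upper]) {
                int probe;
                if (list[upper] == list[lower])
                    probe = lower;                 /* avoid dividing by zero */
                else {
                    double p = (x - list[lower]) / (list[upper] - list[lower]);
                    probe = lower + (int)(p * (upper - lower));
                }
                if (list[probe] == x)
                    return probe;
                else if (x > list[probe])
                    lower = probe + 1;
                else
                    upper = probe - 1;
            }
            return 0;
        }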
    • Hashing - used when we know all possible search requests in advance (e.g. reserved words)
        a hash function should
      • be easy to compute
      • have values which are easy to compare
      • produce numbers of roughly the same size as the size of the search space
      • minimize the number of collisions
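
      A toy C hash function in the spirit of these criteria: a polynomial string hash folded into a table of size m; the multiplier 31 and the names are illustrative choices, not prescribed by the notes:

        /* Cheap to compute, yields small comparable numbers in [0, m). */
        unsigned hash(const char *key, unsigned m)
        {
            unsigned h = 0;
            while (*key)
                h = h * 31 + (unsigned char)*key++;
            return h % m;   /* sized to the table / search space */
        }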
  5. Coda--Apples and Oranges
