
CS421 - Advanced Data Structures and Algorithm Development - Chapter 2: SEARCHING

Good search algorithms depend on the search problem. Assume that: we are searching for a well-defined and fixed target; we have unbounded time; the domain is static and finite; all probes cost the same; we never make a mistake; we remember information gathered during the search; and we use a key to represent each element.
  1. Linear Search
    • The Problem: given a list, L, of n elements and one more element, X, find X in L. If X is in L, identify at least one element of L that it is equal to.
    • The Model
      • Algorithm is run on an errorless, sequential, digital computer, and will be translated into a program in an imperative language.
      • All elements of L are unique
      • We can compare X or any element of L with any element of L and determine if the two elements are equal, or which one is less.
      • We can't find ordering information about X and the elements of L except by comparing elements.
      • GOAL: minimize the number of "3-way" element-by-element comparisons.
    • The Algorithm - reduce the problem to something simpler
      LINEAR_SEARCH (List, lower, upper, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. upper >= lower > 0.}
      	 
      	 if List[lower] = X then
      	 	return lower
      	 else if lower = upper then
      	 	return 0
      	 else
      	 	return LINEAR_SEARCH (List, lower+1, upper, X)	 
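
      A minimal C rendering of this pseudocode, keeping the notes' 1-based bounds; the int element type and the name linear_search are illustrative assumptions, not part of the notes:

        /* Recursive linear search over list[lower..upper] (1-based).
           Returns the position of x, or 0 if x is absent. */
        int linear_search(const int list[], int lower, int upper, int x)
        {
            if (list[lower] == x)
                return lower;
            else if (lower == upper)
                return 0;
            else
                return linear_search(list, lower + 1, upper, x);
        }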
      
    • Analysis
      1. find upper bound on the worst cost of the algorithm
      2. find lower bound on the worst cost of the problem
      3. find corresponding bounds on the average cost
    • The Worst Cost
      number of comparisons = f(n) = 1 if n=1; f(n-1)+1 if n > 1
      LINEAR_SEARCH (List, lower, upper, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. upper >= lower > 0.}
      	 
      	 index = lower
      	 while (upper >= index and List[index] != X)
      	 	index++
      	if index > upper then
      		return 0
      	else
      		return index
      
      Prove that: f(n) = 1 if n=1; f(n-1)+1 if n > 1 ==> f(n) = n
      NOTE: Proof by induction only works if we know the answer
      Can use the substitute-and-guess method to try to find a possible answer
      f(n) = f(n-1) + 1			[= f(n-1) + 1]
      	 = {f(n-2) + 1} + 1		[= f(n-2) + 2]
      	 :
      	 = f(n-i) + i			( this is a guess, NOT a proof )
      
      Let i = n - 1: f(n) = f(n - (n-1)) + (n - 1) = f(1) + n - 1 = 1 + n - 1 = n
      Another method is guess and test:
      let f(n) = rn + s
      r + s = 1
      rn + s = r(n-1) + s + 1 ==> r = 1 and s = 0 ==> f(n) = n
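      With the candidate f(n) = n in hand, induction closes the argument; a minimal sketch (LaTeX):

        % Claim: f(1) = 1 and f(n) = f(n-1) + 1 for n > 1 imply f(n) = n.
        \textbf{Base:} $f(1) = 1$.
        \textbf{Step:} assume $f(n-1) = n - 1$ for some $n > 1$; then
        $f(n) = f(n-1) + 1 = (n - 1) + 1 = n$. \qed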
    • The Lower Bound on the Worst Cost
      NOTE: Let g(n) be an upper bound on the worst cost of an algorithm A, and h(n) a lower bound on the worst cost of the problem P within the model M. If beyond some point g(n) - h(n) <= c, for some constant c, then there is no point looking for a better algorithm; up to a constant, A is as good a solution as is possible within M. A is worst case optimal for problem P.
      If g(n) - h(n) grows beyond every constant c, but beyond some point g(n)/h(n) <= c, for some non-zero constant c, then within M, A is worst case asymptotically optimal for P.
      To prove that LINEAR SEARCH has optimal worst cost, show that any algorithm that does fewer comparisons can be forced to give a wrong answer (i.e. no correct algorithm is "better")
      Proof by contradiction
      PAUSE: Couldn't an algorithm do something clever in the preprocessing stage so that it wouldn't have to compare X to a large number of elements?
    • Probability Theory (expressing the uncertainty of complicated events in terms of the assumed probability of simpler events)
      • sample space of an experiment is the set of all outcomes of the experiment
      • event - subset of the sample space
      • disjoint - if two events share no outcomes
      • probability of an event is the ratio of the number of favourable outcomes to the total number of possible outcomes (assuming all outcomes are equally likely)
      • Two events are independent if the probability of each event does not depend on the other
    • The Average Cost [How many comparisons does LINEAR SEARCH do on average?]
      • Assume that X is equally likely to be any of the elements of L
      • Assume that the only thing varying is X's position in L, if it is in L.
      • If X must be in L, then LINEAR SEARCH examines roughly half of L.
      • If X is equally likely to not be in L as to be any one of the elements of L, then LINEAR SEARCH examines roughly half of L.
      • If X is equally likely to be in L as to not be in L, then LINEAR SEARCH examines roughly three-quarters of L.
      • If X cannot be in L, then LINEAR SEARCH examines all of L.
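      A worked check of the first and third bullets, assuming each of the n positions of L is equally likely (LaTeX):

        % X known to be in L: finding it at position i costs i comparisons.
        E[\mathrm{cost} \mid X \in L] = \sum_{i=1}^{n} \frac{i}{n} = \frac{n+1}{2}
        % X in L with probability 1/2; a failed scan costs n comparisons.
        E[\mathrm{cost}] = \frac{1}{2} \cdot \frac{n+1}{2} + \frac{1}{2} \cdot n = \frac{3n+1}{4} \approx \frac{3n}{4}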
    • The Lower Bound on the Average Cost
      Given a problem, P, a model, M, the set of algorithms, AM, solving P within M, together with the set of all size n instances of P, In, and a cost function fA,
      the lower bound on the average cost of P = min over A in AM of { sum over I in In of P(I)fA(I) }
      If all comparisons must involve X, then LINEAR SEARCH has optimal average cost
      PAUSE: Why does every arrangement give the same average?
      Every average case optimal algorithm can only compare X to elements of L. Since the algorithm must also work in the worst case, it must compare X to every element of L. All algorithms cost the same as LINEAR SEARCH, so LINEAR SEARCH is optimal in the average case.
    • Programming
      • iterative solution is usually superior to the recursive for most languages and machines
      • could place X in L[n+1] or L[0] as a sentinel, removing the bounds test from the scanning loop (see the sketch after this list)
      • could search "near" the previous search
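
      A C sketch of the sentinel idea; the 1-based layout, the spare slot at list[n+1], and the name sentinel_search are assumptions for illustration:

        /* Sentinel linear search: storing x in list[n+1] guarantees the
           scan stops, so the loop needs one test instead of two.
           Assumes data in list[1..n] with a free slot at list[n+1]. */
        int sentinel_search(int list[], int n, int x)
        {
            list[n + 1] = x;             /* the sentinel */
            int i = 1;
            while (list[i] != x)
                i++;
            return (i <= n) ? i : 0;     /* 0 = only the sentinel matched */
        }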
  2. Jump Search - assume that L is sorted in increasing order
    • What is the best element to query so that in the worst case we gain as much information as possible?
    • Should we start with L[1]?
    • look at L[k], L[2k], L[3k], ... {JUMP SEARCH}
      JUMP_SEARCH (List, lower, upper, jump, X)
      	{Look for X in List[lower, upper]. Report its position if found,
      	 else report 0. List is sorted in increasing order;
      	 upper >= lower > 0; upper - lower + 1 >= jump >= 1.}
      	 
      	 index = lower + jump - 1;
      	 while (upper > index and X > List[index])
      	 	index = index + jump;
      	if index < upper then
      		upper = index
      	return LINEAR_SEARCH (List, index - jump + 1, upper, X)
      
    • How many comparisons does JUMPSEARCH do in the worst case?
    • Let f(n,k) = m + k - 1 = [n/k] + k - 1 where m = # of probes, k = jump size, and [] is the floor function {see table 2.2}
    • Square Root Search
      • Want to find the best jump size, i.e. min over 1 <= k <= n of {[n/k] + k - 1}
      • let f(x) = n/x + x ==> f'(x) = -n/x^2 + 1
      • f'(x) = 0 ==> x^2 = n ==> x = +/-sqrt(n)
      • f''(x) = 2n/x^3 > 0 at x = sqrt(n)
      • Thus x = sqrt(n) is the minimum and we should probe every [sqrt(n)]th element
      • the square root search takes no more than >sqrt(n)< + [sqrt(n)] - 1 comparisons where >x< is the nearest integer function (cut the number of comparisons from n to 2sqrt(n) - 1 in the worst case)
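
      A C sketch of square root search following these conclusions; sqrt comes from math.h (link with -lm), and the sorted 1-based array and function name are assumptions:

        #include <math.h>

        /* Jump search with jump size floor(sqrt(n)) over sorted list[1..n].
           Probes every jump-th element, then scans the one block that can
           hold x. Returns the position of x, or 0 if x is absent. */
        int jump_search(const int list[], int n, int x)
        {
            int jump = (int)sqrt((double)n);
            if (jump < 1)
                jump = 1;
            int index = jump;
            while (index < n && x > list[index])
                index += jump;
            if (index > n)
                index = n;               /* clamp the final overshoot */
            for (int i = index - jump + 1; i <= index; i++)
                if (list[i] == x)
                    return i;
            return 0;
        }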
  3. Binary Search - divide the sublist into almost equal halves
    BINARY_SEARCH (List, lower, upper, X)
    	{Look for X in List[lower, upper]. Report its position if found,
    	 else report 0. upper >= lower > 0.}
    	 
    	if lower = upper then
    		if X = List[lower] then
    			return lower
    		else 
    			return 0
    	else
    		mid = [(lower+upper) / 2]
    		if X > List[mid] then
    			return BINARY_SEARCH(List, mid+1, upper, X)
    		else
    			return BINARY_SEARCH(List, lower, mid, X)
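
    A direct C transcription of BINARY_SEARCH; the 1-based bounds carry over from the pseudocode, and the name is an assumption:

      /* Binary search over sorted list[lower..upper] (1-based). mid uses
         lower + (upper - lower) / 2 to avoid the classic (lower + upper)
         integer-overflow bug; the floor is identical. */
      int binary_search(const int list[], int lower, int upper, int x)
      {
          if (lower == upper)
              return (list[lower] == x) ? lower : 0;
          int mid = lower + (upper - lower) / 2;
          if (x > list[mid])
              return binary_search(list, mid + 1, upper, x);
          else
              return binary_search(list, lower, mid, x);
      }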
    
    • Worst Cost
      • f(n) = 1 if n = 1; f([n/2]) + 1 if n > 1 where [] is floor
      • f is non-decreasing
      • f(n) = f([n/2]) + 1 solves to f(n) = [lg n] + 1 (equivalently, ceil(lg(n+1)))
        Proof:
        f(n) = f(n/2) + 1
        	 = {f(n/4) + 1} + 1
        	 = {{f(n/8) + 1} + 1} + 1
        	 :
        	 = f(n / 2^i) + i
        Continue until n = 2^i and therefore i = lg n (taking n to be a power of 2);
        f(n) = f(1) + lg n = 1 + lg n
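
        The closed form also holds when n is not a power of 2; a brute-force C check of the recurrence against [lg n] + 1 (helper names assumed):

          #include <assert.h>

          /* f follows the worst-cost recurrence; floor_lg computes [lg n]. */
          int f(int n) { return (n == 1) ? 1 : f(n / 2) + 1; }

          int floor_lg(int n)
          {
              int lg = 0;
              while (n > 1) { n /= 2; lg++; }
              return lg;
          }

          int main(void)
          {
              for (int n = 1; n <= 1000; n++)
                  assert(f(n) == floor_lg(n) + 1);   /* closed form matches */
              return 0;
          }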
        
      • Can we improve our search time still more? Is sqrt(lg n) possible? Is constant time possible?
    • Decision Trees
      Predictable (Deterministic) Algorithm: sequence of comparisons is the same for a fixed input.
      • Let A be any predictable search algorithm; it must have a first comparison between X and some L[i] ==> decision tree [node, children, tree, binary tree, ordered tree, root, leaf]
      • level - # of edges on the path between the root and the node (so the root is at level 0)
      • height - level of deepest leaf
      • Tree's height plus one is an upper bound on how many comparisons the algorithm must do to find X in L.
      • at least n nodes in the tree
      • every height m rooted binary tree has at most 2^(m+1) - 1 nodes [proof by induction]
      • Thus 2^(m+1) - 1 >= n ==> m + 1 >= lg(n + 1)
        m + 1 >= ceil(lg(n+1)) = [lg n] + 1
        Thus BINARY SEARCH has optimal worst cost within the comparison-based model
    • The Average Cost (assume that X is in L and equally likely to be anywhere in L)
      • average ~ (1·1 + 2·2 + 3·4 + 4·8 + ... + (lg n)·2^(lg n - 1)) / n = (1/n) · (sum of i·2^(i-1) for i from 1 to lg n)
      • this is roughly n(lg n - 1) / n = lg n - 1
      • f(n) = 0 if n = 0; 1 if n = 1; (([n/2]-1)/n)·f([n/2]-1) + (1/n)·0 + (([n/2])/n)·f([n/2]) + 1 if n > 1
      • multiply by n and replace n·f(n) with g(n) ==>
        g(n) = 0 if n = 0; 1 if n = 1; g([n/2]-1) + g([n/2]) + n if n > 1
      • subtract g(n) - g(n-1) and call the difference h(n) ==>
        h(n) = 1 if n = 1; h([n/2]) + 1 if n > 1, which is the same recurrence as the worst cost of BINARYSEARCH and solves to ceil(lg(n+1)) ==> f(n) is no worse than ceil(lg(n+1)), and the average cost when X is in L is at most two comparisons less than the worst cost
    • The Lower Bound on the Average Cost
      • model every search algorithm in the comparison-based model by a decision tree (must have at least n nodes & every node is a possible outcome)
      • the average # of comparisons = (path length / n) + 1
      • Build a binary search tree (greedy strategy) which minimizes the path length.
      • minimum average height of an n-node binary tree is
        ceil(lg(n+1)) - (2^ceil(lg(n+1)) - ceil(lg(n+1)) - 1)/n - 1, which is the same as the average cost of a successful BINARYSEARCH ==> optimal
    • Programming - rewrite to test equality at the end (see the sketch below); a tricky algorithm to get exactly right
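
    The pseudocode above already defers the equality test to the end; a C sketch of the iterative rewrite (names assumed) that also removes the tail recursion:

      /* Iterative binary search with the equality test done once, at the
         end: one 2-way comparison per halving step. */
      int binary_search_iter(const int list[], int lower, int upper, int x)
      {
          while (lower < upper) {
              int mid = lower + (upper - lower) / 2;
              if (x > list[mid])
                  lower = mid + 1;
              else
                  upper = mid;
          }
          return (list[lower] == x) ? lower : 0;
      }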
  4. Changing the Model
    Binary search is optimal, but shouldn't be used if:
    1. data is not or cannot be sorted
    2. data is sorted but is structured so that the probe cost is not the same everywhere
    3. data is sorted and probe costs are uniform, but we can exploit knowledge about the distribution of the data
    4. data is static so we know all possible search requests in advance
    • Randomized Linear Search
      CoinFlipSearch (List, lower, upper, X)
      	{look for X in List[lower...upper]. Report its position if found,
      	 else report 0. upper >= lower > 0 }
      	 
      	 flip a coin
      	 if heads then
      	 	index = lower
      		while (upper >= index) and List[index] != X
      			index++
      		if index > upper then
      			return 0
      		else
      			return index
      	else
      		index = upper
      		while (index >= lower) and List[index] != X
      			index--
      		if lower > index then
      			return 0
      		else
      			return index
      

      randomized algorithm - an algorithm that makes decisions using random numbers
      • worst cost(A) = max over all instances I of fA(I)
      • average cost(A) = sum(P(I)fA(I)) where P(I) is the probability that I occurs
      • expand to flipping a coin three times (8 choices) ==> the worst cost of each leaf algorithm can occur for a different input
      • use "expected" when it is for many algorithms and a fixed input, and "average" when it is for many inputs and a fixed algorithm
      • Randomizing makes the average cost easier to predict & randomizing decouples the algorithm's behaviour from its input
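      A worked instance of that decoupling for CoinFlipSearch: fix X at position i of L; the forward scan costs i comparisons and the backward scan costs n - i + 1, so the expectation over the coin alone is (LaTeX)

        E[\mathrm{cost} \mid X = L[i]] = \frac{1}{2}\, i + \frac{1}{2}\, (n - i + 1) = \frac{n+1}{2},

      independent of i.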
    • Searching Linked Lists [worst case requires n comparisons]
      LinkedListSearch (List, n, head, next, guesses, X)
      	{Look for X in the n element singly linked implicit list List.
      	 Report its position if found, else report 0. List is sorted in
      	 increasing order through next, List[head] is the smallest element of
      	 List, and the largest element of List has a next value of nil.
      	 n >= 1 and guesses is the number of samples of List examined.}
      	 
      	 largest = head
      	 for i = 1 to guesses
      	 	j = uniform(1, n) // returns random integer chosen with uniform probability
      		if X > List[j] and List[j] > List[largest]
      			largest = j
      	index = largest
      	while (next[index] != nil) and X > List[index]
      		index = next[index]
      	if List[index] = X then
      		return index
      	else
      		return 0
      

      worst expected cost = 2 sqrt(n) + o(sqrt(n))
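
      A C sketch of LinkedListSearch with the list stored implicitly in arrays (list[i] holds an element, next[i] the index of its successor, 0 standing in for nil); rand() as a stand-in for uniform(1, n), guesses ~ sqrt(n) to match the bound above, and all names are assumptions:

        #include <stdlib.h>

        /* Randomized search of a sorted singly linked implicit list.
           Returns the position of x, or 0 if x is absent. */
        int linked_list_search(const int list[], const int next[],
                               int n, int head, int guesses, int x)
        {
            int largest = head;
            for (int i = 1; i <= guesses; i++) {
                int j = 1 + rand() % n;        /* roughly uniform(1, n) */
                if (x > list[j] && list[j] > list[largest])
                    largest = j;               /* best starting point so far */
            }
            int index = largest;
            while (next[index] != 0 && x > list[index])
                index = next[index];
            return (list[index] == x) ? index : 0;
        }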
    • Interpolation Search - reduce the average cost IF we know the probability distribution of the universe of which L is a sample
      IF L is a sorted list of numbers which uniformly partitions the set of numbers, then X should be near L[ceil(pn)] where p = (X - L[1]) / (L[n] - L[1])
      the average cost is O(lg lg n), but the worst case can take n comparisons;
      an interleaved algorithm alternates the binary search with an interpolation search ==> worst cost O(lg n) and average cost O(lg lg n)
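
      A C sketch of plain interpolation search (no interleaving) on a sorted 1-based array of doubles; the names and the double element type are assumptions:

        /* Probe near the position a uniform distribution predicts for x.
           Returns the position of x, or 0 if x is absent. */
        int interpolation_search(const double list[], int n, double x)
        {
            int lower = 1, upper = n;
            while (lower <= upper && x >= list[lower] && x <= list[upper]) {
                int probe;
                if (list[upper] == list[lower])
                    probe = lower;                 /* avoid dividing by zero */
                else {
                    double p = (x - list[lower]) / (list[upper] - list[lower]);
                    probe = lower + (int)(p * (upper - lower));
                }
                if (list[probe] == x)
                    return probe;
                else if (x > list[probe])
                    lower = probe + 1;
                else
                    upper = probe - 1;
            }
            return 0;
        }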
    • Hashing - used when we know all possible search requests in advance (e.g. reserved words)
        a hash function should
      • be easy to compute
      • have values which are easy to compare
      • produce numbers of roughly the same size as the size of the search space
      • minimize the number of collisions
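
      A toy C hash function in the spirit of these criteria: a polynomial string hash folded into a table of size m; the multiplier 31 and the names are illustrative choices, not prescribed by the notes:

        /* Cheap to compute, yields small comparable numbers in [0, m). */
        unsigned hash(const char *key, unsigned m)
        {
            unsigned h = 0;
            while (*key)
                h = h * 31 + (unsigned char)*key++;
            return h % m;   /* sized to the table / search space */
        }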
  5. Coda--Apples and Oranges
