3.3. Search in Sorted Arrays

3.3.1. Analysis

For large collections of records that are searched repeatedly, sequential search is unacceptably slow. One way to reduce search time is to preprocess the records by sorting them. Given a sorted array, an obvious improvement over simple linear search is to test if the current element in L is greater than \(K\). If it is, then we know that \(K\) cannot appear later in the array, and we can quit the search early. But this still does not improve the worst-case cost of the algorithm.
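As a quick illustration, here is a minimal Java sketch of sequential search with this early exit (the method name and signature are our own; the array is assumed to be sorted ascending):

```java
/**
 * Sequential search in a sorted array with an early exit.
 * Returns the index of K in L, or -1 if K is absent.
 */
static int sortedSequentialSearch(int[] L, int K) {
    for (int i = 0; i < L.length; i++) {
        if (L[i] == K) return i;     // found it
        if (L[i] > K) return -1;     // K cannot appear later: quit early
    }
    return -1;                       // ran off the end without finding K
}
```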

3.3.1.1. Jump Search

We can also observe that if we look first at position 1 in sorted array L and find that \(K\) is bigger, then we rule out position 0 as well as position 1. Because more is often better, what if we look at position 2 in L and find that \(K\) is bigger yet? This rules out positions 0, 1, and 2 with one comparison. What if we carry this to the extreme and look first at the last position in L and find that \(K\) is bigger? Then we know in one comparison that \(K\) is not in L. This is useful to know, but what is wrong with the conclusion that we should always start by looking at the last position? The problem is that, while we learn a lot sometimes (in one comparison we might learn that \(K\) is not in the list), usually we learn only a little bit (that the last element is not \(K\)).

The question then becomes: What is the right amount to jump? This leads us to an algorithm known as Jump Search. For some value \(j\), we check every \(j\)th element in L, that is, we check elements \(\mathbf{L}[j]\), \(\mathbf{L}[2j]\), and so on. So long as \(K\) is greater than the values we are checking, we continue on. But when we reach a value in L greater than \(K\), we do a linear search on the piece of length \(j-1\) that we know brackets \(K\) if it is in the list.
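Here is a minimal Java sketch of this idea (our own naming; 0-based indexing, so the probes fall at positions \(j-1\), \(2j-1\), and so on; the jump size \(j = \lfloor\sqrt{n}\rfloor\) is the value derived in the analysis below):

```java
/**
 * Jump Search sketch: probe every j'th element, then scan the piece
 * that brackets K. Assumes L is sorted ascending.
 */
static int jumpSearch(int[] L, int K) {
    int n = L.length;
    if (n == 0) return -1;
    int j = Math.max(1, (int) Math.sqrt(n)); // jump size from the analysis below
    int prev = 0;                            // start of the piece to scan
    int probe = j - 1;
    while (probe < n && L[probe] < K) {      // keep jumping while K is bigger
        prev = probe + 1;
        probe += j;
    }
    int end = Math.min(probe, n - 1);        // this piece brackets K, if present
    for (int i = prev; i <= end; i++) {
        if (L[i] == K) return i;
        if (L[i] > K) break;                 // early exit: K cannot appear later
    }
    return -1;
}
```

With \(j = \sqrt{n}\), this performs at most roughly \(2\sqrt{n}\) three-way comparisons, matching the analysis that follows.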

If we define \(m\) such that \(mj \leq n < (m+1)j\), then the total cost of this algorithm is at most \(m + j - 1\) 3-way comparisons. (They are 3-way because at each comparison of \(K\) with some \(\mathbf{L}[i]\) we need to know if \(K\) is less than, equal to, or greater than \(\mathbf{L}[i]\).) Therefore, the cost to run the algorithm on \(n\) items with a jump of size \(j\) is

\[\mathbf{T}(n, j) = m + j - 1 =\left\lfloor \frac{n}{j} \right\rfloor + j - 1.\]

What is the best value that we can pick for \(j\)? We want to minimize the cost:

\[\min_{1 \leq j \leq n} \left\{\left\lfloor\frac{n}{j}\right\rfloor +j - 1\right\}\]

Define \(f(j) = n/j + j - 1\), take the derivative, and solve \(f'(j) = 0\) to find the minimum, which is \(j = \sqrt{n}\). In this case, the worst-case cost will be roughly \(2\sqrt{n}\).
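Treating \(j\) as continuous, the calculation is a routine calculus sketch:

\[f(j) = \frac{n}{j} + j - 1, \qquad f'(j) = -\frac{n}{j^2} + 1 = 0 \quad\Rightarrow\quad j^2 = n \quad\Rightarrow\quad j = \sqrt{n}.\]

Substituting back gives \(\mathbf{T}(n, \sqrt{n}) = \sqrt{n} + \sqrt{n} - 1 = 2\sqrt{n} - 1\).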

This example invokes a basic principle of algorithm design. We want to balance the work done while selecting a sublist with the work done while searching a sublist. In general, it is a good strategy to make subproblems of equal effort. This is an example of a divide and conquer algorithm.

What if we extend this idea to three levels? We would first make jumps of some size \(j\) to find a sublist of size \(j-1\) whose end values bracket value \(K\). We would then work through this sublist by making jumps of some smaller size, say \(j_1\). Finally, once we find a bracketed sublist of size \(j_1 - 1\), we would do sequential search to complete the process.

Doing two levels of jumping followed by a sequential search probably sounds convoluted. While it might make sense to do a two-level algorithm (that is, jump search jumps to find a sublist and then does sequential search on the sublist), it almost never seems to make sense to do a three-level algorithm. Instead, when we go beyond two levels, we nearly always generalize by using recursion. This leads us to the most commonly used search algorithm for sorted arrays, the binary search.

3.3.1.2. Binary Search

You are probably pretty familiar with Binary Search already. So that we have a concrete example to discuss, here is an implementation.

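What follows is a minimal Java sketch of the standard algorithm (our own naming and signature; the array is assumed to be sorted ascending):

```java
/**
 * Binary Search sketch: return the position of K in sorted array L,
 * or -1 if K is not in L.
 */
static int binarySearch(int[] L, int K) {
    int low = 0, high = L.length - 1;
    while (low <= high) {                 // stop when the range is empty
        int mid = low + (high - low) / 2; // check middle of the range
        if (K < L[mid]) {
            high = mid - 1;               // K, if present, is in the left half
        } else if (K > L[mid]) {
            low = mid + 1;                // K, if present, is in the right half
        } else {
            return mid;                   // found it
        }
    }
    return -1;                            // K is not in L
}
```

Writing the midpoint as low + (high - low) / 2 rather than (low + high) / 2 avoids integer overflow on very large arrays; otherwise this is the standard three-way-comparison loop analyzed below.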
Of course you know that Binary Search is far better than Sequential Search. Why would that be? Because we have additional information to work with that we do not have when the list is unsorted. You probably already “know” that the standard binary search algorithm has a worst-case cost of \(O(\log n)\). Let’s do the math to make sure that it really is in \(O(\log n)\), and see how to handle the nasty details of modeling the exact behavior of a recursive algorithm. After that, we can deal with proving that Binary Search is indeed optimal (at least in the worst case) for solving the problem of search in a sorted list.

If we are willing to be casual about our analysis, we can reason that we look at one element (for a cost of one), and then repeat the process on half of the array. This would give us a recurrence that looks like \(f(n) = 1 + f(n/2)\). But if we want to be more precise, then we need to think carefully about what is going on in the worst case. First, we should notice that we are doing a little more than cutting the array in half. We never look again at a particular position that we test. For example, if the input size is nine, then we actually look at position 4 (since \((9-0)/2 = 4\) when rounded down), and we then either continue to consider four positions to the left (positions 0 to 3) or four positions to the right (positions 5 to 8). But what if there are ten elements? Then we actually look at position 5 (since \((10-0)/2 = 5\)). We will then either need to continue dealing with five positions to the left (positions 0 to 4), or four positions to the right. Which means that in the worst case, we are looking at a little less than half when the array size is odd, or exactly half when the array size is even. To capture this, we can use the floor function, to get an exact worst-case model as follows:

\[\begin{split}f(n) = \left\{\begin{array}{ll}1 & n=1\\f(\lfloor n/2 \rfloor) + 1 & n > 1\end{array}\right.\end{split}\]

Since \(n/2 \geq \lfloor n/2 \rfloor\), and since \(f(n)\) is assumed to be non-decreasing (since adding more elements won’t decrease the work), we can estimate the upper bound with the simplification \(f(n) = f(n/2) + 1\).

This recurrence is fairly easy to solve via expansion:

\[\begin{split}\begin{eqnarray*}f(n) &=& f(n/2) + 1\\&=& \{f(n/4) + 1\} + 1\\&=& \{\{f(n/8) + 1\} + 1\} + 1\end{eqnarray*}\end{split}\]

Then, setting \(i = \log n\) so that \(n/2^i = 1\), collapse to

\[f(n) = f(n/2^i) + i = f(1) + \log n = \log n + 1\]
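As a quick sanity check on a small instance (our example), take \(n = 16\):

\[f(16) = f(8) + 1 = f(4) + 2 = f(2) + 3 = f(1) + 4 = 5 = \log 16 + 1.\]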

Now, we can prove that this is correct with induction.

For the base case, \(f(1) = 1 = \log 1 + 1\). For the induction step, assume by the IH that \(f(n/2) = \log(n/2) + 1\). Then

\[\begin{split}\begin{eqnarray*}f(n/2) + 1 &=& (\log(n/2) + 1) + 1\\&=& (\log n - 1 + 1) + 1\\&=& \log n + 1 = f(n).\end{eqnarray*}\end{split}\]

How do we calculate the average cost for Binary Search? This requires some modeling, because we need to know things about the probabilities of the various inputs. We will estimate given these assumptions:

  1. \(K\) is in L.

  2. \(K\) is equally likely to be in any position.

  3. \(n = 2^k - 1\) for some non-negative integer \(k\).

What is the cost?

  • There is one chance to hit in one probe.

  • There are two chances to hit in two probes.

  • There are \(2^{i-1}\) chances to hit in \(i\) probes.

  • \(i \leq k\).

What is the resulting equation?

\[\frac{1\times 1 + 2\times 2 + 3 \times 4 + \cdots + \log n \cdot 2^{\log n-1}}{n} = \frac{1}{n}\sum_{i=1}^{\log n}i 2^{i-1}\]

Note that \(2^{\log n-1} = n/2\).

To solve the summation:

\[\begin{split}\begin{eqnarray*}\sum_{i=1}^k i2^{i-1} &=& \sum_{i=0}^{k-1}(i+1)2^i= \sum_{i=0}^{k-1} i 2^i + \sum_{i=0}^{k-1} 2^i\\&=& 2 \sum_{i=0}^{k-1} i 2^{i-1} + 2^k - 1\\&=& 2 \sum_{i=1}^{k} i 2^{i-1} - k 2^k + 2^k - 1\end{eqnarray*}\end{split}\]

Note that in the above series of equations, we change variables: \(i \rightarrow i+1\).

Now what? Subtract from the original!

\[\sum_{i=1}^{k} i 2^{i-1} = k 2^k - 2^k + 1 = (k - 1)2^k + 1.\]
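A quick check of the closed form at \(k = 3\) (our example):

\[\sum_{i=1}^{3} i 2^{i-1} = 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 4 = 17 = (3-1)2^3 + 1.\]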


Now we come back to solving the original equation. Since we have a closed-form solution for the summation in hand, we can restate the equation with the appropriate variable substitutions.

\[\begin{split}\begin{eqnarray*}\frac{1}{n}\sum_{i=1}^{\log n}i 2^{i-1} &=&\frac{(\log n - 1)2^{\log n} + 1}{n}\\&=& \frac{n (\log n -1) + 1}{n}\\&\approx& \log n - 1\end{eqnarray*}\end{split}\]
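For instance (our numbers, using the formulas above), with \(k = 4\) and \(n = 2^4 - 1 = 15\), the sum is \(\sum_{i=1}^{4} i 2^{i-1} = (4-1)2^4 + 1 = 49\), so the average is \(49/15 \approx 3.27\) probes against a worst case of 4.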

So the average cost is only about one or two comparisons less than the worst cost.

If we want to relax the assumption that \(n = 2^k - 1\), we get this as the exact cost:

\[\begin{split}f(n) = \left\{\begin{array}{ll}0 & n=0\\1 & n=1\\\frac{\lceil n/2 \rceil - 1}{n}f(\lceil n/2 \rceil - 1) + \frac{1}{n}\cdot 0 + \frac{\lfloor n/2 \rfloor}{n}f(\lfloor n/2 \rfloor) + 1 & n > 1\end{array}\right.\end{split}\]

Identify each of the components of this equation as follows (a quick numeric check of the recurrence appears after this list):

  • Left side: \(K < \mathbf{L}[i]\)

  • \(\mathbf{L}[i] = K\) has no additional cost, with chance \(1/n\)

  • Right side: \(K > \mathbf{L}[i]\)
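To see that this exact model behaves like the earlier estimate, here is a small Java sketch (entirely our own) that evaluates the recurrence directly. For \(n = 15\) it gives \(49/15 \approx 3.27\), matching the calculation above, and for large \(n\) it stays within about one probe of \(\log n - 1\):

```java
/** Evaluate the exact average-cost recurrence for Binary Search. */
public class BinarySearchAverageCost {

    // f(n): expected probes, assuming K is in L and equally likely
    // to be in any of the n positions.
    static double f(int n) {
        if (n == 0) return 0.0;
        if (n == 1) return 1.0;
        int left = (n + 1) / 2 - 1;   // ceil(n/2) - 1 positions left of the probe
        int right = n / 2;            // floor(n/2) positions right of the probe
        return ((double) left / n) * f(left)    // case K < L[i]
             + (1.0 / n) * 0.0                  // case L[i] = K: no extra cost
             + ((double) right / n) * f(right)  // case K > L[i]
             + 1.0;                             // the probe itself
    }

    public static void main(String[] args) {
        for (int n : new int[] {15, 1023, 1_000_000}) {
            System.out.printf("f(%d) = %.3f   log2(n) - 1 = %.3f%n",
                    n, f(n), Math.log(n) / Math.log(2) - 1);
        }
    }
}
```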

3.3.1.3. Lower Bounds Proof

So, \(O(\log n)\) time for Binary Search seems pretty good. Can we do better than this? We can prove that this is the best possible algorithm in the worst case for searching in a sorted list by using a proof similar to that used to show the lower bound on sorting.

We use the decision tree to model our algorithm. Unlike when searching an unsorted list, comparisons between elements of L tell us nothing new about their relative order (since L is already sorted), so we consider only comparisons between \(K\) and an element in L. At the root of the decision tree, our knowledge rules out no positions in L, so all are potential candidates. As we take branches in the decision tree based on the result of comparing \(K\) to an element in L, we gradually rule out potential candidates. Eventually we reach a leaf node in the tree representing the single position in L that can contain \(K\). There must be at least \(n+1\) nodes in the tree because we have \(n+1\) distinct positions that \(K\) can be in (any position in L, plus not in L at all). Some path in the tree must be at least \(\log n\) levels deep, and the deepest node in the tree represents the worst case for that algorithm. Thus, any algorithm on a sorted array requires at least \(\Omega(\log n)\) comparisons in the worst case.

We can modify this proof to find the average cost lower bound. Again, we model algorithms using decision trees. Except now we are not interested in the depth of the deepest node (the worst case), and therefore not in the tree whose deepest node is as shallow as possible. Instead, we are interested in knowing the minimum possible “average depth” of the leaf nodes. Define the total path length as the sum of the levels for each node. The cost of an outcome is the level of the corresponding node plus 1. The average cost of the algorithm is the average cost of the outcomes (total path length / \(n\)). What is the tree with the least average depth? This is equivalent to the tree that corresponds to binary search. Thus, binary search is optimal in the average case.

While binary search is indeed an optimal algorithm in the worst and average cases when searching a sorted array, there are a number of circumstances that might lead us to select another algorithm instead. One possibility is that we know something about the distribution of the data in the array. If each position in L is equally likely to hold \(K\) (equivalently, the data are well distributed along the full key range), then an interpolation search is \(\Theta(\log \log n)\) in the average case. If the data are not sorted, then using binary search requires us to pay the cost of sorting the list in advance, which is only worthwhile if many (at least \(O(\log n)\)) searches will be performed on the list. Binary search also requires that the list (even if sorted) be implemented using an array or some other structure that supports random access to all elements with equal cost. Finally, if we know all search requests in advance, we might prefer to sort the list by frequency and do linear search in extreme search distributions, or use a self-organizing list.

3.3.1.4. Interpolation and Quadratic Binary Search

If we know nothing about the distribution of key values, then we have just proved that binary search is the best algorithm available for searching a sorted array. However, sometimes we do know something about the expected key distribution. Consider the typical behavior of a person looking up a word in a large dictionary. Most people certainly do not use sequential search! Typically, people use a modified form of binary search, at least until they get close to the word that they are looking for. The search generally does not start at the middle of the dictionary. People looking for a word starting with ‘S’ generally assume that entries beginning with ‘S’ start about three quarters of the way through the dictionary. Thus, they will first open the dictionary about three quarters of the way through and then make a decision based on what is found as to where to look next. In other words, people typically use some knowledge about the expected distribution of key values to “compute” where to look next. This form of “computed” binary search is called a dictionary search or interpolation search. In a dictionary search, we search L at a position \(p\) that is appropriate to the value of \(K\) as follows.

\[p = \frac{K - \mathbf{L}[1]}{\mathbf{L}[n] - \mathbf{L}[1]}\]

This equation computes the position of \(K\) as a fraction of the distance between the smallest and largest key values. This fraction is then translated into the position that is the same fraction of the way through the array, and that position is checked first. As with binary search, the value of the key found eliminates all records either above or below that position. The actual value of the key found can then be used to compute a new position within the remaining range of the array. The next check is made based on the new computation. This proceeds until either the desired record is found, or the array is narrowed until no records are left.
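Here is a minimal Java sketch of this process (our own code and naming; 0-based indexing, integer keys sorted ascending):

```java
/**
 * Interpolation (dictionary) search sketch: probe at the position that
 * is the same fraction of the way through the array as K is of the way
 * through the current key range.
 */
static int interpolationSearch(int[] L, int K) {
    int low = 0, high = L.length - 1;
    while (low <= high && K >= L[low] && K <= L[high]) {
        if (L[low] == L[high]) {              // avoid division by zero
            return (L[low] == K) ? low : -1;
        }
        // Position at fraction (K - L[low]) / (L[high] - L[low]) of the range
        int p = low + (int) ((long) (K - L[low]) * (high - low)
                             / (L[high] - L[low]));
        if (L[p] == K) return p;              // found it
        if (L[p] < K) low = p + 1;            // K is above position p
        else high = p - 1;                    // K is below position p
    }
    return -1;                                // K is outside the remaining range
}
```

The cast to long guards against integer overflow when computing the interpolation numerator.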

A variation on dictionary search is known as Quadratic Binary Search (QBS), and we will analyze this in detail because its analysis is easier than that of the general dictionary search. QBS will first compute \(p\) and then examine \(\mathbf{L}[\lceil pn\rceil]\). If \(K < \mathbf{L}[\lceil pn\rceil]\) then QBS will sequentially probe to the left by steps of size \(\sqrt{n}\), that is, we step through

\[\mathbf{L}[\lceil pn - i\sqrt{n}\rceil], i = 1, 2, 3, ...\]

until we reach a value less than or equal to \(K\). Similarly, for \(K > \mathbf{L}[\lceil pn\rceil]\) we will step to the right by \(\sqrt{n}\) until we reach a value in L that is greater than \(K\). We are now within \(\sqrt{n}\) positions of \(K\). Assume (for now) that it takes a constant number of comparisons to bracket \(K\) within a sublist of size \(\sqrt{n}\). We then take this sublist and repeat the process recursively. That is, at the next level we compute an interpolation to start somewhere in the subarray. We then step to the left or right (as appropriate) by steps of size \(\sqrt{\sqrt{n}}\).
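Purely as an illustration, here is a rough Java sketch of the whole process (entirely our own construction: 0-based indexing, integer interpolation, clamped jumps, and a linear scan once the window is tiny):

```java
/**
 * Quadratic Binary Search sketch: interpolate a probe position, jump by
 * sqrt(n) to bracket K in a window of about sqrt(n) positions, repeat.
 */
static int qbs(int[] L, int K) {
    int lo = 0, hi = L.length - 1;
    while (lo <= hi) {
        if (K < L[lo] || K > L[hi]) return -1;   // K outside current range
        int n = hi - lo + 1;
        if (n <= 3) {                            // tiny window: just scan it
            for (int i = lo; i <= hi; i++) if (L[i] == K) return i;
            return -1;
        }
        if (L[lo] == L[hi]) return (L[lo] == K) ? lo : -1;
        // Interpolated first probe, guaranteed to land in [lo, hi]
        int probe = lo + (int) ((long) (K - L[lo]) * (n - 1) / (L[hi] - L[lo]));
        int step = (int) Math.sqrt(n);
        if (L[probe] == K) return probe;
        if (K < L[probe]) {                      // jump left by sqrt(n)
            while (probe > lo && L[probe] > K) probe = Math.max(lo, probe - step);
            lo = probe;                          // K bracketed in [probe, probe+step]
            hi = Math.min(hi, probe + step);
        } else {                                 // jump right by sqrt(n)
            while (probe < hi && L[probe] < K) probe = Math.min(hi, probe + step);
            hi = probe;                          // K bracketed in [probe-step, probe]
            lo = Math.max(lo, probe - step);
        }
    }
    return -1;
}
```

Each pass shrinks the window from \(n\) to about \(\sqrt{n}\) positions, matching the recursive squaring-down described above.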

What is the cost for QBS? Note that \(\sqrt{c^n} = c^{n/2}\), and we will be repeatedly taking square roots of the current sublist size until we find the item that we are looking for. Because \(n = 2^{\log n}\) and we can cut \(\log n\) in half only \(\log \log n\) times, the cost is \(\Theta(\log \log n)\) if the number of probes on jump search is constant.

Say that the number of probes needed on a given pass is \(i\); the cost of that pass is then \(i\) comparisons. If \(\mathbf{P}_i\) is the probability of needing exactly \(i\) probes, then the expected number of probes is

\[\begin{split}\sum_{i=1}^{\sqrt{n}} i \mathbf{P}(\mbox{need exactly $i$ probes})\\= 1 \mathbf{P}_1 + 2 \mathbf{P}_2 + 3 \mathbf{P}_3 + \cdots + \sqrt{n} \mathbf{P}_{\sqrt{n}}\end{split}\]

We now show that this is the same as

\[\sum_{i=1}^{\sqrt{n}} \mathbf{P}(\mbox{need at least $i$ probes})\]

\[\begin{split}&=& 1 + (1-\mathbf{P}_1) + (1-\mathbf{P}_1-\mathbf{P}_2) + \cdots + \mathbf{P}_{\sqrt{n}}\\&=& (\mathbf{P}_1 + ... + \mathbf{P}_{\sqrt{n}}) + (\mathbf{P}_2 + ... + \mathbf{P}_{\sqrt{n}}) +\\&& \qquad (\mathbf{P}_3 + ... + \mathbf{P}_{\sqrt{n}}) + \cdots\\&=& 1 \mathbf{P}_1 + 2 \mathbf{P}_2 + 3 \mathbf{P}_3 + \cdots + \sqrt{n} \mathbf{P}_{\sqrt{n}}\end{split}\]

We require at least two probes to set the bounds, so the cost is

\[2 + \sum_{i=3}^{\sqrt{n}} \mathbf{P}(\mbox{need at least $i$ probes}).\]

We now take advantage of a useful fact known as Chebyshev’s Inequality. Chebyshev’s inequality implies that \(\mathbf{P}(\mbox{need exactly}\ i\ \mbox{probes})\), or \(\mathbf{P}_i\), is

\[\mathbf{P}_i \leq \frac{p(1 - p)n}{(i - 2)^2 n} \leq\frac{1}{4(i-2)^2}\]

because \(p(1-p) \leq 1/4\) for any probability \(p\). This assumes uniformly distributed data. Thus, the expected number of probes is

\[2 + \sum_{i=3}^{\sqrt{n}} \frac{1}{4(i-2)^2} < 2 + \frac{1}{4}\sum_{i=1}^\infty \frac{1}{i^2} = 2 + \frac{1}{4}\cdot\frac{\pi^2}{6} \approx 2.4112\]

Is QBS better than binary search? Theoretically yes, because \(O(\log \log n)\) grows slower than \(O(\log n)\). However, we have a situation here which illustrates the limits of the model of asymptotic complexity in some practical situations. Yes, \(c_1 \log n\) does grow faster than \(c_2 \log \log n\). In fact, it is exponentially faster! But even so, for practical input sizes, the absolute cost difference is fairly small. Thus, the constant factors might play a role. First we compare \(\log \log n\) to \(\log n\).

\[\begin{split}\begin{array}{llll}&&&{\rm Factor}\\n &\log n&\log \log n&{\rm Difference}\\\hline16 &4 &2 &2\\256&8 &3 &2.7\\2^{16}&16 &4 &4\\2^{32}&32 &5 &6.4\\\end{array}\end{split}\]

It is not always practical to reduce an algorithm’s growth rate. There is a “practicality window” for every problem, in that we have a practical limit to how big an input we wish to solve for. If our problem size never grows too big, it might not matter if we can reduce the cost by an extra log factor, because the constant factors in the two algorithms might differ by more than the log of the log of the input size.

For our two algorithms, let us look further and check the actual number of comparisons used. For binary search, we need about \(\log n - 1\) total comparisons. Quadratic binary search requires about \(2.4 \log \log n\) comparisons. If we incorporate this observation into our table, we get a different picture about the relative differences.

\[\begin{split}\begin{array}{llll}&&&{\rm Factor}\\n &\log n -1&2.4 \log \log n&{\rm Difference}\\\hline16&3&4.8&{\rm worse}\\256&7&7.2&\approx {\rm same}\\64K&15&9.6&1.6\\2^{32}&31&12&2.6\end{array}\end{split}\]

But we still are not done. This is only a count of raw comparisons. Binary search is inherently much simpler than QBS, because binary search only needs to calculate the midpoint position of the array before each comparison, while quadratic binary search must calculate an interpolation point, which is more expensive. So the constant factors for QBS are even higher.

Not only are the constant factors worse on average, but QBS is far more dependent than binary search on good data distribution to perform well. For example, imagine that you are searching a telephone directory for the name “Young”. Normally you would look near the back of the book. If you found a name beginning with ‘Z’, you might look just a little ways toward the front. If the next name you find also begins with ‘Z’, you would look a little further toward the front. If this particular telephone directory were unusual in that half of the entries begin with ‘Z’, then you would need to move toward the front many times, each time eliminating relatively few records from the search. In the extreme, the performance of interpolation search might not be much better than sequential search if the expected distribution of key values is badly estimated.

While it turns out that QBS is not a practical algorithm, this is not a typical situation. Fortunately, algorithm growth rates are usually well behaved, so that asymptotic algorithm analysis nearly always gives us a practical indication for which of two algorithms is better.
