Blog

Care, for your database

← Blog

Don’t (Always) Use The Index

8 min read

Takeaway: Indexes are incredible, but not magical. There are times when they should not be used. Forcing a database to use one when it doesn’t naturally want to could cause performance to get worse.

Why it matters: The go-to advice when there is a query performance problem is ‘use an index’. While this is often correct, it is not always the solution. Having a basic understanding of when indexes are unable to improve the performance of a query will shift the focus toward more actionable steps.


A Database Secret

Here’s one of the best kept secrets about databases:

Sometimes there is no index that will speed up a query.

On the surface, this may sound ridiculous if not down right blasphemous. Indexes can speed up a query by several orders of magnitude and the general guidance whenever a slow query is encountered is to Use The Index. Indeed that is the proper rule of thumb and the same advice provided here. That doesn’t mean that it is always the solution. Let’s take a look at why that is.

Example Boolean Query

Consider the following query:

With conventional wisdom in mind, the first response after identifying this slow query may be to create the index of (active). After doing so we observe that the optimizer is choosing not to use it! What is going on?

Given that index, the two basic approaches the database could pursue in order to execute this query would be:

  1. Scan only the portion of the (active) index which contains true values and then fetch the associated records from the table
  2. Scan every record in the table discarding all of those which do not contain a true value for the active column

First Approach (Index Scan)

Let's confirm that the first approach, which scans an index, is doing what we expect:

Here, and across the site, we use IndeX-Ray™ to provide imaging from the database. This technology allows us to see how the database actually uses an index to find data. Check out the interactive diagrams here in the article directly or open any of them in the playground directly to explore (and modify) them further.

Above we can see that the database was able to successfully limit its search to only the records of interest to the client (those that match active=true). This is great! The database did not have to waste any effort scanning excess records that the client was not interested in.

So why does the database seem to disagree and choose the other approach instead?

Second Approach (Table Scan)

Let's take a look at the second option that the database has, which is to scan the full table:

Be sure to click the play button to see the database in action!

This second approach of scanning the entire table was less efficient than the first in the sense that all records, including those which are not active, were scanned. But looking at this fact in isolation is misleading and may cause us to try to solve the problem in the wrong way.

We can explore this further by comparing the two approaches more directly. Check out the extra steps (work) that the database does when scanning the index:

Unintuitively and even though it is "more efficient", using the index is slower than foregoing it and scanning the records in the table directly! This means that the optimizer's decision to avoid using the index is correct as forcing it to use the index by hinting it makes the query slower.

Result Set Size Matters

This begs the question:

What is happening and are indexes useless?

The reason for the behavior above comes down to what an index is and how it can (and cannot) be used. Keep in mind that it is not free or instantaneous to traverse through an index to find a piece of information. The cost to use an index can be relatively small when compared to searching through a full and unordered set of records1 in a table, but there is additional overhead nonetheless. In general the smaller the result set is the faster and more desirable using an index will be.

We can demonstrate this point by flipping our query around to check for false values instead:

Now the query executes more efficiently (and faster) by using the index than scanning the table. This simple example helps demonstrate that the same query can have different optimal execution plans that are dependent on the values used in the query and the data in the table.

The change in performance here is associated with a ‘tipping point’. This particular tipping point is when the size of the result set crosses a threshold relative to the total size of the table. Our latest example above contains a result set for false records that represents a pretty large 22% of the total records. Even so using the index is was still faster at that point. Somewhere between 22% and 78% the result set would cross the threshold where the table scan ends up being faster in this simulation. Use the link in the diagram above to head over to the playground and try to find what that thershold is!


Summary

If the database is choosing to execute a query without using an index, it is doing so for some specific reason. It is generally a mistake to force the database into using some other index or plan without better understanding the reasoning for the optimizer’s plan selection in the first place.

Using the information available to it, the database will select a plan that it expects is the most efficient for any given operation. That does not mean that any of the plans available to it are efficient. Sometimes there simply is no ‘good’ (cheap) plan. If the optimizer is electing to perform a collection scan it is possible that there is no better plan available to it. What steps can or should be taken to improve performance depend on the specifics of the situation.

Need to go deeper?

  • Use IndeX-Ray™ to get a free checkup of your index, no waiting required.
  • Check out how indexes can return data directly.
  • Subscribe to the newsletter to hear when new articles like this are published and to keep up to date with other exciting announcements.

Stay updated


Footnotes

  1. Unordered with respect to the query predicate(s) that we are searching on. Regardless of storage for the data, most databases have some sort of internal identifier for records that monotonically increases. That identifier is unrelated, and therefore unhelpful, when it comes to finding records that match filters on other arbitrary column values.