What is an Index?
Databases are able to quickly retrieve information seemingly regardless of how large they become. While they have many techniques that help them do so, none of them would be useful without indexes powering them underneath. This article takes a brief look at what an index is and why they can be so helpful for databases.
For this demonstration, consider the following data:
ID | name | total |
---|---|---|
1 | Bob | 0 |
2 | Carol | 9 |
3 | Alice | 3 |
4 | Bob | 2 |
5 | Carol | 1 |
6 | Bob | 4 |
7 | Carol | 2 |
Our user now wants to find a specific piece of information, namely the item that has a name
of 'Bob'
and a total
of 2
. In SQL, this would be expressed as:
Without any indexes, the data will simply be sitting unorganized in the database. So in order for it to find the requested information, it must sequentially scan the records filtering for the values of interest:
Here, and across the site, we use IndeX-Ray™ to provide imaging from the database. This technology allows us to see how the database actually uses an index to find data. Check out the interactive diagrams here in the article directly or open any of them in the playground directly to explore (and modify) them further.
What we see above is that the database spent time scanning all 7 records in the table but ended up discarding 6 of them since they did not match the query. Also notice here that even after the database found the single matching record it still had to continue its scan through the rest of the table. This is because it didn't know that there weren't any other matching records that it might find later which the user would also be interested in. When looking through unorganized data, the items of interest could be anywhere.
Now let's create an index on the name
column and see how things look:
Be sure to click the play button to see the database in action!
Now that we have an index on the name
column, the database is able to take a much different approach to finding the record of interest. No longer is it searching through all 7 records in the table, instead it focuses on only the 3 records which have a name
of 'Bob'
. It does this by walking the index 'down and over':
- An index scan always starts at the
root
node of the tree and goes down to the specific section of the index where the first item of relevance to the query could be found. - Once it reaches this bottom layer, which are known as leaf nodes, the database can then walk directly between neighboring nodes until it exits the section of relevance. In this example the database will scan along subsequent leaf nodes for those which contain a
name
of'Bob'
stopping once it encounters the'Carol'
node. Since the index is well-ordered the database can safely stop scanning for more records at this point knowing that there can be no other relevant records outside of this small section of the index.
We see one other interesting thing above which is that all 3 of the 'Bob'
records were scanned even though only 1 matched the query and belonged in the results. This is related to the fact that our query also included the following condition:
Since the total
column is not part of this index, the database does not know what value the records have for that column without accessing them directly. All it can tell from reading the leaf nodes of the index is that there are some candidate records which have the correct value for name
('Bob'
in this example), but this index cannot say with certainty whether the record will match all of the query conditions or not. This is why the database had to scan the records which had id
s of 1
and 4
even though it discarded them since they were not part of the result set.
You could similarly create an index on the total
field instead, which would look as follows:
As before, only one of the two columns being filtered on is present in the index, so it ends up scanning an extra record. It is a different extra record than before though, as it is id: 6
which is a candidate based on its value of 2
for the indexed total
column. How effective a single column index is depends on how selective the values in the column are as compared to any other filters that may also be present in the query.
But indexes are not just limited to a single column! Let's try creating a compound (also called multicolumn or composite) index using both of the columns from the WHERE
clause:
Now the leaf nodes contain both of the columns being filtered against, so the database can scan just the single matching record. This is very efficient and can help the database scale very well over time.
Need to go deeper?
- Use IndeX-Ray™ to get a free checkup of your index, no waiting required.
- Check out how indexes can return data directly.
- Subscribe to the newsletter to hear when new articles like this are published and to keep up to date with other exciting announcements.