The Role of Indexing in Database Performance: How to Speed Up Your Queries
Ever heard of database indexes and wondered what they are? Well, in this article, we'll dive deep into the concept of indexing—exploring what indexes are, how they work, and why they're essential for optimizing database performance.
Indexing in a database is a data structure technique used to improve the speed and efficiency of data retrieval operations.
It works by creating a separate data structure that stores a copy of specific columns from a table, along with pointers to the corresponding records. This allows the database management system to quickly locate and access the desired data without scanning the entire table.
Indexes are particularly useful for large tables and frequently queried columns. They can significantly reduce query execution time, especially for operations like searching, sorting, and joining tables. However, indexes also come with some trade-offs. They require additional storage space and can slow down data modification operations (inserts, updates, and deletes) since the index structure must be updated along with the table data.
Indexes are most efficient when read operations outweigh write operations, so it's important to keep that in mind when deciding whether to use them. Database administrators need to carefully consider which columns to index based on the specific needs of their applications and query patterns.
Database systems offer a variety of index types, each optimized for specific data structures and query patterns. While the landscape of indexing is diverse and complex, this article will concentrate on the most fundamental and widely used type: the B-tree index. B-tree indexes serve as the backbone of database performance optimization and are implemented as the default index structure in many relational database management systems (RDBMS) like PostgreSQL, MySQL, and Oracle. By focusing on this essential index type, we'll establish a solid foundation for understanding how indexes improve query performance and data retrieval efficiency. As you progress in your database knowledge, you'll encounter other specialized index types such as hash indexes, bitmap indexes, and full-text indexes, each designed to excel in particular scenarios.
Index Analogy: The Phone Book
Imagine a phone book without any alphabetical order. To find someone's number, you'd have to scan through every page. This is like a database table without an index. Now, picture a phone book organized alphabetically by last name. You can quickly flip to the right section to find a person. This is similar to how an index works in a database.
Database Indexing Example
-- Create a table for storing customer information
CREATE TABLE customers (
id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
registration_date DATE
);
-- Insert some sample data
INSERT INTO customers VALUES
(1, 'John', 'Doe', 'john.doe@example.com', '2023-01-15'),
(2, 'Jane', 'Smith', 'jane.smith@example.com', '2023-02-20'),
(3, 'Bob', 'Johnson', 'bob.johnson@example.com', '2023-03-10');
-- Query without index
SELECT * FROM customers WHERE last_name = 'Smith';
-- Create an index on the last_name column
CREATE INDEX idx_last_name ON customers(last_name);
-- Query with index (same query, but now it will use the index)
SELECT * FROM customers WHERE last_name = 'Smith';
We start by creating a customers
table with columns for id
, first_name
, last_name
, email
, and registration_date
. After inserting some sample data, we run a query to find customers with the last name 'Smith'. Without an index, the database would need to perform a full table scan, checking each row individually to find the match.
To optimize this, we create an index on the last_name
column. This index helps the database by creating a separate data structure that organizes the last_name
values in a way that allows for quicker lookups. It also stores pointers to the corresponding rows in the table.
When we run the same query again, the database uses the index to locate 'Smith' much faster by skipping the full scan and going directly to the relevant rows.
Key points to understand:
- Indexes are most useful for columns that are frequently used in
WHERE
clauses,JOIN
conditions, or for sorting (ORDER BY
). - Not every column needs an index. Over-indexing can slow down write operations and consume additional storage.
- Deciding which columns to index depends on your specific queries and data access patterns.
- Indexes are particularly beneficial for large tables, while for small tables, the overhead of maintaining an index may not provide much benefit.
- Some databases automatically create indexes for primary key columns. (serial in PostgreSQL)
Index as data structure
- Root Node (Blue):
- The top-level node containing the value 50.
- It divides the tree into two main sections: values less than 50 and values greater than or equal to 50.
- Internal Nodes (Green):
- The second level of nodes containing multiple values (20, 35) and (70, 85).
- These further divide the tree into smaller sections.
- Leaf Nodes (Yellow):
- The bottom level of nodes containing the actual indexed values.
- Each leaf node points to the actual data in the table.
- Pointers:
- Represented by lines connecting the nodes.
- In a real B-tree, these would be memory addresses pointing to child nodes or data records.
- Data Pointers:
- Indicated by "→ Data" text below leaf nodes.
- In a real index, these would point to the actual rows in the table.
How it works:
- When searching for a value, the database starts at the root node.
- It compares the search value with the values in the node and decides which branch to follow.
- This process repeats at each level until it reaches a leaf node.
- In the leaf node, it either finds the value or determines it doesn't exist.
- If found, it uses the data pointer to retrieve the full row from the table.
For example, if searching for the value 30:
- Start at root (50): 30 < 50, so go left.
- Check second level (20, 35): 20 < 30 < 35, so go to middle child.
- Reach leaf node (25, 30): Find 30 and use its data pointer to fetch the full record.
This structure allows the database to quickly narrow down the search space, making queries much faster than scanning the entire table, especially for large datasets.
Lets do some benchmarking !
We will create the users
table.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
firstname VARCHAR(100) NOT NULL,
lastname VARCHAR(100) NOT NULL
);
and generate 1 million random users
postgres=# DO $$
DECLARE
i INT;
user_email TEXT;
firstname TEXT;
lastname TEXT;
BEGIN
FOR i IN 1..1000000 LOOP
-- Generate random first and last names
firstname := LOWER(SUBSTRING(md5(random()::text), 1, 8)); -- 8-character random string
lastname := LOWER(SUBSTRING(md5(random()::text), 1, 8));
-- Generate random email
user_email := LOWER(SUBSTRING(md5(random()::text), 1, 8)) || '@example.com';
-- Insert the generated email, first name, and last name, and skip duplicates
INSERT INTO users (email, firstname, lastname)
VALUES (user_email, firstname, lastname)
ON CONFLICT (email) DO NOTHING;
END LOOP;
END $$;
Now lets check if we have 1 million users
postgres=# select count(*) from users;
count
---------
1000000
(1 row)
postgres=#
Lets check the size of it
postgres=# SELECT pg_size_pretty(pg_total_relation_size('users'));
pg_size_pretty
----------------
145 MB
(1 row)
postgres=#
Let's query one user and explain analyze the query before creating an index on firstname
, since id
has an index, and email
is unique, so it also has an index on it.
first we find the candidate
postgres=# select * from users where id=847567;
id | email | firstname | lastname
--------+----------------------+-----------+----------
847567 | 5851d61a@example.com | c004a390 | 041fd35f
(1 row)
and let's do a query:
postgres=# EXPLAIN ANALYZE SELECT * FROM users WHERE firstname = 'c004a390';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..15552.88 rows=1 width=43) (actual time=51.377..54.493 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on users (cost=0.00..14552.78 rows=1 width=43) (actual time=40.495..42.406 rows=0 loops=3)
Filter: ((firstname)::text = 'c004a390'::text)
Rows Removed by Filter: 333298
Planning Time: 0.089 ms
Execution Time: 54.524 ms
(8 rows)
postgres=#
The EXPLAIN ANALYZE output reveals that PostgreSQL is performing a Parallel Sequential Scan on the users table. This is evident from the "Parallel Seq Scan on users" line, indicating that the database is scanning the entire table across multiple worker processes (2 workers launched). The execution time of 54.524 ms represents the total time taken to complete the query. Notably, the scan processed a total of 999,894 rows (333,298 * 3 loops) to find the single matching row, suggesting that creating an index on the firstname column could significantly improve query performance for this type of search.
Let's create an index now
CREATE INDEX idx_users_firstname ON users(firstname);
lets check the size now after creating and index.
postgres=# SELECT pg_size_pretty(pg_total_relation_size('users'));
pg_size_pretty
----------------
175 MB
(1 row)
postgres=#
We can see an increase in storage usage of approximately 20% with a total difference of 30 MB. And let's repeat the query now:
postgres=# EXPLAIN ANALYZE SELECT * FROM users WHERE firstname = 'c004a390';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_users_firstname on users (cost=0.42..8.44 rows=1 width=43) (actual time=0.071..0.076 rows=1 loops=1)
Index Cond: ((firstname)::text = 'c004a390'::text)
Planning Time: 0.100 ms
Execution Time: 0.104 ms
(4 rows)
postgres=#
After creating an index on the firstname column, the query execution time dramatically improved from 54.524 ms to just 0.104 ms - a speedup of over 500 times. The query plan now shows an "Index Scan using idx_users_firstname on users" instead of a Parallel Sequential Scan, indicating that PostgreSQL is utilizing the newly created index. This allows the database to quickly locate the matching row without scanning the entire table, resulting in significantly faster query execution and demonstrating the powerful impact of proper indexing on query performance.
In this exploration of managing a semi large dataset in PostgreSQL, we successfully inserted one million random user records, demonstrating the system's capability to handle substantial volumes of data. After the insertion, the total size of the users table reached 145 MB, which gives an idea of the space required for such a dataset.
We didn’t just count the records; we also looked at query performance. Before creating an index on the firstname column, querying for a specific name resulted in a slower execution time of around 54.5 milliseconds. This delay was primarily due to the need to scan through a large number of rows.
After creating the index on the firstname column, we saw an increase in the size of the users table to 175 MB—an increase of about 20%. However, this was a small trade-off considering the improvement in query performance. The execution time for the same query dropped significantly to just 0.104 milliseconds, showing how effective indexing can be.
This comparison illustrates the benefits of indexing in database management. By optimizing data retrieval with indexes, we can enhance application performance. As databases grow larger, understanding how to implement effective indexing strategies becomes increasingly important for developers and database administrators.
In summary, this experiment highlights the mechanics of data generation and indexing in PostgreSQL while underscoring the value of indexing for improving query performance and overall user experience.