SELECT *: A Query Shortcut or a Database Nightmare in MySQL?

Rahul Srivastava

Jun 29, 20233 min read

In the vast world of MySQL, developers and database administrators often encounter the ubiquitous SELECT * query. At first glance, it seems like a convenient shortcut to retrieve all columns from a table with a single command. However, what appears to be a time-saving solution can quickly turn into a database nightmare. In this blog post, we will delve into the hidden dangers and potential pitfalls of using SELECT * in MySQL. We'll explore why this seemingly innocent query can wreak havoc on performance, scalability, and maintainability.

According to Wikipedia a database is an organized collection of data stored and accessed electronically. Generally databases store data on filesystems. Now data is stored and read in filesystems as blocks or pages. Hence we can say databases also store their data in blocks. Different database may have different block sizes which can be easily configured. Each block can contain mutiple rows of a database table. So whenever we query the database to fetch some rows the database actually read these memory blocks from disk in order to retrive the data. But since we are reading the entire data block at once and block contains the entire row data so we are eventually fetching all the columns of the rows even though we query just for a single column. So whats the fuss of not using select * when we are already getting the complete data. Let us visualize this from a little closer.

Serialization and Deserialization

As we know data is stored in the filesytem in the form of bits. In order to return the result in the form expected by the user the DBMS has to interact with OS to convert this stream of bits into some structured data which can be returned to the user. Modern systems dont have much trouble doing this serialization and deserialization but at scale this does contribute to the response time of the sql query.

Handling Off-record Columns

Another key feature of MySQL is that it does not store columns with large data inline. Inline here means in the same block along with other columns of the row. For better optimizations DBMS stores large columns in separate blocks of the disk and saves the location of the data inline. This helps in storing more number of rows in a single block. But when we fire a select * query then the DBMS needs to go and fetch all those blocks where this large data is written which increases IO and more IO results in increased query response time.

Forget about Index-Only Scans

Some database engines have the provision of declaring a non-key column while creating an index on the column. So lets say we have a table for movie data in which the index is created on the release_year and movie_name is the non-key. Lets say you wanted to get all movie names which were released in the year 2015. This information is available in the index itself and hence the DBMS needs to load just the blocks containing index which will be very few.

But as we do select * the DBMS can no longer take the advantage of index containing blocks only. It has to load all those blocks which contains the rows satifying the given condition. This ultimately slows up the query execution.

Network Latency

Once the data is fetched from the disk blocks and serialized by the CPU they are now sent to the client over the network through TCP/IP. Data is sent in the form of segments. More data means more segments need to pass on the network. This increases the cost of transmission and thereby increases the latency cost.

Conclusion

Its always adviced to declare the columns while doing a select query. The above points suggests that select * has many complex activities which greatly increse the IO costs and hence we must be very careful and must use select * only when its required or when the table structure is simple and with less columns.