There are some T-SQL best practices that I have been hearing about for so long that it surprises me when I run into people who don’t know about them. Then I remember “every single person in the world starts out with absolutely zero knowledge about SQL Server“. With that in mind I have decided to document some of the things I have been seeing a lot of recently. I have been working on a project refactoring T-SQL for the past several months and one of the patterns I replace without thinking about it is
Why is SELECT * such a bad practice? The first reason seems intuitive to me. You don’t know what you are getting. For a stable application this may not seem to matter, but we all know that things change. If the schema of the underlying table changes then the results returned by select * will suddenly be different. By explicitly listing the columns in a query you know that an application won’t be broken by a column getting added or an unused column getting removed. In addition it lets new developers on the project see what is actually being returned without needing to reference the schema every five minutes. Having readable code is reason enough in my opinion, but some people need convincing so I have two more reasons.
The second reason is also easy to see when pointed out. Don’t get more data than you need. If you have a table with twenty columns and you only need two, selecting all of the columns brings back 18 extra columns of data. If one of those columns is a large data type like a BLOB used for storing images or just large chunks of text then you are bringing back a lot more data than you need.
A quick example shows how much extra data could be brought back. If we had a table defined like this:
CREATE TABLE FizzBuzz_big
( FizzBuzz_Value INT PRIMARY KEY NOT NULL
,FizzBuzz_Text CHAR(8) NOT NULL
,FizzBuzz_BigInt BIGINT NOT NULL
,FizzBuzz_BigChar CHAR(5096) NOT NULL )
and we only wanted the first two columns, Fizzbuzz_value and Fizzbuzz_text
SELECT Fizzbuzz_value, Fizzbuzz_text FROM FizzBuzz_big
we would be retrieving 16 Bytes of data (4 for the integer, 8 for the char, 4 for a row header). If instead we select all the columns
SELECT * FROM FizzBuzz_big
we would be retrieving 5120 Bytes (5 MB) of data.
For simplicity I avoided variable length columns, but with a varchar column you would want to assume the possibility of all the bytes being used. Which means if I had included a varchar(max) column that is a potential extra 2GB per record.
The third reason SELECT * is a bad thing when querying a SQL Server database is a little more involved. You lose the ability for the query optimizer to use a covering index to retrieve the data. In the example above the query plans look identical. Both queries do a scan of the clustered index on FizzBuzz_Value.
But if we add an index that covers just the columns we want.
CREATE INDEX idx_JustWhatINeed ON FizzBuzz_big (FizzBuzz_Value)
The query plan for just the columns we want is able to do a scan of the new nonclustered index and becomes much more efficient. Your mileage may vary depending on how many rows are in the table, but this is what I got with 100,000 rows on the table.
I admit that in a real world scenario you are probably going to have a good idea of what is getting returned from select *. By the time code using select * is in production it has been tested and everyone on the project probably knows the data. I also admit that most likely the table schema isn’t going to change and if there are large columns on the table you might see some performance hits in testing and be more careful about selecting your columns. If things do change you will probably have other code refactoring to do for your application, but none of that is a good reason not to change select * into the columns you actually want. It is such an easy proactive thing to type in the column names – why wouldn’t you?