On The Line: Query Performance

In an earlier post I mentioned that I do a lot of SQL work at my present job. In the course of this work I often have to form complex queries using sub selects in joins. Subs-selects have lots of uses but they come with their own issues.

Consider the following:

SELECT c.custaccount

, c.custaddress

, o.ordid

FROM customers c

JOIN (SELECT

MAX(o.orderid) ordid, o.custaccount

FROM Orders o

WHERE o.orderdate BETWEEN

'01-01-2010' AND '03-01-2010'

GROUP BY o.custaccount) lastorder ON lastorder.custaccount = c. custaccount

This is a pretty standard query, basically we're pulling all customers who had an order between Jan-1-2010 and March-1-2010 and further we're getting the orderid of the customers last order in that time frame. This is a useful query, but isn't the best performing. In cases where the number of records in either table are fairly small there's no real reason to re-write this query.

BUT if the customer table has millions of records with lots of orders per customer this query will take considerable time to run. what can be done in this case? Keep in mind that in queries with sub-selects each select through the customer table (in this example) will require the sub-select query to select over the entire order table grouping by the custaccount. This means the query will re-query the orders table over and over again. How can this be done better?

Consider this:

SELECT

MAX(o.orderid) ordid, o.custaccount

INTO #t1

FROM Orders o

WHERE o.orderdate BETWEEN

'01-01-2010' AND '03-01-2010'

GROUP BY o.custaccount

SELECT c.custaccount

, c.custaddress

, o.ordid

FROM customers c

JOIN #t1 t on t.custaccount = c. custaccount

DROP TABLE #t1

This query is much faster by virtue of the fact that we first pull the MAX order id, grouping by the account between '01-01-2010' AND '03-01-2010' into a temporary table, THEN we perform the select. This SQL is much faster because we are not constantly re-querying the same table over and over again. We can clean this up even more by using a CTE. Use of the CTE removes the need to drop the temp table.

WITH maxorder (OrderId, CustomerAccount)

(

SELECT

MAX(o.orderid) ordid, o.custaccount

INTO #t1

FROM Orders o

WHERE o.orderdate BETWEEN

'01-01-2010' AND '03-01-2010'

GROUP BY o.custaccount

)

SELECT c.custaccount

, c.custaddress

, o. OrderId

FROM customers c

JOIN maxorder o on o. CustomerAccount = c. CustomerAccount

If you do any business development at all odds are you have to create, read, update and delete data (CRUD) out of a database of some type.

At my present job we write a lot of T-SQL. In my two years there I've been lucky enough to learn some nice tricks from some really talented SQL coders who have graciously shared their SQL insights with me. In this post I'd like to cover three simple tricks to keep in mind when writing SQL queries.

I) Queries with both inner and outer joins

Consider:

SELECT * FROM tblCustomer c

LEFT OUTER JOIN tblPhones p ON p.CustomerID = a.CustomerID

INNER JOIN tblAddress a ON c.CustomerID = a.CustomerID

WHERE p.PhoneNumber IS NULL AND a.State = 'OH'

This is a pretty standard query when you want to find all (in this case) customers in Ohio without phone numbers. This query will work, but its not very well written. Consider the order of the joins - in this query we have two joins - an outer and an inner one with the outer join first in the sequence. What will happen is the following.

So say tblCustomer has 1000 customers, tblPhones has 2000 phone records, we end up pulling a lot of data up front before we start limiting the size of the dataset. Its not until we inner join to tblAddress, with 300 addresses that we start trimming down the data. The criteria of the where clause further limits the dataset to the 50 addresses in Ohio and then to the 10 customers in Ohio don't have a phone.

Consider this version:

SELECT * FROM tblCustomer c

INNER JOIN tblAddress a ON c.CustomerID = a.CustomerID

LEFT OUTER JOIN tblPhones p ON p.CustomerID = a.CustomerID

WHERE p.PhoneNumber IS NULL AND a.State = 'OH'

So now tblCustomer still has 1000 customers, but the inner join with tblAddress now immediately restricts the data set to 300, pulling a lot less data up front. The outer join still pulls 2000 phones and the where clause will then further restrict the data set to 10. As you can see this query is much better because we pull less data up front. By moving all inner joins above any outer joins will greatly decrease query time and make the query much more efficient.

II) Move Your Criteria Out of the WHERE clause.

Consider again:

SELECT * FROM tblCustomer c

INNER JOIN tblAddress a ON c.CustomerID = a.CustomerID

LEFT OUTER JOIN tblPhones p ON p.CustomerID = a.CustomerID

WHERE p.PhoneNumber IS NULL AND a.State = 'OH'

This is better than where we started from, but we can do better. Note that in the where condition we have the following: AND a.State = 'OH'. This criteria targets the tblAddress table, but does it have to be in the WHERE statement?

Consider now:

SELECT * FROM tblCustomer c

INNER JOIN tblAddress a ON c.CustomerID = a.CustomerID AND a.State = 'OH'

LEFT OUTER JOIN tblPhones p ON p.CustomerID = a.CustomerID

WHERE p.PhoneNumber IS NULL

By moving the State criteria to the join the following happens. tblCustomer still has 1000 customers, but now the inner join and the State = 'OH' criteria with tblAddress now restricts the data set to 10 instead of 300. Again this query is much better because we pull much less data up front. By moving what criteria you can out of the where clause you will greatly decrease query time.

III) Do you really need to SELECT * from those tables?

Many developers will write SELECT * on select queries when they really need only a few fields from the tables they are querying. If you limit SELECTs to just that data you want to retrieve, you can increase your query's performance, particularly when you query tables whit a lot of columns.

So to recap:
-Inner Joins before Outer Joins.

-Move as much of your queries criteria to the joins as possible.

-No SELECT * if you can help it.

Simple tricks, but applying them can make a big difference.

My thanks to my friends Alan Huffman and John Anderson who have been unstinting with SQL knowledge. Check out Alan's Blog Southern BITS of Technical Goodness.

On The Line

Monday, March 8, 2010

More Fun With SQL – Sub-Selects & Using Temp Tables to Make Sub-Selects Faster

Wednesday, February 24, 2010

Three T-SQL Tricks Courtesy of Some Friends

About Me

Followers

Blog Archive