May 1, 2008

SQL [NOT] EXISTS Goes Zoom

PostgreSQL - Web/Internet | Trackback | del.icio.us | digg

I use fairly vanilla SQL most of the time so I was shocked to read this post on the PostgreSQL list when I ran into some slowness with a NOT IN query. While my experience is with Postgres, apparently this works with most databases.

If you have a NOT IN with a subquery that returns lots of rows, an EXISTS or NOT EXISTS will usually be faster. And not just a little bit. Like by a factor of 200:1 in my NOT EXISTS case! I am doing batch modification to a new schema and used this query to get a count of the remaining records while I was running my conversion scripts:

SELECT count(*) as total FROM tblLookupAuthenticateOld WHERE uidMember NOT IN (SELECT uidMember FROM tblLookupAuthenticate)

The EXPLAIN looks like:

"Aggregate (cost=9033491.00..9033491.01 rows=1 width=0)" " -> Seq Scan on tbllookupauthenticateold (cost=1772.98..9033453.73 rows=14910 width=0)" " Filter: (NOT (subplan))" " SubPlan" " -> Materialize (cost=1772.98..2304.14 rows=29816 width=36)" " -> Seq Scan on tbllookupauthenticate (cost=0.00..1510.16 rows=29816 width=36)"

Look at that cost: 9033491.00!! This took about 100 seconds to run on my laptop. It’s because the query is effectively doing a sequential scan for every record in the first table. Ugh. Then I swapped it to NOT EXISTS:

SELECT count(*) as total FROM tblLookupAuthenticateOld WHERE NOT EXISTS (select * FROM tblLookupAuthenticate WHERE uidMember = tblLookupAuthenticateOld.uidMember)

The EXPLAIN looks like:

"Aggregate (cost=247542.82..247542.83 rows=1 width=0)" " -> Seq Scan on tbllookupauthenticateold (cost=0.00..247505.54 rows=14910 width=0)" " Filter: (NOT (subplan))" " SubPlan" " -> Index Scan using tbllookupauthenticate_pkey1 on tbllookupauthenticate (cost=0.00..8.28 rows=1 width=298)" " Index Cond: (uidmember = $0)"

This query ran in about 500ms. That’s a “decent” improvement of 200:1. I’m planning to run a search in Eclipse tonight for “NOT IN” to see if there are any places where NOT IN once made sense but not [NOT] EXISTS is more appropriate as the subquery returns more and more rows.

Orange is my favorite color

SQL [NOT] EXISTS Goes Zoom

PostgreSQL - Web/Internet | Trackback | del.icio.us | digg

1 Comment

Andrew said:

Recent

Archives

Topics

Subscribe to Posts

Subscribe via Email