I use fairly vanilla SQL most of the time so I was shocked to read this post on the PostgreSQL list when I ran into some slowness with a NOT IN query. While my experience is with Postgres, apparently this works with most databases.
If you have a NOT IN with a subquery that returns lots of rows, an EXISTS or NOT EXISTS will usually be faster. And not just a little bit. Like by a factor of 200:1 in my NOT EXISTS case! I am doing batch modification to a new schema and used this query to get a count of the remaining records while I was running my conversion scripts:
SELECT count(*) as total
FROM tblLookupAuthenticateOld
WHERE uidMember NOT IN (SELECT uidMember FROM tblLookupAuthenticate)
The EXPLAIN looks like:
"Aggregate (cost=9033491.00..9033491.01 rows=1 width=0)"
" -> Seq Scan on tbllookupauthenticateold (cost=1772.98..9033453.73 rows=14910 width=0)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Materialize (cost=1772.98..2304.14 rows=29816 width=36)"
" -> Seq Scan on tbllookupauthenticate (cost=0.00..1510.16 rows=29816 width=36)"
Look at that cost: 9033491.00!! This took about 100 seconds to run on my laptop. It’s because the query is effectively doing a sequential scan for every record in the first table. Ugh. Then I swapped it to NOT EXISTS:
SELECT count(*) as total
FROM tblLookupAuthenticateOld
WHERE NOT EXISTS (select * FROM tblLookupAuthenticate WHERE uidMember = tblLookupAuthenticateOld.uidMember)
The EXPLAIN looks like:
"Aggregate (cost=247542.82..247542.83 rows=1 width=0)"
" -> Seq Scan on tbllookupauthenticateold (cost=0.00..247505.54 rows=14910 width=0)"
" Filter: (NOT (subplan))"
" SubPlan"
" -> Index Scan using tbllookupauthenticate_pkey1 on tbllookupauthenticate (cost=0.00..8.28 rows=1 width=298)"
" Index Cond: (uidmember = $0)"
This query ran in about 500ms. That’s a “decent” improvement of 200:1. I’m planning to run a search in Eclipse tonight for “NOT IN” to see if there are any places where NOT IN once made sense but not [NOT] EXISTS is more appropriate as the subquery returns more and more rows.
Andrew said:
on May 16, 2008 at 7:04 am
The key here is “lots of rows”.
I’ve sometimes found that using a NOT IN is faster when I thought it would have been slower.