Re: Performance tips

From: Andrew Perrin <andrew_perrin(at)unc(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Clift <justin(at)postgresql(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: Performance tips
Date: 2002-01-10 19:41:44
Message-ID: Pine.LNX.4.21L1.0201101439510.404-100000@hm269-26876.socsci.unc.edu
Lists: pgsql-general

There probably were 5.7 million rows in the subquery result - the DISTINCT
helped, but not enough.

I used your (Tom Lane's) rewritten query, though, which was EXPLAINed as
hugely less expensive:

auth=# explain select count(patternid)
auth-# from patterns,
auth-# (select distinct o_patternid from letters, pattern_occurrences where
auth(# letters.letterid = pattern_occurrences.o_letterid
auth(# and letters.datecat in (1,2)) AS ss
auth-# where patternid = ss.o_patternid;
NOTICE: QUERY PLAN:

Aggregate  (cost=4486533.96..4486533.96 rows=1 width=8)
  ->  Merge Join  (cost=4322005.65..4485096.53 rows=574973 width=8)
        ->  Sort  (cost=2362674.85..2362674.85 rows=12472297 width=4)
              ->  Seq Scan on patterns  (cost=0.00..259225.97 rows=12472297 width=4)
        ->  Sort  (cost=1962135.80..1962135.80 rows=574973 width=12)
              ->  Subquery Scan ss  (cost=1882121.28..1896495.61 rows=574973 width=12)
                    ->  Unique  (cost=1882121.28..1896495.61 rows=574973 width=12)
                          ->  Sort  (cost=1882121.28..1882121.28 rows=5749731 width=12)
                                ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
                                      ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
                                      ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
                                            ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)

and it finished in about 12 minutes - far more manageable.
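For anyone following along at home, the equivalence of the original IN form and
Tom's DISTINCT-join rewrite can be sanity-checked on a toy version of the schema
(table and column names are from the thread; the sample rows are invented, and
sqlite3 stands in for PostgreSQL here):

```python
import sqlite3

# Toy stand-ins for the thread's tables; the data below is made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patterns (patternid INTEGER);
CREATE TABLE letters (letterid INTEGER, datecat INTEGER);
CREATE TABLE pattern_occurrences (o_letterid INTEGER, o_patternid INTEGER);
INSERT INTO patterns VALUES (1),(2),(3),(4);
INSERT INTO letters VALUES (10,1),(11,2),(12,3);
INSERT INTO pattern_occurrences VALUES (10,1),(10,2),(11,2),(12,4);
""")

# The original slow form: IN over a (possibly duplicate-heavy) subquery.
in_form = con.execute("""
    SELECT count(patternid) FROM patterns
    WHERE patternid IN (
        SELECT o_patternid FROM letters, pattern_occurrences
        WHERE letters.letterid = pattern_occurrences.o_letterid
          AND letters.datecat IN (1,2))
""").fetchone()[0]

# Tom's rewrite: plain join against a SELECT DISTINCT subselect.
join_form = con.execute("""
    SELECT count(patternid)
    FROM patterns,
         (SELECT DISTINCT o_patternid FROM letters, pattern_occurrences
          WHERE letters.letterid = pattern_occurrences.o_letterid
            AND letters.datecat IN (1,2)) AS ss
    WHERE patternid = ss.o_patternid
""").fetchone()[0]

# Letters 10 and 11 qualify (datecat 1,2), yielding o_patternids {1,2,2};
# DISTINCT collapses the duplicate, and both forms count patterns 1 and 2.
assert in_form == join_form == 2
```

The DISTINCT is what makes the two forms interchangeable: without it, the join
would count a pattern once per matching occurrence rather than once per pattern.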

Thanks, everyone, for your help - it's much appreciated! I'll keep y'all
abreast of the project's development.

Andy Perrin

On Thu, 10 Jan 2002, Tom Lane wrote:

> Andrew Perrin <andrew_perrin(at)unc(dot)edu> writes:
> > auth=# EXPLAIN select count(patternid) from patterns where patternid in (select
> > auth(# o_patternid from letters, pattern_occurrences where letters.letterid =
> > auth(# pattern_occurrences.o_letterid and letters.datecat in (1,2));
> > NOTICE: QUERY PLAN:
>
> > Aggregate  (cost=10770432787318.88..10770432787318.88 rows=1 width=4)
> >   ->  Seq Scan on patterns  (cost=0.00..10770432756138.14 rows=12472297 width=4)
> >         SubPlan
> >           ->  Materialize  (cost=863548.43..863548.43 rows=5749731 width=12)
> >                 ->  Hash Join  (cost=1741.00..863548.43 rows=5749731 width=12)
> >                       ->  Seq Scan on pattern_occurrences  (cost=0.00..250248.56 rows=15287556 width=8)
> >                       ->  Hash  (cost=1729.67..1729.67 rows=4530 width=4)
> >                             ->  Seq Scan on letters  (cost=0.00..1729.67 rows=4530 width=4)
>
>
> Well, it's materializing the subquery result, which is good, but are
> there really going to be 5.7M rows in the subquery result? If so,
> no wonder you're hurting: the IN is going to be scanning through that
> result for each row from the outer query, until it either finds a match
> or reaches the end. Can you reduce the size of the subquery result at
> all? (If the subquery as written produces a lot of duplicate
> o_patternids, then making it be a SELECT DISTINCT might help.)
>
> The long-term answer is probably that you need to convert the IN to some
> smarter form of join. One idea that comes to mind is
>
> select count(patternid)
> from patterns,
> (select distinct o_patternid from letters, pattern_occurrences where
> letters.letterid = pattern_occurrences.o_letterid
> and letters.datecat in (1,2)) AS ss
> where patternid = ss.o_patternid;
>
> Given the "select distinct" to ensure there are no duplicates in the
> subselect output, this should produce the same output as the original,
> I think, and it would give the planner a shot at using a merge or hash
> join to match up the pattern id values.
>
> Oh, BTW: you might also try kicking up sort_mem if you didn't already.
>
> regards, tom lane
>
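Tom's point about why the plain IN hurts (a linear scan through the
materialized subquery result for every outer row, versus building a hash of the
de-duplicated values once and probing it) can be sketched outside the database.
The sizes and data below are invented stand-ins, not the thread's real tables:

```python
import random

random.seed(0)
# Hypothetical stand-ins: outer-query pattern ids, and a subquery
# result containing duplicates (like the 5.7M-row o_patternid list).
outer_ids = list(range(2000))
subquery_result = [random.randrange(4000) for _ in range(5000)]

def count_via_scan(ids, materialized):
    # What the Materialize plan does for IN: scan the list per outer
    # row, stopping at the first match -- O(rows * list_len) worst case.
    return sum(1 for i in ids if i in materialized)

def count_via_hash(ids, materialized):
    # What the rewrite enables: hash the DISTINCT values once, then
    # probe per outer row, like a hash join -- roughly O(rows + list_len).
    distinct = set(materialized)
    return sum(1 for i in ids if i in distinct)

# Same answer either way; only the amount of work differs.
assert count_via_scan(outer_ids, subquery_result) == \
       count_via_hash(outer_ids, subquery_result)
```

On the thread's scale (12.4M outer rows against a 5.7M-row list), that
difference in work is what separates "never finishes" from "about 12 minutes".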

----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin(at)unc(dot)edu - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
