Quick Links

Re: sequence query performance issues

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Peter Koczan <pjkoczan(at)gmail(dot)com>
Cc:	pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: sequence query performance issues
Date:	2007-09-28 08:06:11
Message-ID:	46FCB5F3.4080600@archonet.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Peter Koczan wrote:
> Hello,
>
> I have a weird performance issue with a query I'm testing. Basically,
> I'm trying to port a function that generates user uids, and since
> postgres offers a sequence generator function, I figure I'd take
> advantage of that. Basically, I generate our uid range, filter out
> those which are in use, and randomly pick however many I need.
> However, when I run it it takes forever (>10 minutes and I get nothing
> so I cancelled the query) and cpu usage on the server is maxed out.

I'd suspect either an unconstrained join or looping through seq-scans.

> Here's my query (I'll post the explain output later so as not to
> obscure my question):
> => select a.uid from generate_series(1000, 32767) as a(uid) where
> a.uid not in (select uid from people) order by random() limit 1;

I let this run to it's conclusion and it's the materialize. If you see,
it's materializing the result-set once for every value it tests against
(loops=31768)

QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=62722.66..62722.67 rows=1 width=4) (actual
time=189963.485..189963.485 rows=0 loops=1)
-> Sort (cost=62722.66..62723.91 rows=500 width=4) (actual
time=189961.063..189961.063 rows=0 loops=1)
Sort Key: random()
-> Function Scan on generate_series a (cost=184.00..62700.25
rows=500 width=4) (actual time=189960.797..189960.797 rows=0 loops=1)
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=184.00..284.00 rows=10000
width=2) (actual time=0.000..2.406 rows=9372 loops=31768)
-> Seq Scan on people (cost=0.00..174.00
rows=10000 width=2) (actual time=0.055..7.181 rows=10000 loops=1)
Total runtime: 189967.150 ms

Hmm - why is it doing that? It's clearly confused about something.

I suspect the root of the problem is that it doesn't know what
generate_series() will return. To the planner it's just another
set-returning function.

This means it's getting (i) the # of rows wrong (rows=500) and also
doesn't know (ii) there will be no nulls or (iii) what the range of
values returned will be.

Easy enough to test:

CREATE TEMP TABLE all_uids (uid int2);
INSERT INTO all_uids SELECT generate_series(1000,32767);
ANALYSE all_uids;

EXPLAIN ANALYSE SELECT a.uid
FROM all_uids a
WHERE a.uid NOT IN (SELECT uid FROM people)
ORDER BY random() LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1884.14..1884.14 rows=1 width=2) (actual
time=39.019..39.019 rows=0 loops=1)
-> Sort (cost=1884.14..1923.85 rows=15884 width=2) (actual
time=39.014..39.014 rows=0 loops=1)
Sort Key: random()
-> Seq Scan on all_uids a (cost=199.00..775.81 rows=15884
width=2) (actual time=38.959..38.959 rows=0 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on people (cost=0.00..174.00 rows=10000
width=2) (actual time=0.046..7.282 rows=10000 loops=1)
Total runtime: 39.284 ms

That's more sensible.

I'd actually use a table to track unused_uids and have triggers that
kept everything in step. However, if you didn't want to do that, I'd try
a left-join.

EXPLAIN ANALYSE
SELECT a.uid
FROM generate_series(1000, 32767) as a(uid) LEFT JOIN people p ON
a.uid=p.uid
WHERE
p.uid IS NULL
ORDER BY random() LIMIT 1;

Not ideal, but like I say I'd use an unused_uids table. If nothing else,
I'd be wary about immediately re-using a uid - your db+application might
cope fine, but these values have a tendency to be referred to elsewhere.

HTH
--
Richard Huxton
Archonet Ltd

In response to

sequence query performance issues at 2007-09-27 22:04:35 from Peter Koczan

Responses

Re: sequence query performance issues at 2007-09-28 14:16:19 from Tom Lane

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Csaba Nagy	2007-09-28 08:12:10	Re: Searching for the cause of a bad plan
Previous Message	Peter Koczan	2007-09-27 22:04:35	sequence query performance issues