Re: Query planner riddle (array-related?)

From: Markus <m(at)tfiu(dot)de>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Query planner riddle (array-related?)
Date: 2018-05-07 09:12:23
Message-ID: 20180507091223.62252s6ggmzjqdli@victor
Lists: pgsql-general

Hi Tom,

On Fri, May 04, 2018 at 09:32:08AM -0400, Tom Lane wrote:
> Markus <m(at)tfiu(dot)de> writes:
> > I'm running Postgresql 9.6.7 on Debian stretch -- and I'm trying to
> > understand a query plan, with any hint where to gain further insight
> > welcome.
>
> Well, you say
>
> > select count(*) from gaia.dr2light where parallax>50;
> > gives 5400 rows in no time.
>
> but the planner thinks there are 12991627 such rows:

Ah... yeah, the parallax distribution is fairly sharply peaked around
0, so the default estimate for parallax>50 may well be severely off.

So, I've run

alter table gaia.dr2light alter parallax set statistics 1000;
analyze gaia.dr2light;

and lo and behold, both queries become a good deal faster (a couple
of seconds). That's essentially enough to make me happy -- thanks!
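
For anyone wanting to sanity-check the new statistics: something like
this against the stock pg_stats view should show the histogram the
planner now has for parallax (schema, table, and column names as
above):

select histogram_bounds
from pg_stats
where schemaname='gaia'
  and tablename='dr2light'
  and attname='parallax';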

> Also, this sort of thing is usually much easier to diagnose from
> EXPLAIN ANALYZE output. All we can see from these queries is that
> the planner picked what it thought was the lowest-cost plan. Without
> actual rowcounts it's very hard to guess why the estimates were wrong.
> You happened to provide one actual-rowcount number that maybe was
> enough to diagnose the issue; but if the above doesn't do the trick,
> we're going to need to see EXPLAIN ANALYZE to guess what else is up.

With this, the query plans converge to trivial variations of

Hash Join  (cost=253856.23..4775113.84 rows=422 width=1647) (actual time=1967.095..2733.109 rows=18 loops=1)
  Hash Cond: (dr2light.source_id = dr2epochflux.source_id)
  ->  Bitmap Heap Scan on dr2light  (cost=24286.88..4385601.28 rows=1297329 width=132) (actual time=3.113..19.346 rows=5400 loops=1)
        Recheck Cond: (parallax > '50'::double precision)
        Heap Blocks: exact=5184
        ->  Bitmap Index Scan on dr2light_parallax  (cost=0.00..23962.54 rows=1297329 width=0) (actual time=1.721..1.721 rows=5400 loops=1)
              Index Cond: (parallax > '50'::double precision)
  ->  Hash  (cost=118285.38..118285.38 rows=551038 width=1523) (actual time=1885.177..1885.177 rows=550737 loops=1)
        Buckets: 65536  Batches: 16  Memory Usage: 53292kB
        ->  Seq Scan on dr2epochflux  (cost=0.00..118285.38 rows=551038 width=1523) (actual time=0.008..430.692 rows=550737 loops=1)
Planning time: 6.504 ms
Execution time: 2733.653 ms

(with 10% or so jitter in the actual times, obviously); this one is
for

SELECT *
FROM gaia.dr2epochflux
JOIN gaia.dr2light
USING (source_id)
WHERE parallax>50

While that's a reasonable plan and fast enough, I'd still like to
keep the box from seqscanning dr2epochflux with its large arrays and
use that table's index on source_id. If I read the query plan right,
part of the problem is that it still massively overestimates the
number of rows matching parallax>50 (1297329 estimated vs. 5400
actual). Is there anything I can do to improve that estimate?
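
The one knob I can think of myself is pushing the per-column target
all the way up, roughly like this (10000 is the hard limit, and
ANALYZE gets correspondingly slower; I don't know whether that buys
enough resolution in the tail):

alter table gaia.dr2light alter parallax set statistics 10000;
analyze gaia.dr2light;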

But even with that suboptimal estimate, postgres, under the
assumption of something not too far from a uniform distribution on
source_id, should end up estimating the cardinality of the end result
as something like

(selectivity on dr2light)*(cardinality of dr2epochflux),

and hence roughly (1297329/1.6e9)*5e5 = 405 rows to be drawn from
dr2epochflux. It would seem a lot smarter to pull those few hundred
rows via the source_id index on dr2epochflux than to seqscan that
table, no?
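
I suppose I could test that hunch by forcing the planner's hand for
one session and comparing the actual times, along these lines (just
an experiment, not something I'd leave set):

set enable_seqscan = off;
set enable_hashjoin = off;

explain analyze
SELECT *
FROM gaia.dr2epochflux
JOIN gaia.dr2light
USING (source_id)
WHERE parallax>50;

reset enable_seqscan;
reset enable_hashjoin;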

Btw., I've also raised the statistics target for source_id on
dr2light to 1000; so postgres should know source_id isn't uniformly
distributed, but it should also see that it's not *that* far from
uniform.

-- Markus
