Re: Use of additional index columns in rows filtering

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Maxim Ivanov <hi(at)yamlcoder(dot)me>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, markus(dot)winand(at)winand(dot)at
Subject: Re: Use of additional index columns in rows filtering
Date: 2023-08-03 11:20:55
Message-ID: 7e840778-13ff-e7a9-ba3d-8926f99c7caf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/3/23 07:26, Peter Geoghegan wrote:
> On Wed, Aug 2, 2023 at 6:32 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I don't dispute the fact that this can only happen when the planner
>> believes (with good reason) that the expected cost will be lower. But
>> I maintain that there is a novel risk to be concerned about, which is
>> meaningfully distinct from the general risk of regressions that comes
>> from making just about any change to the planner. The important
>> principle here is that we should have a strong bias in the direction
>> of making quals into true "Access Predicates" whenever practical.
>>

OK, preference for access predicates sounds like a reasonable principle.

>> Yeah, technically the patch doesn't directly disrupt how existing
>> index paths get generated. But it does have the potential to disrupt
>> it indirectly, by providing an alternative very-similar index path
>> that's likely to outcompete the existing one in these cases. I think
>> that we should have just one index path that does everything well
>> instead.
>
> You can see this for yourself, quite easily. Start by running the
> relevant query from the regression tests, which is:
>
> SELECT * FROM tenk1 WHERE thousand = 42 AND (tenthous = 1 OR tenthous
> = 3 OR tenthous = 42);
>
> EXPLAIN (ANALYZE, BUFFERS) confirms that the patch makes the query
> slightly faster, as expected. I see 7 buffer hits for the bitmap index
> scan plan on master, versus only 4 buffer hits for the patch's index
> scan. Obviously, this is because we go from multiple index scans
> (multiple bitmap index scans) to only one.
>
> But, if I run this insert statement and try the same thing again,
> things look very different:
>
> insert into tenk1 (thousand, tenthous) select 42, i from
> generate_series(43, 1000) i;
>
> (Bear in mind that we've inserted rows that don't actually need to be
> selected by the query in question.)
>
> Now the master branch's plan works in just the same way as before --
> it has exactly the same overhead (7 buffer hits). Whereas the patch
> still gets the same risky plan -- which now blows up. The plan now
> loses by far more than it could ever really hope to win by: 336 buffer
> hits. (It could be a lot higher than this, even, but you get the
> picture.)

Are you sure? Because if I try this on master (62e9af4c without any
patches), I get this:

regression=# explain (verbose, analyze, buffers) SELECT * FROM tenk1
WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);

QUERY PLAN
------------------------------------------------------------------------
Index Scan using tenk1_thous_tenthous on public.tenk1
(cost=0.29..1416.32 rows=1 width=244) (actual time=0.078..1.361 rows=1
loops=1)
Output: unique1, unique2, two, four, ten, twenty, hundred, thousand,
twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
Index Cond: (tenk1.thousand = 42)
Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR
(tenk1.tenthous = 42))
Rows Removed by Filter: 967
Buffers: shared hit=335
Planning Time: 0.225 ms
Execution Time: 1.430 ms
(8 rows)

So not sure about the claim that master works fine as before. OTOH, the
patched branch (with "my" patch 2023/07/16, just to be clear) does this:

QUERY PLAN
------------------------------------------------------------------------
Index Scan using tenk1_thous_tenthous on public.tenk1
(cost=0.29..23.57 rows=1 width=244) (actual time=0.077..0.669 rows=1
loops=1)
Output: unique1, unique2, two, four, ten, twenty, hundred, thousand,
twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
Index Cond: (tenk1.thousand = 42)
Index Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR
(tenk1.tenthous = 42))
Rows Removed by Index Recheck: 967
Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3) OR
(tenk1.tenthous = 42))
Buffers: shared hit=7
Planning Time: 0.211 ms
Execution Time: 0.722 ms
(9 rows)

Which is just the 7 buffers ...

Did I do something wrong?

>
> Sure, it's difficult to imagine a very general model that captures
> this sort of risk, in the general case. But you don't need a degree in
> actuarial science to understand that it's inherently a bad idea to
> juggle chainsaws -- no matter what your general risk tolerance happens
> to be. Some things you just don't do.
>

Oh come on. I've been juggling chainsaws (lit on fire!) since I was a
little boy! There's nothing risky about it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2023-08-03 11:29:05 Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)
Previous Message Ashutosh Bapat 2023-08-03 11:20:05 Re: Oversight in reparameterize_path_by_child leading to executor crash