Re: POC, WIP: OR-clause support for indexes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, jian he <jian(dot)universality(at)gmail(dot)com>, Nikolay Shaplov <dhyan(at)nataraj(dot)su>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, teodor(at)sigaev(dot)ru, Peter Eisentraut <peter(at)eisentraut(dot)org>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2024-10-03 20:15:30
Message-ID: CAH2-WznLiw20U_Tp8Otm=g7hGcfWnodM4EOGzV=F3W-EvfQDTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 1, 2024 at 6:25 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> I think this patchset got much better, and it could possible be
> committed after another round of cleanup and comment/docs improvement.
> It would be very kind if you share your view on the decisions made in
> this patchset.

I do think that this patch got a lot better, and simpler, but I'm a
little worried about it not covering cases that are only very slightly
different to the ones that you're targeting. It's easiest to see what
I mean using an example.

After the standard regression tests have run, the following tests can
be run from psql (this uses the recent v40 revision):

pg(at)regression:5432 =# create index on tenk1(four, ten); -- setup
CREATE INDEX

Very fast INT_MAX query, since we successful use the transformation
added by the patch:

pg(at)regression:5432 =# explain (analyze,buffers) select * from tenk1
where four = 1 or four = 2_147_483_647 order by four, ten limit 5;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.29..1.73 rows=5 width=244) (actual time=0.011..0.014
rows=5 loops=1) │
│ Buffers: shared hit=4

│ -> Index Scan using tenk1_four_ten_idx on tenk1
(cost=0.29..721.25 rows=2500 width=244) (actual time=0.011..0.012
rows=5 loops=1) │
│ Index Cond: (four = ANY ('{1,2147483647}'::integer[]))

│ Index Searches: 1

│ Buffers: shared hit=4

│ Planning Time: 0.067 ms

│ Execution Time: 0.022 ms

└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Much slower query, which is not capable of applying the transformation
due only to
the fact that I've "inadvertently" mixed together multiple types (int4
and int8):

pg(at)regression:5432 =# explain (analyze,buffers) select * from tenk1
where four = 1 or four = 2_147_483_648 order by four, ten limit 5;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.29..2.08 rows=5 width=244) (actual time=0.586..0.588
rows=5 loops=1) │
│ Buffers: shared hit=1368

│ -> Index Scan using tenk1_four_ten_idx on tenk1
(cost=0.29..900.25 rows=2500 width=244) (actual time=0.586..0.587
rows=5 loops=1) │
│ Index Searches: 1

│ Filter: ((four = 1) OR (four = '2147483648'::bigint))

│ Rows Removed by Filter: 2500

│ Buffers: shared hit=1368

│ Planning Time: 0.050 ms

│ Execution Time: 0.595 ms

└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Do you think this problem can be fixed easily? This behavior seems
surprising, and is best avoided. Performance cliffs that happen when
we tweak one detail of a query just seem worth avoiding on general
principle.

Now that you're explicitly creating RestrictInfos for a particular
index, I suppose that it might be easier to do this kind of thing --
you have more context. Perhaps the patch can be made to recognize
a mix of constants like this as all being associated with the same
B-Tree operator family (the opfamily that the input opclass belongs
to)? Perhaps the constants could all be normalized to the same type via
casts/coercions into the underlying B-Tree input opclass -- that
extra step should be correct ("64.1.2. Behavior of B-Tree Operator Classes"
describes certain existing guarantees that this step would need to rely
on).

Note that the patch already works in cross-type scenarios, with
cross-type operators. The issue I've highlighted is caused by the use
of a mixture of types among the constants themselves -- the patch
wants an array with elements that are all of the same type, which it
can't quite manage. And so I can come up with a cross-type variant
query that *can* still use a SAOP as expected with v40, despite
involving a cross-type = btree operator:

pg(at)regression:5432 [2181876]=# explain (analyze,buffers) select * from
tenk1 where four = 2_147_483_648 or four = 2_147_483_649 order by
four, ten limit 5;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY
PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=0.29..6.53 rows=1 width=244) (actual time=0.004..0.005
rows=0 loops=1) │
│ Buffers: shared hit=2

│ -> Index Scan using tenk1_four_ten_idx on tenk1 (cost=0.29..6.53
rows=1 width=244) (actual time=0.004..0.004 rows=0 loops=1) │
│ Index Cond: (four = ANY
('{2147483648,2147483649}'::bigint[]))

│ Index Searches: 1

│ Buffers: shared hit=2

│ Planning Time: 0.044 ms

│ Execution Time: 0.011 ms

└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

The fact that this third and final example works as expected makes me
even more convinced that the second example should behave similarly.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2024-10-03 20:19:56 Re: Replace IN VALUES with ANY in WHERE clauses during optimization
Previous Message Alena Rybakina 2024-10-03 20:10:25 Re: Replace IN VALUES with ANY in WHERE clauses during optimization