Inefficient nbtree behavior with row-comparison quals

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Inefficient nbtree behavior with row-comparison quals
Date: 2024-05-11 19:19:18
Message-ID: 66001.1715455158@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I spent some time looking into the performance complaint at [1],
which for the sake of self-containedness is

CREATE TABLE t(a int, b int);

INSERT INTO t(a, b)
SELECT
(random() * 123456)::int AS a,
(random() * 123456)::int AS b
FROM
generate_series(1, 12345678);

CREATE INDEX my_idx ON t USING BTREE (a, b);

VACUUM ANALYZE t;

explain (analyze, buffers) select * from t
where row(a, b) > row(123450, 123444) and a = 0
order by a, b;

This produces something like

Index Only Scan using my_idx on t (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1)
Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0))
Heap Fetches: 0
Buffers: shared hit=1 read=33731
Planning:
Buffers: shared read=4
Planning Time: 0.247 ms
Execution Time: 475.744 ms

showing that we are reading practically the whole index, which
is pretty sad considering the index conditions are visibly
mutually contradictory. What's going on? I find that:

1. _bt_preprocess_keys, which is responsible for detecting
mutually-contradictory index quals, fails to do so because it
really just punts on row-comparison quals: it shoves them
directly from input to output scankey array without any
comparisons to other keys. (In particular, that causes the
row-comparison qual to appear before the a = 0 one in the
output scankey array.)

2. The initial-positioning logic in _bt_first chooses "a = 0"
as determining where to start the scan, because it always
prefers equality over inequality keys. (This seems reasonable.)

3. We really should stop the scan once we're past the last a = 0
index entry, which'd at least limit the damage. However, at both
a = 0 and later entries, the row-comparison qual fails causing
_bt_check_compare to return immediately, without examining the
a = 0 key which is marked as SK_BT_REQFWD, and thus it does not
clear "continuescan". Only when we finally reach an index entry
for which the row-comparison qual succeeds do we notice that
a = 0 is failing so we could stop the scan.

So this seems pretty horrid. It would be nice if _bt_preprocess_keys
were smart enough to notice the contradictory nature of these quals,
but I grant that (a) that routine is dauntingly complex already and
(b) this doesn't seem like a common enough kind of query to be worth
moving heaven and earth to optimize.

However, I do think we should do something about the unstated
assumption that _bt_preprocess_keys can emit the quals (for a given
column) in any random order it feels like. This is evidently not so,
and it's probably capable of pessimizing other examples besides this
one. Unless we want to slow down _bt_check_compare by making it
continue to examine quals after the first failure, we need to insist
that required quals appear before non-required quals, thus ensuring
that a comparison failure will clear continuescan if possible.

Even that looks a little nontrivial, because it seems like nbtree
may be making some assumptions about the order in which array keys
appear. I see the bit about

* ... Some reordering of the keys
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.

which I could cope with, but then there's this down around line 2967:

* Note: We do things this way around so that our arrays are
* always in the same order as their corresponding scan keys,
* even with incomplete opfamilies. _bt_advance_array_keys
* depends on this.

However, despite the rather over-the-top verbosity of commenting in
_bt_advance_array_keys, it's far from clear why or how it depends on
that. So I feel a little stuck about what needs to be done here.

regards, tom lane

[1] https://www.postgresql.org/message-id/CAAdwFAxBjyrYUkH7u%2BEceTaztd1QxBtBY1Teux8K%3DvcGKe%3D%3D-A%40mail.gmail.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-05-11 19:32:55 Re: First draft of PG 17 release notes
Previous Message Tomas Vondra 2024-05-11 19:18:21 Re: BitmapHeapScan streaming read user and prelim refactoring