Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Chris Cleveland <ccleveland(at)dieselpoint(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is FOR ORDER BY function getting called when the index is handling ordering?
Date: 2024-05-02 17:21:41
Message-ID: 2246002.1714670501@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Chris Cleveland <ccleveland(at)dieselpoint(dot)com> writes:
> I'm building an index access method which supports an ordering operator:

> CREATE OPERATOR pg_catalog.<<=>> (
> FUNCTION = rdb.rank_match,
> LEFTARG = record,
> RIGHTARG = rdb.RankSpec
> );

Okay ...

> ... there's no reason the system needs to know the actual float value
> returned by rank_match(), the ordering operator distance function. In any
> case, that value can only be calculated based on information in the index
> itself, and can't be calculated by rank_match().

This seems to me to be a very poorly designed concept. An index
ordering operator is an optimization that the planner may or may
not choose to employ. If you've designed your code on the assumption
that that's the only possible plan, it will break for any but the
most trivial queries.

> Nevertheless, the system calls rank_match() after every call to
> amgettuple(), and I can't figure out why.

The ORDER BY value is included in the set of values that the plan
is expected to output. This is so because it's set up to still
work if the planner needs to use an explicit sort step. For
instance, using a trivial table with a couple of bigint columns:

regression=# explain verbose select * from int8_tbl order by q1/2;
QUERY PLAN
----------------------------------------------------------------------
Sort (cost=1.12..1.13 rows=5 width=24)
Output: q1, q2, ((q1 / 2))
Sort Key: ((int8_tbl.q1 / 2))
-> Seq Scan on public.int8_tbl (cost=0.00..1.06 rows=5 width=24)
Output: q1, q2, (q1 / 2)
(5 rows)

The q1/2 column is marked "resjunk", so it doesn't actually get
sent to the client, but it's computed so that the sort step can
use it. Even if I do this:

regression=# create index on int8_tbl ((q1/2));
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# explain verbose select * from int8_tbl order by q1/2;
QUERY PLAN
-------------------------------------------------------------------------------------------
Index Scan using int8_tbl_expr_idx on public.int8_tbl (cost=0.13..12.22 rows=5 width=24)
Output: q1, q2, (q1 / 2)
(2 rows)

... it's still there, because the set of columns that the scan node is
expected to emit doesn't change based on the plan type. We could make
it change perhaps, if we tried hard enough, but so far nobody has
wanted to invest work in that. Note that even if this indexscan
is chosen, that doesn't ensure that we won't need an explicit sort
later, since the query might need joins or aggregation on top of
the scan node. So it's far from trivial to decide that the scan
node doesn't need to emit the sort column.

In any case, I'm uninterested in making the world safe for a
design that's going to fail if the planner doesn't choose an
indexscan on a specific index. That's too fragile.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-05-02 17:50:33 Re: Idea Feedback: psql \h misses -> Offers Links?
Previous Message Matthias van de Meent 2024-05-02 17:20:30 Re: Why is FOR ORDER BY function getting called when the index is handling ordering?