Re: Why is FOR ORDER BY function getting called when the index is handling ordering?

From: Chris Cleveland <ccleveland(at)dieselpoint(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is FOR ORDER BY function getting called when the index is handling ordering?
Date: 2024-05-02 20:42:05
Message-ID: CABSN6VfLK5msEDSR8wPeMh_h3xNyr2Xs5NJK8+uOZp0bVxw7Hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > ... there's no reason the system needs to know the actual float value
> > returned by rank_match(), the ordering operator distance function. In any
> > case, that value can only be calculated based on information in the index
> > itself, and can't be calculated by rank_match().
>
> This seems to me to be a very poorly designed concept. An index
> ordering operator is an optimization that the planner may or may
> not choose to employ. If you've designed your code on the assumption
> that that's the only possible plan, it will break for any but the
> most trivial queries.
>

So the use-case here is an enterprise search engine built using an index
access method. A search engine ranks its result set based on many, many
factors. Among them: token-specific weights, token statistics calculated
across the entire index, field lens, average field lens calculated across
the index, various kinds of linguistic analysis (verbs? nouns?), additional
terms added to the query drawn from other parts of the index, fuzzy terms
based on ngrams from across the index, and a great many other magic tricks.
There are also index-specific parameters that are specified at index time,
both as parameters to the op classes attached to each column, and options
specified by CREATE INDEX ... WITH (...).

If the system must generate a score for ranking using a simple ORDER BY
column_op_constant, it can't, because all that information within the index
itself isn't available.

In any case, I'm uninterested in making the world safe for a
> design that's going to fail if the planner doesn't choose an
> indexscan on a specific index. That's too fragile.
>
>
But that's the reality of search engines. It's also the reason that the
built-in pg full-text search has real limitations.

This isn't just a search engine problem. *Any* index type that depends on
whole-table statistics, or index options, or knowledge about other items in
the index will not be able to calculate a proper score without access to
the index. This applies to certain types of vector search. It could also
apply to a recommendation engine.

In general, it hasn't been a problem for my project because I adjust costs
to force the use of the index. (Yes, I know that doing this is
controversial, but I have little choice, and it works.)

The only area where it has become a problem is in the creation of the junk
column.

I do understand the need for the index to report the value of the sort key
up the food chain, because yes, all kinds of arbitrary re-ordering could
occur. We already have a mechanism for that, though:
IndexScanDesc.xs_orderbyvals. If there were a way for the system to use
that instead of a call to the ordering function, we'd be all set.

It would also be nice if the orderbyval could be made available in the
projection. That way we could report the score() in the result set.

Matthias' response and links touch on some of these issues.

--
Chris Cleveland
312-339-2677 mobile

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-05-02 21:31:00 Re: BitmapHeapScan streaming read user and prelim refactoring
Previous Message Noah Misch 2024-05-02 19:35:55 Re: Weird test mixup