Re: On disable_cost

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, andrew(at)ankane(dot)org
Subject: Re: On disable_cost
Date: 2024-08-23 17:11:45
Message-ID: CA+TgmoaAWQJRfs7n_fC3C+RHKDi--oJ8BwoXjM1NZnnYvaMkCQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2024 at 11:17 AM Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:
> We hit an issue with pgvector[0] where a regular `SELECT count(*) FROM
> table`[1] is attempting to scan the index on the vector column when
> `enable_seqscan` is disabled. Credit to Andrew Kane (CC'd) for flagging it.
>
> I was able to trace this back to e2225346. Here is a reproducer:

If I change EXPLAIN ANALYZE in this test to just EXPLAIN, I get this:

Aggregate (cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00
rows=1 width=8)
-> Index Only Scan using test_embedding_idx on test
(cost=179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00..179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00
rows=5 width=0)

It took me a moment to wrap my head around this: the cost estimate is
309 decimal digits long. Apparently hnswcostestimate() just returns
DBL_MAX when there are no scan keys, because it really, really doesn't
want to be chosen for that kind of scan. Before e2225346, that kept
this plan from being chosen, because DBL_MAX is (much) larger than
disable_cost. But now it doesn't, because one disabled node makes a
path more expensive than any possible non-disabled path, no matter how
astronomical its estimated cost. Since that was the whole point of the
patch, I don't feel too bad about it.
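
To spell out the rule that e2225346 installed: every path now carries a
count of disabled nodes, and that count is compared before cost ever
is. In sketch form (simplified and made up; the real logic lives in
add_path()):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Sketch of the comparison order after e2225346: disabled-node counts
 * are compared first, and costs only break ties. A non-disabled path
 * therefore wins even when its rival's cost estimate is DBL_MAX.
 */
static int
compare_paths_sketch(Path *a, Path *b)
{
    if (a->disabled_nodes != b->disabled_nodes)
        return (a->disabled_nodes < b->disabled_nodes) ? -1 : 1;
    if (a->total_cost != b->total_cost)
        return (a->total_cost < b->total_cost) ? -1 : 1;
    return 0;
}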

I find it a little weird that hnsw thinks itself able to return all
the tuples in an order the user chooses, but unable to return all of
the tuples in an arbitrary order. In core, we have precedent for index
types that can't return individual tuples at all (gin, brin), but none
that can return tuples in principle yet has a panic attack if you don't
say how you want them sorted. I don't quite see why you
couldn't just treat that case the same as ORDER BY
the_first_column_of_the_index, or any other arbitrary rule that you
want to make up. Sure, it might be more expensive than a sequential
scan, but the user said they didn't want a sequential scan. I'm not
quite sure why pgvector thinks it gets to decide that it knows better
than the user, or the rest of the optimizer. I don't really believe it
would always be worse, either: I've seen cases where a table
was badly bloated and mostly empty but its indexes were not bloated,
and in that case an index scan can be a HUGE winner even though it
would normally be a lot worse than a sequential scan.

If you don't want to fix hnsw to work the way the core optimizer
thinks it should, or if there's some reason it can't be done,
alternatives might include (1) having the cost estimate function hack
the count of disabled nodes and (2) adding some kind of core support
for an index cost estimator refusing a path entirely. I haven't tested
(1), so I don't know for sure that there are no issues, but since all
of the cost estimating has to happen before the path can be added, I
think there's a decent chance it would do what you want.
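
If it helps, here is an untested sketch of what (1) might look like in
pgvector's cost estimator (the real estimation logic is stubbed out
with genericcostestimate(), and the function name is made up):

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "utils/selfuncs.h"

/*
 * Untested sketch of alternative (1): when there are no ordering keys,
 * count the path as disabled instead of returning DBL_MAX, then cost
 * the scan normally. The generic estimate below stands in for whatever
 * hnswcostestimate() actually computes.
 */
static void
hnswcostestimate_sketch(PlannerInfo *root, IndexPath *path,
                        double loop_count,
                        Cost *indexStartupCost, Cost *indexTotalCost,
                        Selectivity *indexSelectivity,
                        double *indexCorrelation, double *indexPages)
{
    GenericCosts costs;

    if (path->indexorderbys == NIL)
        path->path.disabled_nodes++;  /* lose to any non-disabled path */

    MemSet(&costs, 0, sizeof(costs));
    genericcostestimate(root, path, loop_count, &costs);

    *indexStartupCost = costs.indexStartupCost;
    *indexTotalCost = costs.indexTotalCost;
    *indexSelectivity = costs.indexSelectivity;
    *indexCorrelation = costs.indexCorrelation;
    *indexPages = costs.numIndexPages;
}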

Also, while I did take the initiative to download pgvector and compile
it and hook up a debugger and figure out what was going on here, I'm
not really too sure that's my job. I do think I have a responsibility
to help maintainers of out-of-core extensions who have problems as a
result of my commits, but I also think it's fair to hope that those
maintainers will try to minimize the amount of time that I need to
spend trying to read code that I did not write and do not maintain.
Fortunately, this wasn't hard to figure out, but in a way that's kind
of the point. That DBL_MAX hack was put there by somebody who must've
understood that they were trying to use a very large cost to disable a
certain path shape completely, and it seems to me that if that person
had studied this case and the commit message for e2225346, they would
likely have understood what had happened pretty quickly. Do you think
that's an unfair feeling on my part?

--
Robert Haas
EDB: http://www.enterprisedb.com
