From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, andrew(at)ankane(dot)org
Subject: Re: On disable_cost
Date: 2024-08-23 18:18:32
Message-ID: c6e9f89f-efe9-4588-a8e3-3c41c89717ca@iki.fi
Lists: pgsql-hackers
On 23/08/2024 20:11, Robert Haas wrote:
> I find it a little weird that hnsw thinks itself able to return all
> the tuples in an order the user chooses, but unable to return all of
> the tuples in an arbitrary order.
HNSW is weird in many ways:
- There is no inherent sort order. It cannot do "ORDER BY column", only
kNN-sort like "ORDER BY column <-> value".
- It's approximate. It's not guaranteed to return the same set of rows
as a sequential scan + sort.
- The number of results it returns is limited by the hnsw.ef_search GUC
(default 40).
- It collects all the results (up to hnsw.ef_search) in memory, and only
then returns them. So if you try to use it with a large number of
results, it can simply run out of memory.
Arguably all of those are bugs in HNSW, but it is what it is. The
algorithm is inherently approximate. Despite that, it's useful in practice.
> In core, we have precedent for index
> types that can't return individual tuples at all (gin, brin) but not
> one that is able to return tuples in concept but has a panic attack if
> you don't know how you want them sorted.
Well, we do also have gin_fuzzy_search_limit. Two wrongs don't make a
right, though; I'd love to get rid of that hack too somehow.
> I don't quite see why you
> couldn't just treat that case the same as ORDER BY
> the_first_column_of_the_index, or any other arbitrary rule that you
> want to make up. Sure, it might be more expensive than a sequential
> scan, but the user said they didn't want a sequential scan. I'm not
> quite sure why pgvector thinks it gets to decide that it knows better
> than the user, or the rest of the optimizer. I don't even think I
> really believe it would always be worse: I've seen cases where a table
> was badly bloated and mostly empty but its indexes were not bloated,
> and in that case an index scan can be a HUGE winner even though it
> would normally be a lot worse than a sequential scan.
Sure, you could make it work. When there's no scan key, it could
construct a vector out of thin air to compare with, or implement a
completely different code path that traverses the full graph in no
particular order.
> If you don't want to fix hnsw to work the way the core optimizer
> thinks it should, or if there's some reason it can't be done,
> alternatives might include (1) having the cost estimate function hack
> the count of disabled nodes and (2) adding some kind of core support
> for an index cost estimator refusing a path entirely. I haven't tested
> (1) so I don't know for sure that there are no issues, but I think we
> have to do all of our cost estimating before we can think about adding
> the path so I feel like there's a decent chance it would do what you
> want.
It would seem useful for an index AM to be able to say "nope, I can't do
this". I don't remember how exactly this stuff works, but I'm surprised
it doesn't already exist.
--
Heikki Linnakangas
Neon (https://neon.tech)