Re: On disable_cost

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, andrew(at)ankane(dot)org
Subject: Re: On disable_cost
Date: 2024-08-23 15:16:57
Message-ID: e87864a0-5983-4890-8663-aeb2b2c43187@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 8/21/24 10:29 AM, Robert Haas wrote:

> I went ahead and committed these patches. I know there's some debate
> over whether we want to show the # of disabled nodes and if so whether
> it should be controlled by COSTS, and I suspect I haven't completely
> allayed David's concerns about the initial_cost_XXX functions although
> I think that I did the right thing. But, I don't have the impression
> that anyone is desperately opposed to the basic concept, so I think it
> makes sense to put these into the tree and see what happens. We have
> quite a bit of time left in this release cycle to uncover bugs, hear
> from users or other developers, etc. about what problems there may be
> with this. If we end up deciding to reverse course or need to fix a
> bunch of stuff, so be it, but let's see what the feedback is.

We hit an issue with pgvector[0] where a regular `SELECT count(*) FROM
table`[1] is attempting to scan the index on the vector column when
`enable_seqscan` is disabled. Credit to Andrew Kane (CC'd) for flagging it.

I was able to trace this back to e2225346. Here is a reproducer:

Setup
=====

CREATE EXTENSION vector;

CREATE OR REPLACE FUNCTION public.generate_random_normalized_vector(dim
integer)
RETURNS vector
LANGUAGE SQL
AS $$
SELECT public.l2_normalize(array_agg(random()::real)::vector)
FROM generate_series(1, $1);
$$;

CREATE TABLE test (id int, embedding vector(128));
INSERT INTO test
SELECT n, public.generate_random_normalized_vector(128)
FROM generate_series(1,5) n;

CREATE INDEX ON test USING hnsw (embedding vector_cosine_ops);

Test
====

SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT count(*) FROM test;

Before e2225346:
----------------

Aggregate (cost=10000041965.00..10000041965.01 rows=1 width=8) (actual
time=189.864..189.864 rows
=1 loops=1)
-> Seq Scan on test (cost=10000000000.00..10000040715.00 rows=5
width=0) (actual time=0.01
8..168.294 rows=5 loops=1)
(4 rows)

With e2225346:
-------------
ERROR: cannot scan hnsw index without order

Some things to note with the ivfflat/hnsw index AMs[3] in pgvector are
that they're used for "ORDER BY" scans exclusively. They currently don't
support index only scans (noting as I tried reproducing the issue with
GIST and couldn't do so because of that), but we wouldn't want to do a
full table "count(*)" on a IVFFlat/HNSW index anyway as it'd be more
expensive than just a full table scan.

Thanks,

Jonathan

[0] https://github.com/pgvector/pgvector
[1] https://github.com/pgvector/pgvector/actions/runs/10519052945
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e2225346
[3] https://github.com/pgvector/pgvector/blob/master/src/hnsw.c#L192

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-08-23 15:59:17 Re: Eager aggregation, take 3
Previous Message Robert Haas 2024-08-23 14:49:17 Re: pgsql: Introduce hash_search_with_hash_value() function