From: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Subject: | Large expressions in indexes can't be stored (non-TOASTable) |
Date: | 2024-09-03 16:35:42 |
Message-ID: | b611015f-b423-458c-aa2d-be0e655cc1b4@postgresql.org |
Lists: | pgsql-hackers |
Hi,
I ran into an issue (previously discussed[1]; quoting Andres out of
context that not addressing it then would "[a]ll but guarantee that
we'll have this discussion again"[2]) when trying to build a very large
expression index that did not fit within the page boundary. The
real-world use case was related to a vector search technique where I
wanted to use binary quantization based on the relationship between a
constant vector (the average at a point-in-time across the entire data
set) and the target vector[3][4]. An example:
CREATE INDEX ON embeddings
USING hnsw((quantization_func(embedding, $VECTOR)) bit_hamming_ops);
However, I ran into the issue described in [1], where pg_index was
identified as a catalog that is missing a TOAST table, even though
`indexprs` is marked for extended storage. A very simple reproducer in
psql:
----
CREATE TABLE def (id int);
SELECT array_agg(n) b FROM generate_series(1,10_000) n \gset
CREATE OR REPLACE FUNCTION vec_quantizer (a int, b int[]) RETURNS bool
AS $$ SELECT true $$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX ON def (vec_quantizer(id, :'b'));
ERROR: row is too big: size 29448, maximum size 8160
---
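For reference, the mismatch can be seen directly in the catalogs. This is a sketch; the values shown in the comments are what I would expect on an unpatched server:

```sql
-- indexprs is marked for extended (TOAST-able) storage...
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'pg_index'::regclass AND attname = 'indexprs';
-- expected: attstorage = 'x'

-- ...but pg_index itself has no TOAST table to spill into:
SELECT reltoastrelid
FROM pg_class
WHERE oid = 'pg_index'::regclass;
-- expected: reltoastrelid = 0
```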
This can come up with vector searches, as vectors can be quite large:
the case I was testing involved a 1536-dimension floating-point vector
(~6KB), and the node parse tree pushed past the page boundary by about 2KB.
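The ~6KB figure follows from back-of-the-envelope arithmetic; assuming 4 bytes per float element plus a small per-value header (the exact header size is an implementation detail of the vector type):

```sql
-- Rough on-disk size of a 1536-dim single-precision vector,
-- assuming an ~8-byte header: well past half of an 8KB page
-- before the rest of the index expression's parse tree is added.
SELECT 8 + 4 * 1536 AS approx_vector_bytes;  -- ~6KB
```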
One could argue that pgvector or another extension could build in
capabilities to handle quantization internally, without requiring the
user to provide a source vector (pgvectorscale does this). However, that
also limits flexibility for users, who may want to bring their own
quantization functions to vector search, e.g., as different quantization
techniques emerge, or if a particular technique is better suited to a
person's dataset.
Thanks,
Jonathan
[1]
https://www.postgresql.org/message-id/flat/84ddff04-f122-784b-b6c5-3536804495f8%40joeconway.com
[2]
https://www.postgresql.org/message-id/20180720000356.5zkhvfpsqswngyob%40alap3.anarazel.de
[3] https://github.com/pgvector/pgvector
[4] https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/