Large expressions in indexes can't be stored (non-TOASTable)

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Subject: Large expressions in indexes can't be stored (non-TOASTable)
Date: 2024-09-03 16:35:42
Message-ID: b611015f-b423-458c-aa2d-be0e655cc1b4@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I ran into an issue (previously discussed[1]; quoting Andres out of
context that not addressing it then would "[a]ll but guarantee that
we'll have this discussion again"[2]) when trying to build a very large
expression index that did not fit within the page boundary. The
real-world use case was related to a vector search technique where I
wanted to use binary quantization based on the relationship between a
constant vector (the average at a point-in-time across the entire data
set) and the target vector[3][4]. An example:

CREATE INDEX ON embeddings
USING hnsw((quantization_func(embedding, $VECTOR)) bit_hamming_ops);

However, I ran into the issue in[1], where pg_index was identified as
catalog that is missing a toast table, even though `indexprs` is marked
for extended storage. Providing a very simple reproducer in psql below:

----
CREATE TABLE def (id int);
SELECT array_agg(n) b FROM generate_series(1,10_000) n \gset
CREATE OR REPLACE FUNCTION vec_quantizer (a int, b int[]) RETURNS bool
AS $$ SELECT true $$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX ON def (vec_quantizer(id, :'b'));

ERROR: row is too big: size 29448, maximum size 8160
---

This can come up with vector searches as vectors can be quite large -
the case I was testing involved a 1536-dim floating point vector (~6KB),
and the node parse tree pushed past the page boundary by about 2KB.

One could argue that pgvector or an extension can build in capabilities
to handle quantization internally without requiring the user to provide
a source vector (pgvectorscale does this). However, this also limits
flexibility to users, as they may want to bring their own quantization
functions to vector searches, e.g., as different quantization techniques
emerge, or if a particular technique is more suitable for a person's
dataset.

Thanks,

Jonathan

[1]
https://www.postgresql.org/message-id/flat/84ddff04-f122-784b-b6c5-3536804495f8%40joeconway.com
[2]
https://www.postgresql.org/message-id/20180720000356.5zkhvfpsqswngyob%40alap3.anarazel.de
[3] https://github.com/pgvector/pgvector
[4] https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-09-03 16:42:01 Re: Inline non-SQL SRFs using SupportRequestSimplify
Previous Message Tomas Vondra 2024-09-03 16:19:45 Re: scalability bottlenecks with (many) partitions (and more)