Re: ZStandard (with dictionaries) compression support for TOAST compression

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nikhil Kumar Veldanda <veldanda(dot)nikhilkumar17(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Date: 2025-04-18 16:22:18
Message-ID: CA+Tgmob1Ux7R2_0mbZmTbsVmUda04X1AdHGjpu6FkZfdbztugQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 15, 2025 at 2:13 PM Nikhil Kumar Veldanda
<veldanda(dot)nikhilkumar17(at)gmail(dot)com> wrote:
> Addressing Compressed Datum Leaks problem (via CTAS, INSERT INTO ... SELECT ...)
>
> As compressed datums can be copied to other unrelated tables via CTAS,
> INSERT INTO ... SELECT, or CREATE TABLE ... EXECUTE, I’ve introduced a
> method inheritZstdDictionaryDependencies. This method is invoked at
> the end of such statements and ensures that any dictionary
> dependencies from source tables are copied to the destination table.
> We determine the set of source tables using the relationOids field in
> PlannedStmt.

With the disclaimer that I haven't opened the patch or thought
terribly deeply about this issue, at least not yet, my fairly strong
suspicion is that this design is not going to work out, for multiple
reasons. In no particular order:

1. I don't think users will like it if dependencies on a zstd
dictionary spread like kudzu across all of their tables. I don't think
they'd like it even if it were 100% accurate, but presumably this is
going to add dependencies any time there MIGHT be a real dependency
rather than only when there actually is one.

2. Inserting into a table or updating it only takes RowExclusiveLock,
which is not even self-exclusive. I doubt that it's possible to change
system catalogs in a concurrency-safe way with such a weak lock. For
instance, if two sessions tried to do the same thing in concurrent
transactions, they could both try to add the same dependency at the
same time.

3. I'm not sure that CTAS, INSERT INTO...SELECT, and CREATE
TABLE...EXECUTE are the only ways that datums can creep from one table
into another. For example, what if I create a plpgsql function that
gets a value from one table and stores it in a variable, and then use
that variable to drive an INSERT into another table? I seem to recall
there are complex cases involving records and range types and arrays,
too, where the compressed object gets wrapped inside of another
object; though maybe that wouldn't matter to your implementation if
INSERT INTO ... SELECT uses a sufficiently aggressive strategy for
adding dependencies.

When Dilip and I were working on lz4 TOAST compression, my first
instinct was to not let LZ4-compressed datums leak out of a table by
forcing them to be decompressed (and then possibly recompressed). We
spent a long time trying to make that work before giving up. I think
this is approximately where things started to unravel, and I'd suggest
you read both this message and some of the discussion before and
after:

https://www.postgresql.org/message-id/20210316185455.5gp3c5zvvvq66iyj@alap3.anarazel.de

I think we could add plain-old zstd compression without really
tackling this issue, but if we are going to add dictionaries then I
think we might need to revisit the idea of preventing things from
leaking out of tables. What I can't quite remember at the moment is
how much of the problem was that it was going to be slow to force the
recompression, and how much of it was that we weren't sure we could
even find all the places in the code that might need such handling.

I'm now also curious to know whether Andres would agree that it's bad
if zstd dictionaries are un-droppable. After all, I thought it would
be bad if there was no way to eliminate a dependency on a compression
method, and he disagreed. So maybe he would also think undroppable
dictionaries are fine. But maybe not. It seems even worse to me than
undroppable compression methods, because you'll probably not have that
many compression methods ever, but you could have a large number of
dictionaries eventually.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2025-04-18 17:01:17 Re: [PoC] Federated Authn/z with OAUTHBEARER
Previous Message Robert Haas 2025-04-18 15:31:19 Re: magical eref alias names