From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikhil Kumar Veldanda <veldanda(dot)nikhilkumar17(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Date: 2025-04-25 00:48:08
Message-ID: aArbyEowXBiAhM0a@paquier.xyz
Lists: pgsql-hackers
On Wed, Apr 23, 2025 at 11:59:26AM -0400, Robert Haas wrote:
> That's nice to know, but I think the key question is not so much what
> the feature costs when it is used but what it costs when it isn't
> used. If we implement a system where we don't let
> dictionary-compressed zstd datums leak out of tables, that's bound to
> slow down a CTAS from a table where this feature is used, but that's
> kind of OK: the feature has pros and cons, and if you don't like those
> tradeoffs, you don't have to use it. However, it sounds like this
> could also slow down inserts and updates in some cases even for users
> who are not making use of the feature, and that's going to be a major
> problem unless it can be shown that there is no case where the impact
> is at all significant. Users hate paying for features that they aren't
> using.
The cost of digesting a dictionary when decompressing sets of values
is also something I think we should worry about, FWIW (see [1]), as
digesting is documented as costly, so there is also an argument for
making the feature efficient when it is used. That would hurt if a
sequential scan needs to detoast multiple blobs with the same dict.
If we attach that information on a per-value basis, wouldn't it imply
that we need to digest the dictionary every time a blob is
decompressed? This information could be cached, but it seems a bit
weird to me to invent a new level of relation caching for what could
be attached as a relation attribute option in the relcache. If a
dictionary gets trained with a new sample of values, we could rely on
the invalidation to pass down the new information.
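
To illustrate the digesting cost I have in mind, here is a minimal
sketch (mine, not from the patch) using the plain zstd API: the
dictionary is digested once with ZSTD_createDDict() and reused for
every blob of a scan, which is roughly what a relcache-level cache
would buy us. The helper and buffer handling are hypothetical, and
each call simply overwrites dst:

    #include <stdio.h>
    #include <stdlib.h>
    #include <zstd.h>

    size_t
    decompress_many(const void *dict_buf, size_t dict_size,
                    const void **blobs, const size_t *blob_sizes,
                    int nblobs, void *dst, size_t dst_capacity)
    {
        /* Costly step: digest the dictionary a single time. */
        ZSTD_DDict *ddict = ZSTD_createDDict(dict_buf, dict_size);
        ZSTD_DCtx  *dctx = ZSTD_createDCtx();
        size_t      total = 0;

        for (int i = 0; i < nblobs; i++)
        {
            /* Cheap step: reuse the digested dictionary per blob. */
            size_t res = ZSTD_decompress_usingDDict(dctx, dst,
                                                    dst_capacity,
                                                    blobs[i],
                                                    blob_sizes[i],
                                                    ddict);

            if (ZSTD_isError(res))
            {
                fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(res));
                break;
            }
            total += res;
        }

        ZSTD_freeDCtx(dctx);
        ZSTD_freeDDict(ddict);
        return total;
    }

Without that reuse, every detoast would pay the ZSTD_createDDict()
cost again.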
Based on what I'm reading (and I know very little about the topic, so
I may be wrong), does it even make sense to allow multiple
dictionaries to be used for a single attribute? Of course that may
depend on the JSON blob patterns a single attribute is dealing with,
but I'm not sure that this is worth the extra complexity it creates.
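
FWIW, zstd does record a dictionary ID in the frame header when a
dictionary was used at compression time, so if multiple dictionaries
per attribute were ever allowed, the correct one could in principle
be looked up per value. A small sketch with the standard API, where
lookup_ddict_by_id() is a hypothetical cache lookup:

    #include <zstd.h>

    /* Hypothetical: map a dictionary ID to a cached digested dict. */
    extern ZSTD_DDict *lookup_ddict_by_id(unsigned dict_id);

    static size_t
    decompress_with_frame_dict(ZSTD_DCtx *dctx,
                               void *dst, size_t dst_capacity,
                               const void *src, size_t src_size)
    {
        /* Dictionary ID stored in the frame header, 0 if none. */
        unsigned    dict_id = ZSTD_getDictID_fromFrame(src, src_size);
        ZSTD_DDict *ddict = (dict_id != 0) ?
            lookup_ddict_by_id(dict_id) : NULL;

        if (ddict != NULL)
            return ZSTD_decompress_usingDDict(dctx, dst, dst_capacity,
                                              src, src_size, ddict);

        /* No dictionary referenced by this frame: plain decompression. */
        return ZSTD_decompressDCtx(dctx, dst, dst_capacity,
                                   src, src_size);
    }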
> I wonder if there's a possible design where we only allow
> dictionary-compressed datums to exist as top-level attributes in
> designated tables to which those dictionaries are attached; and any
> time you try to bury that Datum inside a container object (row, range,
> array, whatever) detoasting is forced. If there's a clean and
> inexpensive way to implement that, then you could avoid having
> heap_toast_insert_or_update care about HeapTupleHasExternal(), which
> seems like it might be a key point.
Interesting, not sure.
FWIW, I'd still try to focus on making varatt more extensible with
plain zstd support first, before diving into all these details. We
are going to need it anyway.
[1]: https://facebook.github.io/zstd/zstd_manual.html#Chapter10
--
Michael