From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Nikhil Kumar Veldanda <veldanda(dot)nikhilkumar17(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: ZStandard (with dictionaries) compression support for TOAST compression |
Date: | 2025-04-21 07:02:37 |
Message-ID: | aAXtjbBHdOjForut@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Apr 18, 2025 at 12:22:18PM -0400, Robert Haas wrote:
> I think we could add plain-old zstd compression without really
> tackling this issue, but if we are going to add dictionaries then I
> think we might need to revisit the idea of preventing things from
> leaking out of tables. What I can't quite remember at the moment is
> how much of the problem was that it was going to be slow to force the
> recompression, and how much of it was that we weren't sure we could
> even find all the places in the code that might need such handling.
FWIW, this point resonates here. There is one thing that we have to
do anyway: we just have one bit left in the varlena headers as lz4 is
using the one before last. So we have to make it extensible, even if
it means that any compression method other than LZ4 and pglz would
consume one more byte in its header by default. And I think that this
has to happen at some point if we want flexibility in this area.
+ struct
+ {
+ uint32 va_header;
+ uint32 va_tcinfo;
+ uint32 va_cmp_alg;
+ uint32 va_cmp_dictid;
+ char va_data[FLEXIBLE_ARRAY_MEMBER];
+ } va_compressed_ext;
Speaking of which, I am confused by this abstraction choice in
varatt.h in the first patch. Are we sure that we are always going to
have a dictionary attached to a compressed data set or even a
va_cmp_alg? It seems to me that this could lead to a waste of data in
some cases because these fields may not be required depending on the
compression method used, as some fields may not care about these
details. This kind of data should be made optional, on a per-field
basis.
One thing that I've been wondering is how it would be possible to make
the area around varattrib_4b more readable while dealing with more
extensibility. It would be a good occasion to improve that, even if
I'm hand-waving here currently and that the majority of this code is
old enough to vote, with few modifications across the years.
The second thing that I'd love to see on top of the addition of the
extensibility is adding plain compression support for zstd, with
nothing fancy, just the compression and decompression bits. I've done
quite a few benchmarks with the two, and results kind of point in the
direction that zstd is more efficient than lz4 overall. Don't take me
wrong: lz4 can be better in some workloads as it can consume less CPU
than zstd while compressing less. However, a comparison of ratios
like (compression rate / cpu used) has always led me to see zstd as
superior in a large number of cases. lz4 is still very good if you
are CPU-bound and don't care about the extra space required. Both are
three classes better than pglz.
Once we have these three points incrementally built-in together (the
last bit extensibility, the potential varatt.h refactoring and the
zstd support), there may be a point in having support for more
advanced options with the compression methods in the shape of dicts or
more requirements linked to other compression methods, but I think the
topic is complex enough that we should make sure that these basics are
implemented in a way sane enough so as we'd be able to extend them
with all the use cases in mind.
--
Michael
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2025-04-21 07:29:56 | Re: doc patch: clarify the naming rule for injection_points |
Previous Message | Michael Paquier | 2025-04-21 06:22:28 | Re: Add Pipelining support in psql |