From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | gkokolatos(at)pm(dot)me |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, shiy(dot)fnst(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Rachel Heaton <rachelmheaton(at)gmail(dot)com> |
Subject: | Re: Add LZ4 compression in pg_dump |
Date: | 2023-02-27 04:49:10 |
Message-ID: | 20230227044910.GO1653@telsasoft.com |
Lists: | pgsql-hackers |
On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression. The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.
>
> One more - WriteDataToArchiveGzip() says:
One more again.
The LZ4 path is using non-streaming mode, which compresses each block
without persistent state, giving poor compression for -Fc compared with
-Fp. If the data is highly compressible, the difference can be orders
of magnitude.
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
12351763
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
21890708
That's not true for gzip:
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
2118869
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
2115832
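As a rough illustration of why compressing each row without persistent state hurts, here is a minimal sketch. It is not part of the attached patches. It uses Python's zlib as a stand-in for LZ4 (an assumption made so the example runs with only the standard library), and the rows are made-up data, not the table measured above.

```python
import zlib

# Hypothetical rows standing in for table data; not taken from the thread.
rows = [f"{i}\tsome fairly compressible payload text\n".encode()
        for i in range(1000)]

# Non-streaming analogue: every row is compressed on its own,
# so the compressor never sees the previous rows.
independent = sum(len(zlib.compress(row)) for row in rows)

# Streaming analogue: one compressor keeps its history across rows,
# although its output is still flushed after each row.
stream = zlib.compressobj()
streamed = 0
for row in rows:
    streamed += len(stream.compress(row))
    streamed += len(stream.flush(zlib.Z_SYNC_FLUSH))
streamed += len(stream.flush())

raw = sum(len(row) for row in rows)
print("raw:", raw, "independent:", independent, "streamed:", streamed)
```

With rows this short, the independent variant has almost nothing to match against, while the streaming variant can refer back to earlier rows. The absolute numbers will differ from LZ4's; only the direction of the effect is meant to carry over.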
The function ought to at least use streaming mode, so each block/row
isn't compressed in isolation. 003 is a simple patch to use
streaming mode, which improves the -Fc case:
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
15178283
However, that still flushes the compression buffer, writing a block
header, for every row. With a single-column table, pg_dump -Fc -Z lz4
still outputs ~10% *more* data than with no compression at all. And
that's for compressible data.
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
12890296
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
11890296
I think this should use the LZ4F API with frames, which are buffered to
avoid outputting a header for every single row. The LZ4F format isn't
compatible with the LZ4 format, so (unlike changing to the streaming
API) that's not something we can change in a bugfix release. I consider
this an Open Item.
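The cost of flushing for every row, as opposed to letting the compressor buffer its input, can be sketched in the same spirit. Again this is only an illustration under stated assumptions: zlib stands in for LZ4/LZ4F, the helper `compressed_size` and the sample rows are hypothetical, and nothing here reflects the actual pg_dump code.

```python
import zlib

# Hypothetical rows standing in for a single-column table.
rows = [f"{i}\tsome fairly compressible payload text\n".encode()
        for i in range(1000)]

def compressed_size(rows, flush_every_row):
    """Return the total compressed bytes produced for the given rows."""
    comp = zlib.compressobj()
    total = 0
    for row in rows:
        total += len(comp.compress(row))
        if flush_every_row:
            # Force pending output to be emitted now, ending the current block.
            total += len(comp.flush(zlib.Z_SYNC_FLUSH))
    # Emit whatever is still buffered and finish the stream.
    total += len(comp.flush())
    return total

flushed = compressed_size(rows, flush_every_row=True)
buffered = compressed_size(rows, flush_every_row=False)
print("flushed per row:", flushed, "buffered:", buffered)
```

Each forced flush adds a few bytes of block overhead, which adds up when the rows themselves are small; the buffered variant pays that cost only when its internal buffer fills or the stream ends.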
With the LZ4F API in 004, -Fp and -Fc are essentially the same size
(like gzip). (Oh, and the output is three times smaller, too.)
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
4155448
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
4156548
--
Justin
Attachment | Content-Type | Size |
---|---|---|
0001-f-fixes-for-LZ4.patch | text/x-diff | 3.2 KB |
0002-f-fixes-for-LZ4-which-also-conflict-with-the-ZSTD-pa.patch | text/x-diff | 2.5 KB |
0003-pg_dump-lz4-use-lz4-streaming-compression.patch | text/x-diff | 1.7 KB |
0004-WIP-change-to-use-LZ4-frame-API.patch | text/x-diff | 5.0 KB |