Re: Add LZ4 compression in pg_dump

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: gkokolatos(at)pm(dot)me
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Rachel Heaton <rachelmheaton(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Add LZ4 compression in pg_dump
Date: 2022-07-05 15:13:28
Message-ID: 20220705151328.GQ13040@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This is a review of 0001.

On Tue, Jul 05, 2022 at 01:22:47PM +0000, gkokolatos(at)pm(dot)me wrote:
> Simply this patchset had started to divert
> heavily already based on comments from Mr. Paquier who had already requested for
> the APIs to be refactored to use function pointers. This is happening in 0002 of
> the patchset.

I said something about reducing ifdefs, but I'm having trouble finding what
Michael said about this ?

> > On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:
> >
> > > LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.
> > >
> > > I ran into that on an ubuntu LTS, so I don't think it's so old that it
> > > shouldn't be handled more gracefully. LZ4 should either have an explicit
> > > version check, or else shouldn't depend on that feature (or should define a
> > > safe fallback version if the library header doesn't define it).
> > > https://packages.ubuntu.com/liblz4-1

The constant still seems to be used without defining a fallback or a minimum version.

> > > 0003: typo: of legacy => or legacy

This is still there

> > > You renamed this:
> > >
> > > |- COMPR_ALG_LIBZ
> > > |-} CompressionAlgorithm;
> > > |+ COMPRESSION_GZIP,
> > > |+} CompressionMethod;
> > >
> > > ..But I don't think that's an improvement. If you were to change it, it should
> > > say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
> > > structs and typedefs. zlib is not idential to gzip, which uses a different
> > > header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.

This comment still applies - zlib's gz* functions are "gzip" but the others are
"zlib". https://zlib.net/manual.html

That affects both the 0001 and 0002 patches.

Actually, I think that "gzip" should not be the name of the user-facing option,
since (except for "plain" format) it isn't using gzip.

+Robert, since this suggests amending parse_compress_algorithm(). Maybe "zlib"
should be parsed the same way as "gzip" - I don't think we ever expose both to
a user, but in some cases (basebackup and pg_dump -Fp -Z1) the output is "gzip"
and in some cases NO it's zlib (pg_dump -Fc -Z1).

> > > The cf* changes in pg_backup_archiver could be split out into a separate
> > > commit. It's strictly a code simplification - not just preparation for more
> > > compression algorithms. The commit message should "See also:
> > > bf9aa490db24b2334b3595ee33653bf2fe39208c".

I still think this could be an early, 0000 patch.

> > > freebsd/cfbot is failing.

This is still failing for bsd, windows and compiler warnings.
Windows also has compiler warnings.
http://cfbot.cputube.org/georgios-kokolatos.html

Please see: src/tools/ci/README, which you can use to run check-world on 4 OS
by pushing a branch to github.

> > > I suggested off-list to add an 0099 patch to change LZ4 to the default, to
> > > exercise it more on CI.

What about this ? I think the patch needs to pass CI on all 4 OS with
default=zlib and default=lz4.

> > On Sat, Mar 26, 2022 at 01:33:36PM -0500, Justin Pryzby wrote:

> @@ -254,7 +251,12 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
> Archive *
> OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
> {
> - ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
> + ArchiveHandle *AH;
> + pg_compress_specification compress_spec;

Should this be initialized to {0} ?

> @@ -969,6 +969,8 @@ NewRestoreOptions(void)
> opts->format = archUnknown;
> opts->cparams.promptPassword = TRI_DEFAULT;
> opts->dumpSections = DUMP_UNSECTIONED;
> + opts->compress_spec.algorithm = PG_COMPRESSION_NONE;
> + opts->compress_spec.level = INT_MIN;

Why INT_MIN ?

> @@ -1115,23 +1117,28 @@ PrintTOCSummary(Archive *AHX)
> ArchiveHandle *AH = (ArchiveHandle *) AHX;
> RestoreOptions *ropt = AH->public.ropt;
> TocEntry *te;
> + pg_compress_specification out_compress_spec;

Should have {0} ?
I suggest to write it like my 2020 patch for this, which says:
no_compression = {0};

> /* Open stdout with no compression for AH output handle */
> - AH->gzOut = 0;
> - AH->OF = stdout;
> + out_compress_spec.algorithm = PG_COMPRESSION_NONE;
> + AH->OF = cfdopen(dup(fileno(stdout)), PG_BINARY_A, out_compress_spec);

Ideally this should check the success of dup().

> @@ -3776,21 +3746,25 @@ ReadHead(ArchiveHandle *AH)
> + if (AH->compress_spec.level != INT_MIN)

Why is it testing the level and not the algorithm ?

> --- a/src/bin/pg_dump/pg_backup_custom.c
> +++ b/src/bin/pg_dump/pg_backup_custom.c
> @@ -298,7 +298,7 @@ _StartData(ArchiveHandle *AH, TocEntry *te)
> _WriteByte(AH, BLK_DATA); /* Block type */
> WriteInt(AH, te->dumpId); /* For sanity check */
>
> - ctx->cs = AllocateCompressor(AH->compression, _CustomWriteFunc);
> + ctx->cs = AllocateCompressor(AH->compress_spec, _CustomWriteFunc);

Is it necessary to rename the data structure ?
If not, this file can remain unchanged.

> --- a/src/bin/pg_dump/pg_backup_directory.c
> +++ b/src/bin/pg_dump/pg_backup_directory.c
> @@ -573,6 +574,7 @@ _CloseArchive(ArchiveHandle *AH)
> if (AH->mode == archModeWrite)
> {
> cfp *tocFH;
> + pg_compress_specification compress_spec;

Should use {0} ?

> @@ -639,12 +642,14 @@ static void
> _StartBlobs(ArchiveHandle *AH, TocEntry *te)
> {
> lclContext *ctx = (lclContext *) AH->formatData;
> + pg_compress_specification compress_spec;

Same

> + /*
> + * Custom and directory formats are compressed by default (zlib), others
> + * not
> + */
> + if (user_compression_defined == false)

Should be: !user_compression_defined

Your 0001+0002 patches (without 0003) fail to compile:

pg_backup_directory.c: In function ‘_ReadByte’:
pg_backup_directory.c:519:12: error: ‘CompressFileHandle’ {aka ‘struct CompressFileHandle’} has no member named ‘_IO_getc’
519 | return CFH->getc(CFH);
| ^~
pg_backup_directory.c:520:1: warning: control reaches end of non-void function [-Wreturn-type]
520 | }

--
Justin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema 2022-07-05 15:23:04 Re: [EXTERNAL] Re: Support load balancing in libpq
Previous Message Tom Lane 2022-07-05 14:33:19 Re: Wrong provolatile value for to_timestamp (1 argument)