
Re: zstd compression for pg_dump

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	Jacob Champion <jchampion(at)timescale(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject:	Re: zstd compression for pg_dump
Date:	2023-03-04 16:57:48
Message-ID:	20230304165747.GH12850@telsasoft.com
Lists:	pgsql-hackers

On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah. I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe equally or more similar, than the further
> > data, from which there's nothing to gain.
>
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.

On customer data, I see small improvements - below 10%.

But on my first two tries, I made synthetic data sets where the improvement is large:

$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695

That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,999999);

In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state. But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.
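
A rough illustration of the shared-state point, not pg_dump code: the sketch below uses Python's zlib as a stand-in for zstd (an assumption made only so it runs without extra libraries) and compresses six identical chunks either with one compressor per chunk, the way tables in custom format don't share state, or as one continuous stream, the way plain format is written. The function names and the 8 KiB chunk size are hypothetical, chosen so the chunks fit inside zlib's 32 KiB window.

```python
import random
import zlib


def make_chunk(size, seed=0):
    """Return `size` pseudo-random bytes, which barely compress on their own."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def separate_streams_size(chunks):
    """Total size when every chunk gets its own compressor (no shared state)."""
    return sum(len(zlib.compress(chunk)) for chunk in chunks)


def single_stream_size(chunks):
    """Size when all chunks pass through one compressor, so later chunks
    can reference earlier ones that are still inside the window."""
    comp = zlib.compressobj()
    total = 0
    for chunk in chunks:
        total += len(comp.compress(chunk))
    total += len(comp.flush())
    return total


if __name__ == "__main__":
    # Six identical 8 KiB "tables" (hypothetical stand-ins for table data).
    tables = [make_chunk(8192)] * 6
    print("separate streams:", separate_streams_size(tables))
    print("single stream:   ", single_stream_size(tables))
```

With separate compressors, each copy is paid for in full; in the single stream, only the first copy is expensive, provided the earlier copy is still within the match window.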

Here's one where custom format *does* benefit, due to long-distance
repetition within a single table. The data is contrived, but the schema
of ID => data is not. What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.

pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267
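
To make the match-distance effect concrete without a database, here is a minimal sketch that again substitutes Python's zlib for zstd (an assumption; zlib's `wbits` window parameter is only an analogy for long-distance matching, at a much smaller scale). It compresses a payload that repeats every 20 KiB, once with a window too small to reach the previous repetition and once with a window that can. The helper names and sizes are hypothetical.

```python
import random
import zlib


def repeated_payload(block_size, repeats, seed=0):
    """Build data made of one pseudo-random block repeated `repeats` times,
    so the only redundancy is at a distance of `block_size` bytes."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block * repeats


def compressed_size(data, wbits):
    """Deflate `data` with a 2**wbits-byte window and return the output size."""
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return len(comp.compress(data)) + len(comp.flush())


if __name__ == "__main__":
    data = repeated_payload(block_size=20 * 1024, repeats=4)
    # 1 KiB window: the previous repetition is out of reach.
    print("small window:", compressed_size(data, wbits=10))
    # 32 KiB window: the previous repetition can be matched.
    print("large window:", compressed_size(data, wbits=15))
```

The data is identical in both runs; only the distance the compressor is allowed to look back changes, which is the same kind of difference the two pg_dump invocations above are measuring.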

--
Justin

In response to

Re: zstd compression for pg_dump at 2023-03-03 21:38:05 from Jacob Champion

Responses

Re: zstd compression for pg_dump at 2023-03-08 18:59:23 from Jacob Champion
