From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Jacob Champion <jchampion(at)timescale(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> |
Subject: | Re: zstd compression for pg_dump |
Date: | 2023-03-04 16:57:48 |
Message-ID: | 20230304165747.GH12850@telsasoft.com |
Lists: | pgsql-hackers |
On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah. I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe as similar as or more similar than the
> > more distant data, so there's nothing to gain from looking further.
>
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.
On customer data, I see small improvements - below 10%.
But on my first two tries, I made synthetic data sets where the improvement is large (about 6x in this one):
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695
That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,999999);
In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state. But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.
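To illustrate the point about shared compressor state, here is a minimal Python sketch. It is not pg_dump code and it does not use zstd: it substitutes the standard library's lzma module as a stand-in, and table_dump and compare are hypothetical helper names. It compresses six identical "tables" once as a single stream (roughly what plain format does) and once with a fresh compressor per table (roughly what custom format does).

```python
import lzma


def table_dump(rows: int = 50_000) -> bytes:
    """Text resembling a plain-format dump of generate_series(1, rows)."""
    return "".join(f"{i}\n" for i in range(1, rows + 1)).encode()


def compare(tables: int = 6) -> tuple[int, int]:
    """Return (single-stream size, sum of per-table stream sizes) in bytes."""
    one = table_dump()
    # Plain-format analogue: all tables flow through one compressor, so
    # later copies can refer back to the first one.
    shared = len(lzma.compress(one * tables))
    # Custom-format analogue: each table starts with a fresh compressor
    # that has no memory of the previous tables.
    separate = sum(len(lzma.compress(one)) for _ in range(tables))
    return shared, separate


if __name__ == "__main__":
    shared, separate = compare()
    print(f"one stream: {shared} bytes, separate streams: {separate} bytes")
```

The single stream should come out several times smaller, because every copy after the first can be encoded as references to earlier data. Exact sizes depend on the lzma build, so none are quoted here.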
Here's one where custom format *does* benefit, due to long-distance
repetition within a single table. The data is contrived, but the schema
of ID => data is not. What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.
pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267
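The effect of the match distance within a single stream can be sketched in a similar way. This is an analogy only: it uses Python's lzma module with two different dictionary sizes rather than zstd's long mode, which is a different mechanism, and rows_dump and compressed_size are hypothetical helper names. Each row repeats the same array of roughly 48 kB, so a 32 KiB dictionary cannot reach back to the previous row while an 8 MiB dictionary can.

```python
import lzma


def rows_dump(rows: int = 40, elems: int = 9_999) -> bytes:
    """Rows of 'id<TAB>{1,2,...,elems}', loosely like the ID => data table."""
    array = ",".join(str(j) for j in range(1, elems + 1))
    return "".join(f"{i}\t{{{array}}}\n" for i in range(1, rows + 1)).encode()


def compressed_size(data: bytes, dict_size: int) -> int:
    """Compress with LZMA2, limiting how far back matches may reach."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


if __name__ == "__main__":
    data = rows_dump()
    short = compressed_size(data, 32 * 1024)        # window shorter than a row
    long_ = compressed_size(data, 8 * 1024 * 1024)  # window spans many rows
    print(f"32 KiB window: {short} bytes, 8 MiB window: {long_} bytes")
```

With the larger dictionary, every row after the first is mostly a back-reference to the row before it, which is the kind of repetition that long-distance matching is meant to exploit.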
--
Justin