From: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Statistics Import and Export
Date: 2025-03-01 18:52:07
Message-ID: CAKAnmmJyXoVFttrdCz=R-oZnbcJifzSidP9Z2wHO=pU4-AEb0g@mail.gmail.com
Lists: pgsql-hackers
> Can you expand on some of those cases?
Certainly. I think one of the problems is that because this patch is
solving a pg_upgrade issue, the focus is on the "dump and restore"
scenarios. But pg_dump is used for much more than that, especially "dump
and examine".
Although pg_dump is meant to be a canonical, logical representation of your
schema and data, the stats add a non-deterministic element to that.
Statistical sampling is random, so pg_dump output changes with each run.
(Yes, COPY output can also change, but much less so, as I argue later.)
One use case is a program that is simply using pg_dump to verify that
nothing has modified your table data (I'll use a single table for these
examples, but obviously this applies to a whole database as well). So let's
say we create a table and populate it at time X, then check back at a later
time to verify things are still exactly as we left them.
dropdb gregtest
createdb gregtest
pgbench gregtest -i 2> /dev/null
pg_dump gregtest -t pgbench_accounts > a1
sleep 10
pg_dump gregtest -t pgbench_accounts > a2
diff a1 a2 | cut -c1-50
100078c100078
< 'histogram_bounds', '{2,964,1921,2917,3892,4935
---
> 'histogram_bounds', '{7,989,1990,2969,3973,4977
While COPY is not going to promise a particular output order, the order
should not change except in response to explicit actions: INSERT, UPDATE,
DELETE, TRUNCATE, VACUUM FULL, CLUSTER (off the top of my head). What should
not change the output is a background process gathering some metadata, or
someone running a database-wide ANALYZE.
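To see the ANALYZE case directly, a quick sketch (same pgbench table as
above; only the stats are touched, never the rows):

pg_dump gregtest -t pgbench_accounts > before
psql gregtest -c 'ANALYZE pgbench_accounts'   # resamples stats, changes no data
pg_dump gregtest -t pgbench_accounts > after
diff before after   # with stats in the dump, the histogram lines now differ

No row changed, yet the two dumps no longer match.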
Another use case is someone rolling out their schema to a QA box. All the
table definitions and data are checked into a git repository, with a
checksum. They want to roll it out, and then verify that everything is
exactly as they expect it to be. Or the program is part of a test suite
that does a sanity check that the database is in an exact known state
before starting.
(Our system catalogs are very difficult to work with when reverse
engineering objects. Thus, many programs rely on pg_dump to do the heavy
lifting for them. Parsing the text file generated by pg_dump is much easier
than trying to manipulate the system catalogs.)
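For example, grabbing one table's DDL is a one-liner with pg_dump, versus
stitching it together from pg_class, pg_attribute, pg_constraint, and
friends (the database and table names here are just placeholders):

pg_dump mydb --schema-only -t mytable > mytable.sql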
So let's say the process is to create a new database, load things into it,
and then checksum the result. We can simulate that with pgbench:
dropdb qa1; dropdb qa2
createdb qa1; createdb qa2
pgbench qa1 -i 2>/dev/null
pgbench qa2 -i 2>/dev/null
pg_dump qa1 > dump1; pg_dump qa2 > dump2
$ md5sum dump1
39a2da5e51e8541e9a2c025c918bf463 dump1
This md5sum does not match our repo! It doesn't even match the other one:
$ md5sum dump2
4a977657dfdf910cb66c875d29cfebf2 dump2
It's the stats, of course, which have added a dose of randomness that was
not there before and made our checksums useless:
$ diff dump1 dump2 | cut -c1-50
100172c100172
< 'histogram_bounds', '{1,979,1974,2952,3973,4900
---
> 'histogram_bounds', '{8,1017,2054,3034,4045,513
With --no-statistics, the diff shows no difference, and the md5sum is
always the same.
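For completeness, the fixed-up version of the checksum step (same qa1/qa2
databases as above):

pg_dump --no-statistics qa1 > dump1
pg_dump --no-statistics qa2 > dump2
md5sum dump1 dump2   # the two checksums now match, run after run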
Just to be clear, I love this patch, and I love the fact that one of our
major upgrade warts is finally getting fixed. I've tried fixing it myself a
few times over the last decade or so, but lacked the skills to do so. :) So
I am thrilled to have this finally done. I just don't think it should be
enabled by default for everything using pg_dump. For the record, I would
not strongly object to having stats on by default for binary dumps,
although I would prefer them off.
So why not just expect people to modify their programs to use
--no-statistics for cases like this? That's certainly an option, but it's
going to break a lot of existing things, and create branching code:
old code:
pg_dump mydb -f pg.dump
new code:
if pg_dump.version >= 18
pg_dump --no-statistics mydb -f pg.dump
else
pg_dump mydb -f pg.dump
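In shell form, that branch might look something like this (a sketch; it
assumes pg_dump's usual "pg_dump (PostgreSQL) 18.x" version banner):

major=$(pg_dump --version | awk '{print $3}' | cut -d. -f1)
if [ "$major" -ge 18 ]; then
    pg_dump --no-statistics mydb -f pg.dump
else
    pg_dump mydb -f pg.dump
fi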
Also, anything trained to parse pg_dump output will have to learn about the
new SELECT pg_restore_ calls with their multi-line format. (I'm not 100%
sure we don't already emit multi-line statements somewhere, as things like
"SELECT setval" and "SELECT set_config" are single-line, but there may be
existing exceptions.)
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support