From: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Statistics Import and Export
Date: 2025-03-01 18:52:07
Message-ID: CAKAnmmJyXoVFttrdCz=R-oZnbcJifzSidP9Z2wHO=pU4-AEb0g@mail.gmail.com
Lists: pgsql-hackers
> Can you expand on some of those cases?
Certainly. I think one of the problems is that because this patch is
solving a pg_upgrade issue, the focus is on the "dump and restore"
scenarios. But pg_dump is used for much more than that, especially "dump
and examine".
Although pg_dump is meant to be a canonical, logical representation of your
schema and data, the stats add a non-deterministic element to that.
Statistical sampling is random, so pg_dump output changes with each run.
(Yes, COPY output can also change, but much less so, as I argue later.)
One use case is a program that is simply using pg_dump to verify that
nothing has modified your table data (I'll use a single table for these
examples, but obviously this applies to a whole database as well). So let's
say we create a table and populate it at time X, then check back at a later
time to verify things are still exactly as we left them.
dropdb gregtest
createdb gregtest
pgbench gregtest -i 2> /dev/null
pg_dump gregtest -t pgbench_accounts > a1
sleep 10
pg_dump gregtest -t pgbench_accounts > a2
diff a1 a2 | cut -c1-50
100078c100078
< 'histogram_bounds', '{2,964,1921,2917,3892,4935
---
> 'histogram_bounds', '{7,989,1990,2969,3973,4977
While COPY is not going to promise a particular output order, the order
should not change except in response to explicit actions: INSERT, UPDATE,
DELETE, TRUNCATE, VACUUM FULL, CLUSTER (off the top of my head). What should
not change the output is a background process gathering some metadata, or
someone running a database-wide ANALYZE.
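To see the ANALYZE case directly, a quick sketch (same pgbench table as
above; only the stats are touched, never the rows):

pg_dump gregtest -t pgbench_accounts > before
psql gregtest -c 'ANALYZE pgbench_accounts'   # resamples stats, changes no data
pg_dump gregtest -t pgbench_accounts > after
diff before after   # with stats in the dump, the histogram lines now differ

No row changed, yet the two dumps no longer match.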
Another use case is someone rolling out their schema to a QA box. All the
table definitions and data are checked into a git repository, with a
checksum. They want to roll it out, and then verify that everything is
exactly as they expect it to be. Or the program is part of a test suite
that does a sanity check that the database is in an exact known state
before starting.
(Our system catalogs are very difficult to work with when reverse
engineering objects. Thus, many programs rely on pg_dump to do the heavy
lifting for them. Parsing the text file generated by pg_dump is much easier
than trying to manipulate the system catalogs.)
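For example, grabbing one table's DDL is a one-liner with pg_dump, versus
stitching it together from pg_class, pg_attribute, pg_constraint, and
friends (the database and table names here are just placeholders):

pg_dump mydb --schema-only -t mytable > mytable.sql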
So let's say the process is to create a new database, load things into it,
and then checksum the result. We can simulate that with pgbench:
dropdb qa1; dropdb qa2
createdb qa1; createdb qa2
pgbench qa1 -i 2>/dev/null
pgbench qa2 -i 2>/dev/null
pg_dump qa1 > dump1; pg_dump qa2 > dump2
$ md5sum dump1
39a2da5e51e8541e9a2c025c918bf463 dump1
This md5sum does not match our repo! It doesn't even match the other one:
$ md5sum dump2
4a977657dfdf910cb66c875d29cfebf2 dump2
It's the stats, of course, which have added a dose of randomness that was
not there before and made our checksums useless:
$ diff dump1 dump2 | cut -c1-50
100172c100172
< 'histogram_bounds', '{1,979,1974,2952,3973,4900
---
> 'histogram_bounds', '{8,1017,2054,3034,4045,513
With --no-statistics, the diff shows no difference, and the md5sum is
always the same.
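For completeness, the fixed-up version of the checksum step (same qa1/qa2
databases as above):

pg_dump --no-statistics qa1 > dump1
pg_dump --no-statistics qa2 > dump2
md5sum dump1 dump2   # the two checksums now match, run after run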
Just to be clear, I love this patch, and I love the fact that one of our
major upgrade warts is finally getting fixed. I've tried fixing it myself a
few times over the last decade or so, but lacked the skills to do so. :) So
I am thrilled to have this finally done. I just don't think it should be
enabled by default for everything using pg_dump. For the record, I would
not strongly object to having stats on by default for binary dumps,
although I would prefer them off.
So why not just expect people to modify their programs to use
--no-statistics for cases like this? That's certainly an option, but it's
going to break a lot of existing things, and create branching code:
old code:
pg_dump mydb -f pg.dump
new code:
if pg_dump.version >= 18
pg_dump --no-statistics mydb -f pg.dump
else
pg_dump mydb -f pg.dump
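In shell form, that branch might look something like this (a sketch; it
assumes pg_dump's usual "pg_dump (PostgreSQL) 18.x" version banner):

major=$(pg_dump --version | awk '{print $3}' | cut -d. -f1)
if [ "$major" -ge 18 ]; then
    pg_dump --no-statistics mydb -f pg.dump
else
    pg_dump mydb -f pg.dump
fi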
Also, anything trained to parse pg_dump output will have to learn about the
new SELECT pg_restore_ calls with their multi-line format. (I'm not 100%
sure we don't already emit multi-line statements somewhere, as things like
"SELECT setval" and "SELECT set_config" are single-line, but there may be
existing exceptions.)
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support