Re: Statistics Import and Export

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, jian he <jian(dot)universality(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org
Subject: Re: Statistics Import and Export
Date: 2025-04-01 22:05:59
Message-ID: 88af35fe0ff24cc4e0700b841aa60a0865f11648.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2025-04-01 at 13:44 -0500, Nathan Bossart wrote:
> Apologies for the noise.  I noticed one more way to simplify 0002. 
> As
> before, there should be no functional differences.

To restate the problem: one of the problems being solved here is that
the existing code for custom-format dumps calls WriteToc twice. That
was not a big problem before this patch, when the contents of the
entries were easily accessible in memory. But the point of 0002 is to
avoid keeping all of the stats in memory at once, because that causes
bloat, and instead to query them on demand.

In theory, we could fix the pre-existing code by making the second pass
able to jump over the other contents of the entry and just update the
data offsets. But that seems invasive, at least to do it properly.

0001 sidesteps the problem by skipping the second pass when data is not
being dumped (because then there are no data offsets that need
updating). The worst case is when there are a lot of objects with a
small amount of data. But that's a worst case for stats in general, so
I don't think it needs to be solved here.

Issuing the stats queries twice is not great, though. If there's any
non-deterministic output in the query, that could lead to strangeness.
How bad can that be? If the results change in some way that looks
benign, but changes the length of the definition string, can that
corrupt a ToC entry? I'm not saying there's a problem; I'm trying to
understand the risk of future problems.
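
To make the concern concrete, here's a toy sketch (Python, with an
invented serialization, not pg_dump's actual ToC layout) of how a
definition string that changes length between the two passes shifts
the position of every later entry:

```python
import io

def write_toc(buf, entries):
    # Offsets are written fixed-width, so patching an offset alone
    # would not move data; only the definition strings vary in length.
    buf.seek(0)
    positions = []
    for name, off in entries:
        positions.append(buf.tell())
        buf.write(f"{name}|{off:010d};".encode())
    return positions

buf = io.BytesIO()

# First pass: offsets not yet known, written as zero.
pos1 = write_toc(buf, [("stats for t1", 0), ("stats for t2", 0)])

# Second pass: the same query is re-issued; suppose it returns a
# slightly different (longer) definition for the first entry.
pos2 = write_toc(buf, [("stats for t1 (changed)", 123),
                       ("stats for t2", 456)])

# The second entry no longer starts where the first pass put it.
assert pos1[1] != pos2[1]
```

If offsets are patched at positions recorded during the first pass,
any such shift would mean the patch lands inside unrelated bytes.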

0003 makes an assumption about the way the scan happens in WriteToc().
Can you add some additional sanity checks to verify that nothing
happens in a different order than we expect?

Also, why do we need the clause "WHERE s.tablename = ANY($2)"? Isn't
that already implied by "JOIN unnest($1, $2) ... s.tablename =
u.tablename"?
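
For what it's worth, the implication can be checked mechanically. A
small Python sketch with made-up data, treating unnest($1, $2) as
pairing schemas and tables positionally, like zip():

```python
# $1 = schema names, $2 = table names, paired positionally.
schemas = ["public", "public", "audit"]
tables  = ["orders", "users", "orders"]

# Rows of a hypothetical stats view: (schemaname, tablename, stat).
stats_rows = [
    ("public", "orders", 1),
    ("public", "users", 2),
    ("audit",  "orders", 3),
    ("audit",  "users", 4),   # not among the requested pairs
]

# JOIN unnest($1, $2) u ON s.schemaname = u.schemaname
#                      AND s.tablename = u.tablename
pairs = set(zip(schemas, tables))
joined = [r for r in stats_rows if (r[0], r[1]) in pairs]

# WHERE s.tablename = ANY($2), applied on top of the join.
filtered = [r for r in joined if r[1] in set(tables)]

assert filtered == joined  # the extra predicate removes nothing
```

Any row that survives the pairwise join already has its tablename
equal to some element of $2, so the extra WHERE clause can't filter
anything further, unless it is there to help the planner choose a
better path.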

Regards,
Jeff Davis
