Re: Statistics Import and Export

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Statistics Import and Export
Date: 2024-03-31 18:48:19
Message-ID: 790773.1711910899@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

My apologies for having paid so little attention to this thread for
months. I got around to reading the v15 patches today, and while
I think they're going in more or less the right direction, there's
a long way to go IMO.

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute. (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go. A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof. What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters. The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like. As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use, and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
inherited bool, source_version int,
variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
false, -- not inherited
16, -- source server major version
-- pairs of labels and values follow
'null_frac', 0.4,
'avg_width', 42,
'histogram_bounds',
array['a', 'b', 'c']::text[],
...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks. If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation). This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on. This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function. (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

If we write the labels as undecorated string literals as I show
above, I think they will arrive at the function as "unknown"-type
constants, which is a little weird but doesn't seem like it's
really a big problem. The alternative is to cast them all to text
explicitly, but that's adding notation to no great benefit IMO.

pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too). If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I've not read the patches in great detail, but I did note a
few low-level issues:

* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct. Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?) That all looks
overly complicated and duplicative. I think you don't need two
layers of function there.

* I find the stuff with enums and "const char *param_names" to
be way too cute and unlike anything we do elsewhere. Please
don't invent your own notations for coding patterns that have
hundreds of existing instances. pg_set_relation_stats, for
example, has absolutely no reason not to look like the usual

Oid relid = PG_GETARG_OID(0);
float4 relpages = PG_GETARG_FLOAT4(1);
... etc ...

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all. (Maybe they don't, see
next item.)

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try. As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely? You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all. (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.) The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA". I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations. We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries. But I don't think we need to.

* dumpRelationStats seems to be dumping the relation-level
stats twice.

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2024-03-31 18:57:57 Re: Add column name to error description
Previous Message Tom Lane 2024-03-31 18:41:00 Re: Statistics Import and Export