Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, alvherre(at)alvh(dot)no-ip(dot)org
Subject: Re: Statistics Import and Export
Date: 2024-07-23 04:20:32
Message-ID: CADkLM=evyx0L96XxrLSUDVS+iXgHszLbWMrGwVbBnNVO1ejfJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> * pg_set_relation_stats(): the warning: "cannot export statistics
> prior to version 9.2" doesn't make sense because the function is for
> importing. Reword.
>

+1

> * I really think there should be a transactional option, just another
> boolean, and if it has a default it should be true. This clearly has
> use cases for testing plans, etc., and often transactions will be the
> right thing there. This should be a trivial code change, and it will
> also be easier to document.
>

For it to have a default, the parameter would have to be at the end of the
list, and it's a parameter list that will grow in the future. And when that
happens we have a jumbled parameter list, which is fine if we only ever
call params by name, but I know some people won't do that. Which means it's
up front right after `version`. Since `version` is already in there, and we
can't default that, I feel ok about moving it there, but alas no default.

If there was some way that the function could detect that it was in a
binary upgrade, then we could use that to determine if it should update
inplace or transactionally.

* The return type is documented as 'void'? Please change to bool and
> be clear about what true/false returns really mean. I think false means
> "no updates happened at all, and a WARNING was printed indicating why"
> whereas true means "all updates were applied successfully".
>

Good point, that's a holdover.

> * An alternative would be to have an 'error_ok' parameter to say
> whether to issue WARNINGs or ERRORs. I think we already discussed that
> and agreed on the boolean return, but I just want to confirm that this
> was a conscious choice?
>

That had been discussed as well. If we're adding parameters, then we could
add one for that too. It's making the function call progressively more
unwieldy, but anyone who chooses to wield these on a regular basis can
certainly write a SQL wrapper function to reduce the function call to their
presets, I suppose.

> * tests should be called stats_import.sql; there's no exporting going
> on
>

Sigh. True.

> * Aside from the above comments and some other cleanup, I think this
> is a simple patch and independently useful. I am looking to commit this
> one soon.
>
> v24-0002:
>
> * Documented return type is 'void'
>
> * I'm not totally sure what should be returned in the event that some
> updates were applied and some not. I'm inclined to say that true should
> mean that all updates were applied -- otherwise it's hard to
> automatically detect some kind of typo.
>

Me either. Suggestions welcome.

I suppose we could return two integers: number of stats input, and number
of stats applied. But that could be confusing, as some parameter pairs form
one stat ( MCV, ELEM_MCV, etc).

I suppose we could return a set of (param_name text, was_set boolean,
applied boolean), without trying to organize them into their pairs, but
that would get really verbose.

We should decide on something soon, because we'd want relation stats to
follow a similar signature.

>
> * Can you describe your approach to error checking? What kinds of
> errors are worth checking, and which should we just put into the
> catalog and let the planner deal with?
>

1. When the parameters given make for something nonsensical Such as
providing most_common_elems with no corresponding most_common_freqs, then
you can't form an MCV stat, so you must throw out the one you did receive.
That gets a warning.

2. When the data provided is antithetical to the type of statistic. For
instance, most array-ish parameters can't have NULL values in them (there
are separate stats for nulls (null-frac, empty_frac). I don't remember if
doing so crashes the server or just creates a hard error, but it's a big
no-no, and we have to reject such stats, which for now means a warning and
trying to carry on with the stats that remain.

3. When the stats provided would overflow the data structure. We attack
this from two directions: First, we eliminate stat kinds that are
meaningless for the data type (scalars can't have most-common-elements,
only ranges can have range stats, etc), issue warnings for those and move
on with the remaining stats. If, however, the number of those statkinds
exceeds the number of statkind slots available, then we give up because now
we'd have to CHOOSE which N-5 stats to ignore, and the caller is clearly
just having fun with us.

We let the planner have fun with other error-like things:

1. most-common-element arrays where the elements are not sorted per spec.

2. frequency histograms where the numbers are not monotonically
non-increasing per spec.

3. frequency histograms that have corresponding low bound and high bound
values embedded in the array, and the other values in that array must be
between the low-high.

>
> * I'd check stakindidx at the time that it's incremented rather than
> summing boolean values cast to integers.
>

Which means that we're checking that and potentially raising the same error
in 3-4 places (and growing, unless we raise the max slots), rather than 1.
That struck me as worse.

>
> v24-0003:
>
> * I'm not convinced that we should continue when a stat name is not
> text. The argument for being lenient is that statistics may change over
> time, and we might have to ignore something that can't be imported from
> an old version into a new version because it's either gone or the
> meaning has changed too much. But that argument doesn't apply to a
> bogus call, where the name/value pairs get misaligned or something.
>

I agree with that.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-07-23 04:21:10 Re: Allow logical failover slots to wait on synchronous replication
Previous Message jian he 2024-07-23 04:03:57 Re: Virtual generated columns