Re: Extended Statistics set/restore/clear functions.

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Extended Statistics set/restore/clear functions.
Date: 2025-01-23 20:15:53
Message-ID: b4ba8ed9-f1b3-4b6b-8612-ff42d65c18c6@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/23/25 15:51, Corey Huinker wrote:
> On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas(at)vondra(dot)me
> <mailto:tomas(at)vondra(dot)me>> wrote:
>
> Hi,
>
> Thanks for continuing to work on this.
>
> On 1/22/25 19:17, Corey Huinker wrote:
> > This is a separate thread for work started in [1] but focused
> purely on
> > getting the following functions working:
> >
> > * pg_set_extended_stats
> > * pg_clear_extended_stats
> > * pg_restore_extended_stats
> >
> > These functions are analogous to their relation/attribute
> counterparts,
> > use the same calling conventions, and build upon the same basic
> > infrastructure.
> >
> > I think it is important that we get these implemented because they
> close
> > the gap that was left in terms of the ability to modify existing
> > statistics and to round out the work being done to carry over
> statistics
> > via dump/restore and pg_upgrade i [1].
> >
> > The purpose of each patch is as follows (adapted from previous
> thread):
> >
> > 0001 - This makes the input function for pg_ndistinct functional.
> >
> > 0002 - This makes the input function for pg_dependencies functional.
> >
>
> I only quickly skimmed the patches, but a couple comments:
>
>
> 1) I think it makes perfect sense to use the JSON parsing for the input
> functions, but maybe it'd be better to adjust the format a bit to make
> that even easier?
>
> Right now the JSON "keys" have structure, which means we need some ad
> hoc parsing. Maybe we should make it "proper JSON" by moving that into
> separate key/value, e.g. for ndistinct we might replace this:
>
>   {"1, 2": 2323, "1, 3" : 3232, ...}
>
> with this:
>
>   [ {"keys": [1, 2], "ndistinct" : 2323},
>     {"keys": [1, 3], "ndistinct" : 3232},
>     ... ]
>
> so a regular JSON array of objects, with keys an "array". And similarly
> for dependencies.
>
>
> That is almost exactly what I did back when the stats import functions
> took a nested JSON argument.
>
> The biggest problem with changing that format is that the old format
> would still show up in the system being exported, so we would have to
> process that format as well as the new one.
>  
>
> Yes, it's more verbose, but maybe better for "mechanical" processing?
>
>
> It absolutely would be better for processing, but we'd still have to
> read the old format from older systems.  I suppose the pg_dump code
> could do some SQL gymnastics to convert the old json-but-sad format into
> the processing-friendly format of the future, and I could easily adapt
> what I've already written over a year ago to that task. I suppose it's
> just a matter of having the community behind changing the output format
> to enable a better input format.
>

D'oh! I always forget about the backwards compatibility issue, i.e. that
we still need to ingest values from already released versions. Yeah,
that makes the format change less beneficial.

>  
>
> 2) Do we need some sort of validation? Perhaps this was discussed in the
> other thread and I missed that, but isn't it a problem that happily
> accept e.g. this?
>
>   {"6666, 6666" : 1, "1, -222": 14, ...}
>
> That has duplicate keys with bogus attribute numbers, stats on (bogus)
> system attributes, etc. I suspect this may easily cause problems during
> planning (if it happens to visit those statistics).
>
>
> We used to have _lots_ of validation for data quality issues, much of
> which was removed at the request of reviewers. However, much of that
> discussion was about the health of the statistics, but these are
> standalone data types, maybe they're held to a higher standard. If so,
> what sort of checks do you think would be needed and/or wanted?
>
> So far, I can imagine the following:
>
> * no negative attnums in key list
> * no duplicate attnums in key list
>
> anything else?
>

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.

At this point I was really thinking only about validating the attnums,
i.e. to make sure it's a valid attribute in the table / statistics. That
is something the pg_set_attribute_stats() enforce too, thanks to having
a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats
either (and I'd bet one could cause a lot of "fun" this way).

>
> Maybe that's acceptable - ultimately the user could import something
> broken in a much subtler way, of course. But the pg_set_attribute_stats
> seems somewhat more protected against this, because it gets the attr as
> a separate argument.
>
>
> The datatype itself is in isolation, but once we've created a valid
> pg_ndistinct or pg_dependencies, there's nothing stopping us from
> validating the values in the datatype against the statistics object and
> the relation it belongs to, but that might get the same resistance that
> I got to say, ensuring that frequency lists were monotonically decreasing.
>  

Understood. IMHO it's fine to say we're not validating the statistics
are "consistent" but I think we should check it matches the definition.

>
> I recall I wished to have the attnum in the output function, but that
> was not quite possible because we don't know the relid (and thus the
> descriptor) in that function.
>
> Is it a good idea to rely on the input/output format directly? How will
> that deal with cross-version differences? Doesn't it mean the in/out
> format is effectively fixed, or at least has to be backwards compatible
> (i.e. new version has to handle any value from older versions)?
>
>
> Presently there are no cross-version differences, though earlier I
> address the pros and cons of changing it. No matter what, the burden of
> having a valid format is on the user in the case of
> pg_set_extended_stats() and pg_restore_extended_stats() has a server
> version number associated, so the option of handling a format change
> could be baked in, but then we're doing version tests and input
> typecasts like we do with ANYARRAY types. Not impossible, but definitely
> more work.
>

OK, makes sense.
 
>
> Or what if I want to import the stats for a table with slightly
> different structure (e.g. because dump/restore skips dropped columns).
> Won't that be a problem with the format containing raw attnums? Or is
> this a use case we don't expect to work?
>
>
> The family of pg_set_*_stats functions expect the input to be meaningful
> and correct for that relation on that server version. Any attnum
> translation would have to be done by the user to adapt to the new or
> changed relation.
>
> The family of pg_restore_*_stats functions are designed to be forward
> compatible, and to work across versions but for basically the same
> relation or relation of the same shape. Basically, they're for
> pg_restore and pg_upgrade, so no changes in attnums would be expected.
>  
>

OK

>
> For the per-attribute stats it's probably fine, because that's mostly
> just a collection of regular data types (scalar values or arrays of
> values, ...) and we're not modifying them except for maybe adding new
> fields. But extended stats seem more complex, so maybe it's different?
>
>
> I had that working by attname matching way back in the early days, but
> it would involve creating an intermediate format for translating the
> attnums to attnames on export, and then re-translating them on the way
> back in.
>
> I suppose someone could write the following utility functions
>
>     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -
>> json
>     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
> pg_ndistinct
>
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different one.
>
>

OK
 
>
>
> I remember a long discussion about the format at the very beginning of
> this patch series, and the conclusion clearly was to have a function
> that import stats for one attribute at a time. And that seems to be
> working fine, but extended stats values have more internal structure, so
> perhaps they need to do something more complicated.
>
>
> I believe this *is* doing something more complicated, especially with
> the MCVList, though I was very pleased to see how much of the existing
> infrastructure I was able to reuse.
>
>

OK

>  
>
>
> > 0003 - Makes several static functions in attribute_stats.c public
> for use
> > by extended stats. One of those is get_stat_attr_type(), which in
> the last
> > patchset was modified to take an attribute name rather than
> attnum, thus
> > saving a syscache lookup. However, extended stats identifies
> attributes by
> > attnum not name, so that optimization had to be set aside, at least
> > temporarily.
> >
> > 0004 - These implement the functions pg_set_extended_stats(),
> > pg_clear_extended_stats(), and pg_restore_extended_stats() and
> behave like
> > their relation/attribute equivalents. If we can get these
> committed and
> > used by pg_dump, then we don't have to debate how to handle post-
> upgrade
> > steps for users who happen to have extended stats vs the approximately
> > 99.75% of users who do not have extended stats.
> >
>
> I see there's a couple MCV-specific functions in the extended_stats.c.
> Shouldn't those go into mvc.c instead?
>
>
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was. I'll make an effort to relocate it to mcv.c.
>
> For that matter, it might make sense to break out the expressions code
> into its own file, because every other stat attribute has its own.
> Thoughts on that?
>  

+1 to that, if it reduced unnecessary code duplication

>
>
> FWIW there's a bunch of whitespace issues during git apply.
>
>
> +1
>  
>
>
> OK. Thanks for the patch!
>
>
> Thanks for the feedback, please keep it coming. I think it's really
> important that extended stats, though used rarely, be included in our
> dump/restore/upgrade changes so as to make for a more consistent user
> experience.
>  

I agree, and I appreciate you working on it.

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari Mannsåker 2025-01-23 20:25:45 Re: pg_createsubscriber TAP test wrapping makes command options hard to read.
Previous Message Andrew Dunstan 2025-01-23 20:12:22 Re: why -Fdance archive format option works with ./pg_restore but not with ./pg_dump?