Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, alvherre(at)alvh(dot)no-ip(dot)org
Subject: Re: Statistics Import and Export
Date: 2024-06-04 03:34:51
Message-ID: CADkLM=dNQzmdsWfXVVDBqv7gHsa-7dk3v13AqrAzBoJWTMrWEA@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 16, 2024 at 2:26 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
> >
> > Per previous comments, it was suggested by others that:
> >
> > - having them in SECTION_NONE was a grave mistake
> > - Everything that could belong in SECTION_DATA should, and the rest
> > should be in SECTION_POST_DATA
>
> I don't understand the gravity of the choice here: what am I missing?
>
> To be clear: I'm not arguing against it, but I'd like to understand it
> better. Perhaps it has to do with the relationship between the sections
> and the dependencies?
>

I'm with you. I don't understand the choice either and would like to, but at the
same time the patch now works the way others strongly suggested it should, so
I'm still curious about the why.

There were several people expressing interest in this patch at pgconf.dev,
so I thought I'd post a rebase and give a summary of things to date.

THE INITIAL GOAL

The initial goal of this effort was to reduce upgrade downtimes by
eliminating the need for the vacuumdb --analyze-in-stages call that is
recommended (but often skipped) after a pg_upgrade. The analyze-in-stages
step is usually by far the longest part of a binary upgrade, and it is a
significant part of a restore from dump, so eliminating it will save users
time and remove (or greatly reduce) a potential upgrade pitfall, thus
reducing upgrade friction (read: excuses not to upgrade).

THE FUNCTIONS

These patches introduce two functions, pg_set_relation_stats() and
pg_set_attribute_stats(), which allow the caller to modify the statistics
of any relation, provided that they own that relation or have the MAINTAIN
privilege on it.

The function pg_set_relation_stats looks like this:

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
    150000::integer,
    'relpages', 17::integer,
    'reltuples', 400.0::real,
    'relallvisible', 4::integer);

The function takes the oid of the relation whose stats are to be imported, a
version number (SERVER_VERSION_NUM) identifying the source of the
statistics, and then a series of varargs organized as name-value pairs.
Currently, three pairs (relpages, reltuples, and relallvisible) are required
for the row to be properly set. If all three are not present, the function
issues a warning and the row is not updated.
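For example, here is a minimal sketch of a call that omits one of the
required pairs; the warning text is my paraphrase, not the actual message:

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
    150000::integer,
    'relpages', 17::integer,
    'reltuples', 400.0::real);
-- WARNING (paraphrased): relallvisible was not specified
-- The pg_class row is left unchanged.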

The choice of varargs is a defensive one, basically ensuring that a
pg_dump that includes statistics import calls will not fail on a future
version that lacks one or more of these values. The call itself would fail
to modify the relation row, but it wouldn't cause the whole restore to
fail. I'm personally not against having a fixed-arg version of this
function, nor am I against having both at the same time, with the varargs
version basically teeing up the fixed-param call appropriate for the
destination server version.

This function does an in-place update of the pg_class row to avoid bloating
pg_class, just like ANALYZE does. This means that this function call is
NON-transactional.
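A sketch of what that means in practice (reusing the values from the
example above):

BEGIN;
SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
    150000::integer,
    'relpages', 17::integer,
    'reltuples', 400.0::real,
    'relallvisible', 4::integer);
ROLLBACK;

-- The in-place update is not undone by the rollback:
SELECT relpages, reltuples, relallvisible
  FROM pg_class
 WHERE oid = 'stats_export_import.test'::regclass;
--  relpages | reltuples | relallvisible
-- ----------+-----------+---------------
--        17 |       400 |             4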

The function pg_set_attribute_stats looks like this:

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.5::real,
    'avg_width', 2::integer,
    'n_distinct', -0.1::real,
    'most_common_vals', '{2,1,3}'::text,
    'most_common_freqs', '{0.3,0.25,0.05}'::real[]
);

Like the first function, it takes a relation oid and a source server
version, though here the version argument is in the fourth position. It
also takes the name of an attribute and a boolean indicating whether these
stats are inherited (true) or regular (false). Again, what follows is a
vararg list of name-value pairs, each name corresponding to an attribute of
pg_stats and expecting a value appropriate for that attribute. Note that
ANYARRAY values are passed in as text. This is done for a few reasons.
First, if the attribute is an array type, then the most_common_vals value
will be an array of that array type, and there is no way to represent that
in SQL (it instead gives a higher-order array of the same base type).
Second, it allows us to import the values with a simple array_in() call.
Last, it allows for situations where the type name changed from source
system to destination (example: a schema-qualified extension type gets
moved to core).
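As a hypothetical illustration of the array-type case (column name and
values invented), for a column of type integer[] the most common values are
themselves arrays, so they travel as the text form of the anyarray:

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'tags'::name,          -- hypothetical integer[] column
    false::boolean,
    150000::integer,
    'null_frac', 0.0::real,
    'avg_width', 32::integer,
    'n_distinct', -0.5::real,
    'most_common_vals', '{"{1,2}","{3,4}"}'::text,
    'most_common_freqs', '{0.6,0.2}'::real[]
);

An SQL literal of integer[] values collapses into a two-dimensional integer
array rather than an array of arrays, which is why the text form is used
here.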

There are lots of ways that this function call can go wrong: an invalid
attribute name, an invalid parameter name in a name-value pair, an invalid
data type for the value of a name-value pair, or type coercion errors in
array_in(), to name just a few. All of these errors result in a warning and
the import failing, but the function completes normally. Internal
typecasting and array_in() are all done with the _safe() equivalents, and
any such errors are re-emitted as warnings. The central goal here is to not
make a restore fail just because the statistics are wonky.
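A sketch of the failure mode (the value is deliberately bogus and the
warning text is a paraphrase):

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.5::real,
    'most_common_freqs', '{0.3,banana,0.05}'::text
);
-- WARNING (paraphrased): invalid input syntax for type real: "banana"
-- The stats for this attribute are not imported, but the statement,
-- and thus the restore around it, completes normally.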

Calls to pg_set_attribute_stats() are transactional. This wouldn't warrant
mentioning if not for pg_set_relation_stats() being non-transactional.

DUMP / RESTORE / UPGRADE

The code for pg_dump/restore/upgrade has been modified to allow statistics
to be exported/imported by default. There is a flag to prevent this
(--no-statistics) and a flag to ONLY do statistics (--statistics-only), the
utility of which will be discussed later.
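A quick usage sketch (database name invented):

pg_dump --no-statistics mydb > dump_without_stats.sql
pg_dump --statistics-only mydb > stats_only.sql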

pg_dump will make queries of the source database, adjusting the syntax to
reflect the version of the source system. There is very little variance in
those queries, so it should be possible to query as far back as 9.2 and get
usable stats. The output of these calls will be a series of SELECT
statements, each one making a call to either pg_set_relation_stats (one per
table/index/matview) or pg_set_attribute_stats (one per attribute that had
a matching pg_statistic row).
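To illustrate the shape of that output, a dump fragment might look roughly
like this (relation name and values invented):

SELECT pg_catalog.pg_set_relation_stats('public.orders'::regclass,
    150000::integer,
    'relpages', 1204::integer,
    'reltuples', 250000.0::real,
    'relallvisible', 1200::integer);

SELECT pg_catalog.pg_set_attribute_stats('public.orders'::regclass,
    'order_id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.0::real,
    'avg_width', 4::integer,
    'n_distinct', -1.0::real);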

The positioning of these calls in the restore sequence was originally set
up as SECTION_NONE, but it was strongly suggested that SECTION_DATA /
SECTION_POST_DATA was the right spot instead, and that's where they
currently reside.

The end result is that the new database has statistics identical (or at
least close) to those of the source system. Those statistics might be good
or bad, but they're almost certainly better than no stats at all. Even if
they are bad, they will be overwritten by the next ANALYZE or autovacuum.

WHAT IS NOT DONE

1. Extended Statistics, which are considerably more complex than regular
stats (stxdexprs is itself an array of pg_statistic rows) and thus more
difficult to express in a simple function call. They are also used fairly
rarely in customer installations, so leaving them out of the v1 patch
seemed like an easy trade-off.

2. Any sort of validity checking beyond data types. This was initially
provided, verifying that array values representing frequencies fall between
0.0 and 1.0, that arrays representing most-common-value frequencies are in
monotonically non-increasing order, etc., but these checks were rejected as
being overly complex, potentially rejecting valid stats, and getting in the
way of another use I hadn't considered.

3. Export functions. Strictly speaking we don't need them, but some
use-cases described below may make the case for including them.

OTHER USES

Usage of these functions is not restricted to upgrade/restore situations.
The most obvious other use is to experiment with how the planner behaves
when one or more tables grow and/or skew. It is difficult to create a table
with 10 billion rows in it, but it's now trivial to create a table that
says it has 10 billion rows in it.
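For instance (hypothetical table; exact plans will vary):

CREATE TABLE big (id integer PRIMARY KEY, payload text);

-- Pretend the table holds 10 billion rows across 50 million pages:
SELECT pg_set_relation_stats('big'::regclass,
    150000::integer,
    'relpages', 50000000::integer,
    'reltuples', 10000000000.0::real,
    'relallvisible', 0::integer);

EXPLAIN SELECT * FROM big WHERE id = 42;
-- The planner now costs this query as if the table were enormous,
-- favoring the index scan it might not bother with on a tiny table.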

This can be taken a step further, and in a way I had not anticipated -
actively stress-testing the planner by inserting wildly incorrect and/or
nonsensical stats. In that sense, these functions are a fuzzing tool that
happens to make upgrades go faster.

FUTURE PLANS

Integration with postgres_fdw is an obvious next step, allowing an ANALYZE
on a foreign table, instead of asking for a remote row sample, to simply
fetch the stats of the remote table and import them into the foreign table.

Extended Statistics.

CURRENT PROGRESS

I believe that all outstanding questions/requests have been addressed, and
the patch is now back to needing a review.

FOR YOUR CONSIDERATION

Rebase (current as of f04d1c1db01199f02b0914a7ca2962c531935717) attached.

Attachment Content-Type Size
v22-0001-Create-pg_set_relation_stats-pg_set_attribute_st.patch text/x-patch 109.4 KB
v22-0002-Add-derivative-flags-dumpSchema-dumpData.patch text/x-patch 18.5 KB
v22-0003-Enable-dumping-of-table-index-stats-in-pg_dump.patch text/x-patch 33.9 KB
