Re: Statistics Import and Export

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org
Subject: Re: Statistics Import and Export
Date: 2024-11-18 19:47:24
Message-ID: ZzuZzDMxe3ClqAgF@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 8, 2024 at 01:25:21PM -0500, Corey Huinker wrote:
> WHAT IS NOT DONE - EXTENDED STATISTICS
>
> It is a general consensus in the community that "nobody uses extended
> statistics", though I've had difficulty getting actual figures to back this
> up, even from my own employer. Surveying several vendors at PgConf.EU, the
> highest estimate was that at most 1% of their customers used extended
> statistics, though more probably should. This reinforces my belief that a
> feature that would eliminate a major pain point in upgrades for 99% of
> customers shouldn't be held back by the fact that the other 1% only have a
> reduced hassle.
>
> However, having relation and attribute statistics carry over on major
> version upgrades presents a slight problem: running vacuumdb
> --analyze-in-stages after such an upgrade is completely unnecessary for
> those without extended statistics, and would actually result in _worse_
> statistics for the database until the last stage is complete. Granted,
> we've had great difficulty getting users to know that vacuumdb is a thing
> that should be run, but word has slowly spread through our own
> documentation and those "This one simple trick will make your postgres go
> fast post-upgrade" blog posts. Those posts will continue to lurk in search
> results long after this feature goes into release, and it would be a rude
> surprise to users to find out that the extra work they put in to learn
> about a feature that helped their upgrade in 17 was suddenly detrimental
> (albeit temporarily) in 18. We should never punish people for only being a
> little-bit current in their knowledge. Moreover, this surprise would
> persist even after we add extended statistics import function
> functionality.
>
> I presented this problem to several people at PgConf.EU, and the consensus
> least-bad solution was that vacuumdb should filter out tables that are not
> missing any statistics when using options --analyze, --analyze-only, and
> --analyze-in-stages, with an additional flag for now called --force-analyze
> to restore the un-filtered functionality. This gives the outcome tree:
>
> 1. Users who do not have extended statistics and do not use (or not even
> know about) vacuumdb will be blissfully unaware, and will get better
> post-upgrade performance.
> 2. Users who do not have extended statistics but use vacuumdb
> --analyze-in-stages will be pleasantly surprised that the vacuumdb run is
> almost a no-op, and completes quickly. Those who are surprised by this and
> re-run vacuumdb --analyze-in-stages will get another no-op.
> 3. Users who have extended statistics and use vacuumdb --analyze-in-stages
> will get a quicker vacuumdb run, as only the tables with extended stats
> will pass the filter. Subsequent re-runs of vacuumdb --analyze-in-stages
> would be the no-op.
> 4. Users who have extended statistics and don't use vacuumdb will still get
> better performance than they would have without any stats imported.
>
> In case anyone is curious, I'm defining "missing stats" as a table/matview
> with any of the following:
>
> 1. A table with an attribute that lacks a corresponding pg_statistic row.
> 2. A table with an index with an expression attribute that lacks a
> corresponding pg_statistic row (non-expression attributes just borrow the
> pg_statistic row from the table's attribute).
> 3. A table with at least one extended statistic that does not have a
> corresponding pg_statistic_ext_data row.
>
> Note that none of these criteria are concerned with the substance of the
> statistics (ex. pg_statistic row should have mcv stats but does not),
> merely their row-existence. 
>
> Some rejected alternative solutions were:
>
> 1. Adding a new option --analyze-missing-stats. While simple, few people
> would learn about it, knowledge of it would be drowned out by the
> aforementioned sea of existing blog posts.
> 2. Adding --analyze-missing-stats and making --analyze-in-stages fail with
> an error message educating the user about --analyze-missing-stats. Users
> might not see the error, existing tooling wouldn't be able to act on the
> error, and there are legitimate non-upgrade uses of --analyze-in-stages.
>
> MAIN CONCERN GOING FORWARD
>
> This change to vacuumdb will require some reworking of the
> vacuum_one_database() function so that the list of tables analyzed is
> preserved across the stages, as subsequent stages runs won't be able to
> detect which tables were previously missing stats.

You seem to be optimizing for people using pg_upgrade, and for people
upgrading to PG 18, without adequately considering people using vacuumdb
in non-pg_upgrade situations, and people using PG 19+. Let me explain.

First, I see little concern here for how people who use --analyze and
--analyze-only independent of pg_upgrade will be affected by this.
While I recommend people decrease vacuum and analyze threshold during
non-peak periods:

https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017

some people might just regenerate all statistics during non-peak periods
using these options. You can perhaps argue that --analyze-in-stages
would only be used by pg_upgrade so maybe that can be adjusted more
easily.

Second, the API for what --analyze and --analyze-only do will be very
confusing for people running, e.g., PG 20, because the average user
reading the option name will not guess it only adds missing statistics.

I think you need to rethink your approach and just accept that a mention
of the new preserving statistic behavior of pg_upgrade, and the new
vacuumdb API required, will be sufficient. In summary, I think you need
a new --compute-missing-statistics-only that can be combined with
--analyze, --analyze-only, and --analyze-in-stages to compute only
missing statistics, and document it in the PG 18 release notes.

Frankly, we have a similar problem with partitioned tables:

https://www.postgresql.org/docs/current/sql-analyze.html

For partitioned tables, ANALYZE gathers statistics by sampling
rows from all partitions; in addition, it will recurse into
each partition and update its statistics. Each leaf partition
is analyzed only once, even with multi-level partitioning. No
statistics are collected for only the parent table (without data
from its partitions), because with partitioning it's guaranteed
to be empty.

--> The autovacuum daemon does not process partitioned tables, nor
does it process inheritance parents if only the children are ever
modified. It is usually necessary to periodically run a manual
ANALYZE to keep the statistics of the table hierarchy up to date.

Now, you can say partitioned table statistics are not as important as
extended statistics, but that fact remains that we have these two odd
cases where special work must be done to generate statistics.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2024-11-18 19:56:52 Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
Previous Message David E. Wheeler 2024-11-18 19:19:07 Re: RFC: Additional Directory for Extensions