Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org
Subject: Re: Statistics Import and Export
Date: 2024-11-05 02:22:28
Message-ID: CADkLM=fR7TwH0cLREQkf5_=KLcOYVxJw0Km0i5MpaWeuDwVo6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
>
> I'd also like feedback, though I feel very strongly that we should do what
> ANALYZE does. In an upgrade situation, nearly all tables will have stats
> imported, which would result in an immediate doubling of pg_class - not the
> end of the world, but not great either.
>
> Given the recent bugs associated with inplace updates and race conditions,
> if we don't want to do in-place here, we should also consider getting rid
> of it for ANALYZE. I briefly pondered if it would make sense to vertically
> partition pg_class into the stable attributes and the attributes that get
> modified in-place, but that list is pretty long: relpages, reltuples,
> relallvisible, relhasindex, reltoastrelid, relhasrules, relhastriggers,
> relfrozenxid, and reminmxid,
>
> If we don't want to do inplace updates in pg_restore_relation_stats(),
> then we could mitigate the bloat with a VACUUM FULL pg_class at the tail
> end of the upgrade if stats were enabled.
>
>
>> pg_restore_*_stats() functions. But there's a lot of overlap, so it may
>> be worth discussing again whether we should only have one set of
>> functions.
>>
>
> For the reason of in-place updates and error tolerance, I think they have
> to remain separate functions, but I'm also interested in hearing other's
> opinions.
>
>

A lot of stuff has been committed, but a lot still remains, so I'm going to
give a state-of-the-patchset update.

WHAT HAS BEEN COMMITTED

The functions pg_set_relation_stats(), and pg_set_attribute_stats(). These
are new functions allowing the table owner/maintainer to inject arbitrary
statistics into a table/index/matview or attribute thereof. The goal is to
provide a tool to test "what if" experiments on the planner.

The function pg_clear_relation_stats(). It resets pg_class stats variables
back to the new-table defaults. However, the potential change to make the
default -1 for partitioned tables/indexes means that this function would
need to be updated.

The function pg_clear_attribute_stats(). This function deletes the
pg_statistic row associated with a given attribute.

The functions pg_restore_relation_stats() and pg_restore_attribute_stats().
These perform a similar function to their -set counterparts, but in a way
more friendly to use via pg_upgrade, and using a parameter syntax that's
future-tolerant if not future-proof.

NEXT STEPS, PATCHES ATTACHED

0001

Add booleans dumpSchema, dumpData to the dump/restore options structures
without removing schemaOnly and dataOnly. This patch is borne of the
combinatorial complexity that we would have to deal with if we kept all
"should I dump this object?" logic based on schemaOnly, dataOnly, and a new
statisticsOnly flag. Better to just flip these negatives to positives and
resolve them once.

0002

Based on feedback from Nathan Bossart. This removes schemaOnly and dataOnly
from the dump/restore option structure as they are now redundant and could
cause confusion.

0003

This adds statistics import to pg_restore and pg_upgrade. This is what all
of these patches have been leading up to.

0004

Importing statistics in a pg_upgrade would touch all relations (tables,
indexes, matviews) in a database, meaning that each pg_class entry would be
updated. If those updates are not inplace (as is done in VACUUM and
ANALYZE), then we've just added near-100% bloat to pg_class. To avoid that,
we re-introduce inplace updates for pg_restore_relation_stats().

0005

Depending on the outcome of
https://www.postgresql.org/message-id/be7951d424b9c16c761c39b9b2677a57fb697b1f.camel@j-davis.com,
we may have to modify pg_clear_relation_stats() to give a different default
relpages depending on the relkind. If that comes to pass, then this patch
will be needed.

WHAT IS NOT DONE - EXTENDED STATISTICS

It is a general consensus in the community that "nobody uses extended
statistics", though I've had difficulty getting actual figures to back this
up, even from my own employer. Surveying several vendors at PgConf.EU, the
highest estimate was that at most 1% of their customers used extended
statistics, though more probably should. This reinforces my belief that a
feature that would eliminate a major pain point in upgrades for 99% of
customers shouldn't be held back by the fact that the other 1% only have a
reduced hassle.

However, having relation and attribute statistics carry over on major
version upgrades presents a slight problem: running vacuumdb
--analyze-in-stages after such an upgrade is completely unnecessary for
those without extended statistics, and would actually result in _worse_
statistics for the database until the last stage is complete. Granted,
we've had great difficulty getting users to know that vacuumdb is a thing
that should be run, but word has slowly spread through our own
documentation and those "This one simple trick will make your postgres go
fast post-upgrade" blog posts. Those posts will continue to lurk in search
results long after this feature goes into release, and it would be a rude
surprise to users to find out that the extra work they put in to learn
about a feature that helped their upgrade in 17 was suddenly detrimental
(albeit temporarily) in 18. We should never punish people for only being a
little-bit current in their knowledge. Moreover, this surprise would
persist even after we add extended statistics import function functionality.

I presented this problem to several people at PgConf.EU, and the consensus
least-bad solution was that vacuumdb should filter out tables that are not
missing any statistics when using options --analyze, --analyze-only, and
--analyze-in-stages, with an additional flag for now called --force-analyze
to restore the un-filtered functionality. This gives the outcome tree:

1. Users who do not have extended statistics and do not use (or not even
know about) vacuumdb will be blissfully unaware, and will get better
post-upgrade performance.
2. Users who do not have extended statistics but use vacuumdb
--analyze-in-stages will be pleasantly surprised that the vacuumdb run is
almost a no-op, and completes quickly. Those who are surprised by this and
re-run vacuumdb --analyze-in-stages will get another no-op.
3. Users who have extended statistics and use vacuumdb --analyze-in-stages
will get a quicker vacuumdb run, as only the tables with extended stats
will pass the filter. Subsequent re-runs of vacuumdb --analyze-in-stages
would be the no-op.
4. Users who have extended statistics and don't use vacuumdb will still get
better performance than they would have without any stats imported.

In case anyone is curious, I'm defining "missing stats" as a table/matview
with any of the following:

1. A table with an attribute that lacks a corresponding pg_statistic row.
2. A table with an index with an expression attribute that lacks a
corresponding pg_statistic row (non-expression attributes just borrow the
pg_statistic row from the table's attribute).
3. A table with at least one extended statistic that does not have a
corresponding pg_statistic_ext_data row.

Note that none of these criteria are concerned with the substance of the
statistics (ex. pg_statistic row should have mcv stats but does not),
merely their row-existence.

Some rejected alternative solutions were:

1. Adding a new option --analyze-missing-stats. While simple, few people
would learn about it, knowledge of it would be drowned out by the
aforementioned sea of existing blog posts.
2. Adding --analyze-missing-stats and making --analyze-in-stages fail with
an error message educating the user about --analyze-missing-stats. Users
might not see the error, existing tooling wouldn't be able to act on the
error, and there are legitimate non-upgrade uses of --analyze-in-stages.

MAIN CONCERN GOING FORWARD

This change to vacuumdb will require some reworking of the
vacuum_one_database() function so that the list of tables analyzed is
preserved across the stages, as subsequent stages runs won't be able to
detect which tables were previously missing stats.

Attachment Content-Type Size
v30-0001-Add-derivative-flags-dumpSchema-dumpData.patch application/x-patch 17.7 KB
v30-0004-Enable-in-place-updates-for-pg_restore_relation_.patch application/x-patch 5.4 KB
v30-0002-Remove-schemaOnly-dataOnly-from-dump-restore-opt.patch application/x-patch 10.9 KB
v30-0003-Enable-dumping-of-table-index-stats-in-pg_dump.patch application/x-patch 35.3 KB
v30-0005-Enable-pg_clear_relation_stats-to-handle-differe.patch application/x-patch 4.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-11-05 02:24:33 Re: Pgoutput not capturing the generated columns
Previous Message Peter Smith 2024-11-05 01:29:59 Re: Pgoutput not capturing the generated columns