From: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Jeff Davis <pgsql(at)j-davis(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, alvherre(at)alvh(dot)no-ip(dot)org |
Subject: | Re: Statistics Import and Export |
Date: | 2024-11-19 01:29:10 |
Message-ID: | CADkLM=cxNaU3O40Y5zGARJBWm-uAvb2hvgZxd=tOu-zTsrHpkQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Nov 18, 2024 at 2:47 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Fri, Nov 8, 2024 at 01:25:21PM -0500, Corey Huinker wrote:
> > WHAT IS NOT DONE - EXTENDED STATISTICS
> >
> > It is a general consensus in the community that "nobody uses extended
> > statistics", though I've had difficulty getting actual figures to
> back this
> > up, even from my own employer. Surveying several vendors at
> PgConf.EU, the
> > highest estimate was that at most 1% of their customers used extended
> > statistics, though more probably should. This reinforces my belief
> that a
> > feature that would eliminate a major pain point in upgrades for 99%
> of
> > customers shouldn't be held back by the fact that the other 1% only
> have a
> > reduced hassle.
> >
> > However, having relation and attribute statistics carry over on major
> > version upgrades presents a slight problem: running vacuumdb
> > --analyze-in-stages after such an upgrade is completely unnecessary
> for
> > those without extended statistics, and would actually result in
> _worse_
> > statistics for the database until the last stage is complete.
> Granted,
> > we've had great difficulty getting users to know that vacuumdb is a
> thing
> > that should be run, but word has slowly spread through our own
> > documentation and those "This one simple trick will make your
> postgres go
> > fast post-upgrade" blog posts. Those posts will continue to lurk in
> search
> > results long after this feature goes into release, and it would be a
> rude
> > surprise to users to find out that the extra work they put in to
> learn
> > about a feature that helped their upgrade in 17 was suddenly
> detrimental
> > (albeit temporarily) in 18. We should never punish people for only
> being a
> > little-bit current in their knowledge. Moreover, this surprise would
> > persist even after we add extended statistics import function
> > functionality.
> >
> > I presented this problem to several people at PgConf.EU, and the
> consensus
> > least-bad solution was that vacuumdb should filter out tables that
> are not
> > missing any statistics when using options --analyze, --analyze-only,
> and
> > --analyze-in-stages, with an additional flag for now called
> --force-analyze
> > to restore the un-filtered functionality. This gives the outcome
> tree:
> >
> > 1. Users who do not have extended statistics and do not use (or not
> even
> > know about) vacuumdb will be blissfully unaware, and will get better
> > post-upgrade performance.
> > 2. Users who do not have extended statistics but use vacuumdb
> > --analyze-in-stages will be pleasantly surprised that the vacuumdb
> run is
> > almost a no-op, and completes quickly. Those who are surprised by
> this and
> > re-run vacuumdb --analyze-in-stages will get another no-op.
> > 3. Users who have extended statistics and use vacuumdb
> --analyze-in-stages
> > will get a quicker vacuumdb run, as only the tables with extended
> stats
> > will pass the filter. Subsequent re-runs of vacuumdb
> --analyze-in-stages
> > would be the no-op.
> > 4. Users who have extended statistics and don't use vacuumdb will
> still get
> > better performance than they would have without any stats imported.
> >
> > In case anyone is curious, I'm defining "missing stats" as a
> table/matview
> > with any of the following:
> >
> > 1. A table with an attribute that lacks a corresponding pg_statistic
> row.
> > 2. A table with an index with an expression attribute that lacks a
> > corresponding pg_statistic row (non-expression attributes just
> borrow the
> > pg_statistic row from the table's attribute).
> > 3. A table with at least one extended statistic that does not have a
> > corresponding pg_statistic_ext_data row.
> >
> > Note that none of these criteria are concerned with the substance of
> the
> > statistics (ex. pg_statistic row should have mcv stats but does not),
> > merely their row-existence.
> >
> > Some rejected alternative solutions were:
> >
> > 1. Adding a new option --analyze-missing-stats. While simple, few
> people
> > would learn about it, knowledge of it would be drowned out by the
> > aforementioned sea of existing blog posts.
> > 2. Adding --analyze-missing-stats and making --analyze-in-stages
> fail with
> > an error message educating the user about --analyze-missing-stats.
> Users
> > might not see the error, existing tooling wouldn't be able to act on
> the
> > error, and there are legitimate non-upgrade uses of
> --analyze-in-stages.
> >
> > MAIN CONCERN GOING FORWARD
> >
> > This change to vacuumdb will require some reworking of the
> > vacuum_one_database() function so that the list of tables analyzed is
> > preserved across the stages, as subsequent stages runs won't be able
> to
> > detect which tables were previously missing stats.
>
> You seem to be optimizing for people using pg_upgrade, and for people
> upgrading to PG 18, without adequately considering people using vacuumdb
> in non-pg_upgrade situations, and people using PG 19+. Let me explain.
>
This was a concern as I was polling people.
A person using vacuumdb in a non-upgrade situation is, to my limited
imagination, one of three types:
1. A person who views vacuumdb as a worthwhile janitorial task for
downtimes.
2. A person who wants stats on a lot of recently created tables.
3. A person who wants better stats on a lot of recently (re)populated
tables.
The first group would not be using --analyze-in-stages or --analyze-only,
because the vacuuming is a big part of it. They will be unaffected.
The second group will be pleasantly surprised to learn that they no longer
need to specify a subset of tables, as any table missing stats will get
picked up.
The third group would be surprised that their operation completed so
quickly, check the docs, add in --force-analyze to their script, and re-run.
>
> First, I see little concern here for how people who use --analyze and
> --analyze-only independent of pg_upgrade will be affected by this.
> While I recommend people decrease vacuum and analyze threshold during
> non-peak periods:
>
> https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017
>
> some people might just regenerate all statistics during non-peak periods
> using these options. You can perhaps argue that --analyze-in-stages
> would only be used by pg_upgrade so maybe that can be adjusted more
> easily.
>
I, personally, would be fine if this only modified --analyze-in-stages, as
it already carries the warning:
"This option is only useful to analyze a database that currently has no
statistics or has wholly incorrect ones, such as if it is newly populated
from a restored dump or by pg_upgrade. Be aware that running with this
option in a database with existing statistics may cause the query optimizer
choices to become transiently worse due to the low statistics targets of
the early stages."
But others felt that --analyze-only should be in the mix as well.
No one advocated for changing the behavior of options that involve actual
vacuuming.
> Second, the API for what --analyze and --analyze-only do will be very
> confusing for people running, e.g., PG 20, because the average user
> reading the option name will not guess it only adds missing statistics.
>
> I think you need to rethink your approach and just accept that a mention
> of the new preserving statistic behavior of pg_upgrade, and the new
> vacuumdb API required, will be sufficient. In summary, I think you need
> a new --compute-missing-statistics-only that can be combined with
> --analyze, --analyze-only, and --analyze-in-stages to compute only
> missing statistics, and document it in the PG 18 release notes.
>
A --missing-only/--analyze-missing-in-stages option was my first idea, and
it's definitely cleaner, but as I stated in the rejected ideas section
above, when I reached out to others at PgConf.EU there was pretty general
consensus that few people would actually read our documentation, and the
few that have in the past are unlikely to read it again to discover the new
option, and those people would have a negative impact of using
--analyze-in-stages, effectively punishing them for having once read the
documentation (or a blog post) but not re-read it prior to upgrade.
So, to add non-pg_upgrade users to the outcome tree in my email from
2024-11-04:
5. Users who use vacuumdb in a non-upgrade situation and do not use either
--analyze-in-stages or --analyze-only will be completely unaffected.
6. Users who use vacuumdb in a non-upgrade situation with either
--analyze-in-stages or --analyze-only set will find that the operation
skips tables that already have stats, and will have to add --force-analyze
to restore previous behavior.
That's not a great surprise for group 6, but I have to believe that group
is smaller than group 5, and it's definitely smaller than the group of
users that need to upgrade.
To conclude, there's no great option here, and I went with what seemed to
be the least-bad option in terms of user impact. I'm not locked in on any
one solution, and I really hope somebody can come up with one that we all
like.
Attached is a re-basing of the existing patchset, plus 3 more additions:
0012 - disallows setting stats on system columns per Heikki's observation
0013 - Consolidates two attribute cache lookups into one
0014 - Adds a --no-data option to pg_dump/pg_restore. This will be
particularly useful in diagnostic situations where a user wants to analyze
query plans in a sandbox without having to load all the data into the
sandbox. Instead, they can just bring over the schema and statistics and
run their experiments. This patch needs a regression test but I thought I'd
put it out there early as a way of showing the value of the stats import
functions outside of upgrades.
>
From | Date | Subject | |
---|---|---|---|
Next Message | Sutou Kouhei | 2024-11-19 01:31:15 | Re: Make COPY format extendable: Extract COPY TO format implementations |
Previous Message | Michael Paquier | 2024-11-19 01:15:28 | Re: Making error message more user-friendly with spaces in a URI |