From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: yuzuko <yuzukohosoya(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Amit Langote <amitlangote09(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: Autovacuum on partitioned table (autoanalyze)
Date: 2021-07-22 20:54:58
Message-ID: 20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-04-08 01:20:14 -0400, Alvaro Herrera wrote:
> On 2021-Apr-07, Alvaro Herrera wrote:
>
> > OK, I bit the bullet and re-did the logic in the way I had proposed
> > earlier in the thread: do the propagation on the collector's side, by
> > sending only the list of ancestors: the collector can read the tuple
> > change count by itself, to add it to each ancestor. This seems less
> > wasteful. Attached is v16 which does it that way and seems to work
> > nicely under my testing.
>
> Pushed with this approach. Thanks for persisting with this.
I'm looking at this in the context of rebasing & polishing the shared
memory stats patch.
I have a few questions / concerns:
1) Somehow it seems like a violation to do stuff like
get_partition_ancestors() in pgstat.c. It's nothing I can't live with, but
it feels a bit off. Would likely not be too hard to address, e.g. by just
putting some of pgstat_report_anl_ancestors in partition.c instead.
2) Why does it make sense that autovacuum sends a stats message for every
partition in the system that had any changes since the last autovacuum
cycle? On a database with a good number of objects / a short naptime we'll
often end up sending messages for the same set of tables from separate
workers, because they don't yet see the concurrent
tabentry->changes_since_analyze_reported.
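To make the race in (2) concrete, here is a minimal standalone model (not the real pgstat structs; the struct and function names are illustrative): each worker decides from its own, possibly stale, snapshot whether anything is still unreported, so workers sharing the same stale snapshot all send a message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative snapshot of one partition's counters, as a worker sees
 * them after reading the stats file.  Names are made up for the sketch. */
typedef struct PartitionSnapshot
{
	long		changes_since_analyze;			/* changes in the snapshot */
	long		changes_since_analyze_reported;	/* already propagated */
} PartitionSnapshot;

/* A worker sends an ancestor-stats message whenever its snapshot shows
 * unreported changes. */
static bool
worker_would_report(const PartitionSnapshot *snap)
{
	return snap->changes_since_analyze > snap->changes_since_analyze_reported;
}

/* Workers starting from the same stale snapshot all decide to report,
 * because none of them sees the others' concurrent updates to
 * changes_since_analyze_reported. */
static int
count_duplicate_reports(const PartitionSnapshot *stale_snap, int nworkers)
{
	int			sent = 0;

	for (int i = 0; i < nworkers; i++)
		if (worker_would_report(stale_snap))
			sent++;
	return sent;
}
```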
3) What is the goal of the autovac_refresh_stats() after the loop doing
pgstat_report_anl_ancestors()? I think it'll be common that the stats
collector hasn't even processed the incoming messages by that point, not to
speak of actually having written out a new stats file. If it took less than
10ms (PGSTAT_RETRY_DELAY) to get to autovac_refresh_stats(),
backend_read_statsfile() will not wait for a new stats file to be written
out, and we'll just re-read the state we previously read.
It's pretty expensive to re-read the stats file in some workloads, so I'm a
bit concerned that we end up significantly increasing the amount of stats
updates/reads, without actually gaining anything reliable?
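The timing problem in (3) can be sketched as follows (a simplified model, not the actual backend_read_statsfile() logic): a reader only demands a stats file newer than "now - PGSTAT_RETRY_DELAY", so any file written within that window is accepted as-is and no fresh snapshot is requested from the collector.

```c
#include <assert.h>
#include <stdbool.h>

/* PGSTAT_RETRY_DELAY is 10ms in the backend; the rest of this is an
 * illustrative stand-in for the freshness check, not the real code. */
#define PGSTAT_RETRY_DELAY_MS 10

/* Returns true if the already-cached stats file would simply be
 * re-read, i.e. the refresh is a no-op for the caller's purposes. */
static bool
accepts_cached_statsfile(long file_age_ms)
{
	return file_age_ms < PGSTAT_RETRY_DELAY_MS;
}
```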
4) In the shared mem stats patch I went to a fair bit of trouble to try to get
rid of pgstat_vacuum_stat() (which scales extremely poorly to larger
systems). For that to work pending stats can only be "staged" while holding
a lock on a relation that prevents the relation from being concurrently
dropped (pending stats increment a refcount for the shared stats object,
which ensures that we don't lose track of the fact that a stats object has
been dropped, even when stats only get submitted later).
I'm not yet clear on how to make this work for
pgstat_report_anl_ancestors() - but I probably can find a way. But it does
feel a bit off to issue stats stuff for tables we're not sure still exist.
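For reference, the refcounting scheme described in (4) boils down to something like this simplified model (field and function names are hypothetical, chosen for the sketch): staging pending stats pins the shared entry, a concurrent drop is deferred while the pin is held, and flushing the pending stats completes the drop.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative shared stats entry; the real shared-memory stats patch
 * has considerably more state than this. */
typedef struct SharedStatsEntry
{
	int			refcount;		/* pins from backends with pending stats */
	bool		dropped;		/* dropped; free once refcount reaches 0 */
} SharedStatsEntry;

/* Staging pending stats pins the entry, which requires holding a lock
 * that prevents the relation from being dropped concurrently. */
static void
stage_pending_stats(SharedStatsEntry *e)
{
	e->refcount++;
}

/* Returns true if the entry can be freed immediately; otherwise the
 * drop is recorded and deferred until the last pin is released. */
static bool
drop_stats_entry(SharedStatsEntry *e)
{
	e->dropped = true;
	return e->refcount == 0;
}

/* Flushing pending stats releases the pin; returns true if this
 * completes a deferred drop. */
static bool
flush_pending_stats(SharedStatsEntry *e)
{
	e->refcount--;
	return e->dropped && e->refcount == 0;
}
```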
I'll go and read through the thread, but my first thought is that having a
hashtable in do_autovacuum() that contains stats for partitioned tables would
be a good bit more efficient than the current approach? We already have a
hashtable for each toast table, compared to that having a hashtable for each
partitioned table doesn't seem like it'd be a problem?
With a small bit of extra work that could even avoid the need for the
additional pass through pg_class. Do the partitioned table data-gathering as
part of the "collect main tables to vacuum" pass, and then do one of
a) do the perform-analyze decision purely off the contents of that
partitioned-table hash
b) fetch the RELOID syscache entry by oid and then decide based on that
c) handle partitioned tables as part of the "check TOAST tables" pass - it's
not like we gain a meaningful amount of efficiency by using a ScanKey to
filter for RELKIND_TOASTVALUE, given that there's no index, and that an
index wouldn't commonly be useful given the percentage of toast tables in
pg_class
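The hashtable idea above could be sketched like so. In the backend this would be a dynahash created with hash_create() and filled during the "collect main tables to vacuum" pass; here a tiny fixed-size open-addressing table stands in for it, and all names are illustrative:

```c
#include <assert.h>

typedef unsigned int Oid;

#define NBUCKETS 256			/* power of two; illustrative only */

/* Per-ancestor accumulator, keyed by the partitioned table's OID. */
typedef struct PartAnlEntry
{
	Oid			relid;			/* partitioned table OID, 0 = empty slot */
	long		changes;		/* summed changes_since_analyze of leaves */
} PartAnlEntry;

static PartAnlEntry part_tab[NBUCKETS];

/* While scanning a leaf partition during the main pg_class pass, add
 * its change count to each ancestor's entry. */
static void
part_anl_accumulate(Oid ancestor, long changes)
{
	unsigned int h = ancestor & (NBUCKETS - 1);

	while (part_tab[h].relid != 0 && part_tab[h].relid != ancestor)
		h = (h + 1) & (NBUCKETS - 1);
	part_tab[h].relid = ancestor;
	part_tab[h].changes += changes;
}

/* Option a): make the perform-analyze decision purely off the
 * accumulated counts, with no extra pass through pg_class. */
static long
part_anl_changes(Oid ancestor)
{
	unsigned int h = ancestor & (NBUCKETS - 1);

	while (part_tab[h].relid != 0)
	{
		if (part_tab[h].relid == ancestor)
			return part_tab[h].changes;
		h = (h + 1) & (NBUCKETS - 1);
	}
	return 0;
}
```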
Partitioning makes it a bigger issue that do_autovacuum() does multiple passes
through pg_class (as it makes scenarios in which pg_class is large more
common), so I don't think it's great that partitioning also increases the
number of passes through pg_class.
Greetings,
Andres Freund