From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE
Date: 2024-10-27 04:09:28
Message-ID: 20241027040928.9e.nmisch@google.com
Lists: pgsql-hackers
On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
> intra-grant-inplace-db.spec got a novel failure today:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
>
> The isolationtester_waiting query is supposed to detect that step vac2 is
> blocked. vac2 didn't finish within the timeout, but isolationtester_waiting
> never considered it blocked.
> ... work on reproducing this.
I'm running loops of three workloads that might reproduce this, on s390x:
- A buildfarm config like buildfarm member sarus.
- A workload where autovacuum vac_update_datfrozenxid() takes >10s, but
non-auto "VACUUM (ONLY_DATABASE_STATS)" runs in a loop under
lock_timeout=5s; a sketch of the non-auto side follows this list. This
notices if the non-auto VACUUM ever fails to cancel autovacuum. I apply the
attached inplace200-bench-vac-v0.1.patch, then run
contrib/amcheck/t/089_vac.pl.
- A workload to make vac2's deadlock detector run when an autovacuum worker is
starting. That's one of my candidate explanations of the events behind the
failure log. I apply the attached inplace190-repro-autovacuum-v0.1.patch,
then run intra-grant-inplace-db.spec with debug_parallel_query=regress and
autovacuum_naptime=1s. This regularly gets vac2 to cancel autovacuum.
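
For reference, the non-auto side of that second workload amounts to running
something like the following in a loop (a minimal sketch; the attached
inplace200-bench-vac-v0.1.patch is what makes the autovacuum side slow
enough for the cancel to matter):

    SET lock_timeout = '5s';
    -- Expect this to cancel any conflicting autovacuum worker rather than
    -- hit the lock_timeout.
    VACUUM (ONLY_DATABASE_STATS);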
So far, those have seen nothing like the failed run.
I'm inclined to make isolationtester log pg_stat_activity and pg_locks
whenever a step times out. That would have ruled out many explanations for
what happened on sarus. However, logging on timeout won't help much until
something can reproduce the timeout.
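
For concreteness, I mean having isolationtester dump something along these
lines whenever a step times out (a rough sketch only; the exact column list
would be a matter for the actual patch):

    -- Illustrative timeout diagnostics: one row per lock per backend.
    SELECT a.pid, a.backend_type, a.state, a.wait_event_type, a.wait_event,
           l.locktype, l.mode, l.granted, a.query
    FROM pg_stat_activity a
    LEFT JOIN pg_locks l ON l.pid = a.pid
    ORDER BY a.pid;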
> It's odd that a new
> auto-analyze starts every minute, each of which exits due to blocking vac2.
Every autovacuum_naptime, a new worker cancels the old one due to wanting the
old worker's LOCKTAG_DATABASE_FROZEN_IDS. One can see this with "BEGIN; GRANT
TEMP ON DATABASE postgres TO pg_monitor;". That open transaction's xmax
prevents vac_update_datfrozenxid() from finishing. While that transaction
remains open, every autovacuum_naptime after the first will log a cancellation
of the previous autovacuum worker. (This assumes each autovacuum worker,
having negligible real work to do, reaches vac_update_datfrozenxid() before
the next autovacuum_naptime.) That's likely not ideal, but it doesn't explain
the sarus failure.
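
For anyone wanting to watch that from a second session while the GRANT
transaction stays open, something like this shows the frozen-IDs lock
traffic (assuming "frozenid" is the pg_locks locktype corresponding to
LOCKTAG_DATABASE_FROZEN_IDS):

    -- Session 1: leave this open to block vac_update_datfrozenxid().
    BEGIN;
    GRANT TEMP ON DATABASE postgres TO pg_monitor;

    -- Session 2: each autovacuum_naptime, a new worker shows up waiting
    -- here, and the server log reports a cancel of the previous worker.
    SELECT l.pid, l.granted, a.backend_type, a.wait_event
    FROM pg_locks l
    JOIN pg_stat_activity a USING (pid)
    WHERE l.locktype = 'frozenid';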
Attachments:
  inplace200-bench-vac-v0.1.patch  (text/plain, 2.1 KB)
  inplace190-repro-autovacuum-v0.1.patch  (text/plain, 2.2 KB)