Re: autovacuum on primary blocking queries on replica?

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Don Seiler <don(at)seiler(dot)us>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: autovacuum on primary blocking queries on replica?
Date: 2022-05-29 15:02:28
Message-ID: 0ffa96ab1ec3b22bb738b15a93612e69d1e5c11c.camel@cybertec.at
Lists: pgsql-general

On Fri, 2022-05-27 at 14:00 -0500, Don Seiler wrote:
>  * PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
>  * Streaming physical replication
>  * hot_standby_feedback = on
> We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident
> where these queries were all blocking on AccessShareLock waits, written to the log as:
>
> 2022-05-27 15:23:53.476 UTC [8185] foo(at)foo_all - myapp LOG:  process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
> 2022-05-27 15:23:53.476 UTC [8185] foo(at)foo_all - myapp DETAIL:  Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, [...]
> This went on for 30 seconds (the value of max_standby_streaming_delay) until PG killed the blocking process:
> 2022-05-27 15:24:22.474 UTC [10822] foo(at)foo_all - anotherapp FATAL:  terminating connection due to conflict with recovery
> 2022-05-27 15:24:22.474 UTC [10822] foo(at)foo_all - anotherapp DETAIL:  User was holding a relation lock for too long.
>
> I'm trying to find the root cause of why this started. We did see an UPDATE [...]

An UPDATE by itself cannot be the problem: it takes no lock that is replicated to the standby, so it cannot block queries there.
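If you want to double-check that, a minimal sketch (the column name below is made up) is to watch pg_locks
while an UPDATE is in flight; it only holds ROW EXCLUSIVE on the table, which does not conflict with the
ACCESS SHARE locks that your read queries take:

-- session 1: leave an UPDATE open
BEGIN;
UPDATE public.industry SET name = name;   -- "name" is a hypothetical column

-- session 2: inspect the table-level lock held by session 1
SELECT pid, locktype, mode, granted
FROM pg_locks
WHERE relation = 'public.industry'::regclass;
-- shows RowExclusiveLock, which is never replayed as a lock conflict on the standby

-- session 1: clean up
ROLLBACK;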

> 15 seconds later we then see an aggressive autovacuum on this table:
>
> 2022-05-27 15:23:52.507 UTC [30513] LOG:  automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
>         pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
>         buffer usage: 12219 hits, 137 misses, 54 dirtied
>         avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
>         system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s
>
> and less than a second after that is when we see the first AccessShareLock message on the replica.
>
> I've been reading tales of autovacuum taking an AccessExclusiveLock when truncating empty pages at the end of a table.
> I'm imagining that updating every row of a table and then rolling back would leave all of those new row versions dead
> at the end of the table, so the trailing pages would qualify for truncation and lead to the scenario I saw this morning.
>
> I'm still not entirely satisfied, since the table in question was so small (only 8252 rows) that I wouldn't imagine it
> would hold things up as long as it did. Also, the blocking session on the replica was an application session,
> not any background/recovery process.

I think you are on the right trail.

VACUUM will truncate empty trailing pages if it can get a (very short) ACCESS EXCLUSIVE lock on the table.
Both that lock and the truncation are replicated, and they can create a replication conflict just like
the one you describe. Even if the lock is held for a very short time on the primary, replaying it will
conflict with any query on that table on the standby.
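To illustrate the mechanism (a hypothetical reproduction on a primary, not a claim about the exact
statements that ran on your system), a rolled-back whole-table UPDATE leaves only dead tuples in the
trailing pages, and the next VACUUM will truncate them under that short ACCESS EXCLUSIVE lock:

CREATE TABLE t (id int, payload text);
INSERT INTO t SELECT g, repeat('x', 100) FROM generate_series(1, 10000) g;

BEGIN;
UPDATE t SET payload = payload;   -- new row versions go into pages at the end of the table
ROLLBACK;                         -- those new versions are now dead

VACUUM (VERBOSE) t;               -- removes the dead tuples, finds the trailing pages empty,
                                  -- and truncates them while holding ACCESS EXCLUSIVE;
                                  -- replaying that conflicts with queries on the standby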

You can disable autovacuum truncation on the table with

ALTER TABLE public.industry SET (vacuum_truncate = off);

if you know that you can do without autovacuum truncation on that table.
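Should you still want to reclaim the trailing space now and then, the per-command option (available
since PostgreSQL 12) overrides the reloption, so you can run it manually during a quiet period:

VACUUM (TRUNCATE on, VERBOSE) public.industry;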

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
