From: | Kouber Saparev <kouber(at)gmail(dot)com> |
---|---|
To: | pgsql-admin(at)lists(dot)postgresql(dot)org |
Subject: | Stalled replica (terminating connection due to conflict with recovery) |
Date: | 2023-02-01 22:37:05 |
Message-ID: | CAN4RuQs6OfWZonT9DjiLGtwA7p+rifOn4GKCgUoNx1zMYmTjPA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
We are using a cascading streaming replication setup with PostgreSQL 14.6:
db1 -> db2 -> db.warehouse
hot_standby_feedback = on
max_standby_streaming_delay = 15min
On the master node (db1) some piece of code is creating and dropping
temporary tables all the time, thus bloating pg_attribute, pg_depend, etc.
The rate is around 1500 inserts + 1500 deletes per second on pg_attribute.
As a result autovacuum is launched every minute on the latter.
The situation we are facing is that in some occasions the replica
(db.warehouse) becomes unresponsive for a specific database - nobody could
connect to the database (through psql or other client) for several minutes
until eventually we restart the database. Trying to connect with psql (or
pg_activity) hangs forever.
Connecting to other databases, such as postgres, succeeds. Then eventually
we could see the list of queries within pg_stat_activity and kill them all
with pg_terminate_backend(). Regardless of that, the database in question
is still refusing new connections, throwing some "terminating connection
due to conflict with recovery" errors eventually in the log.
There are 3 rows within pg_stat_activity (except the one issuing the query
to get them):
RECORD 1:
backend_start | 2023-02-01 13:21:44.167132+00
wait_event_type | IPC
wait_event | RecoveryConflictSnapshot
RECORD 2:
backend_start | 2023-02-01 13:21:44.202399+00
wait_event_type | Activity
wait_event | BgWriterMain
RECORD 3:
backend_start | 2023-02-01 13:21:44.202096+00
wait_event_type | Activity
wait_event | CheckpointerMain
No long running queries, nothing... just these 3 lines and a database that
is not willing to accept connections.
Any idea what might be causing this behaviour? We highly suspect the crazy
writing over pg_attribute on the master node, but still - a cancelled query
because of a conflict is one thing, a stalled database not accepting
connections for anybody - a little disaster.
Any advice on how to avoid this would be highly appreciated.
--
Kouber Saparev
From | Date | Subject | |
---|---|---|---|
Next Message | sireesha | 2023-02-01 22:45:31 | Re: Mutex error 22 - Postgres version 14 |
Previous Message | Tom Lane | 2023-02-01 22:18:17 | Re: Mutex error 22 - Postgres version 14 |