Primary and standby setting cross-checks

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Primary and standby setting cross-checks
Date: 2024-08-29 18:52:06
Message-ID: 19f884d1-66e8-4831-90a7-1e480ff2272f@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Currently, if you configure a hot standby server with a smaller
max_connections setting than the primary, the server refuses to start up:

LOG: entering standby mode
FATAL: recovery aborted because of insufficient parameter settings
DETAIL: max_connections = 10 is a lower setting than on the primary
server, where its value was 100.
HINT: You can restart the server after making the necessary
configuration changes.

Or if you change the setting in the primary while the standby is
running, replay pauses:

WARNING: hot standby is not possible because of insufficient parameter
settings
DETAIL: max_connections = 100 is a lower setting than on the primary
server, where its value was 200.
CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
max_connections=200 max_worker_processes=8 max_wal_senders=10
max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
wal_log_hints=off track_commit_timestamp=off
LOG: recovery has paused
DETAIL: If recovery is unpaused, the server will shut down.
HINT: You can then restart the server after making the necessary
configuration changes.
CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
max_connections=200 max_worker_processes=8 max_wal_senders=10
max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
wal_log_hints=off track_commit_timestamp=off

Both of these are rather unpleasant behavior.

I thought I could get rid of that limitation with my CSN snapshot patch
[1], because it gets rid of the fixed-size known-assigned XIDs array,
but there's a second reason for these limitations. It's also used to
ensure that the standby has enough space in the lock manager to hold
possible AccessExclusiveLocks taken by transactions in the primary.

So firstly, I think that's a bad tradeoff. In vast majority of cases,
you would not run out of lock space anyway, if you just started up the
system. Secondly, that cross-check of settings doesn't fully prevent the
problem. It ensures that the lock tables are large enough to accommodate
all the locks you could possibly hold in the primary, but that doesn't
take into account any additional locks held by read-only queries in the
hot standby. So if you have queries running in the standby that take a
lot of locks, this can happen anyway:

2024-08-29 21:44:32.634 EEST [668327] FATAL: out of shared memory
2024-08-29 21:44:32.634 EEST [668327] HINT: You might need to increase
"max_locks_per_transaction".
2024-08-29 21:44:32.634 EEST [668327] CONTEXT: WAL redo at 2/FD40FCC8
for Standby/LOCK: xid 996 db 5 rel 154045
2024-08-29 21:44:32.634 EEST [668327] WARNING: you don't own a lock of
type AccessExclusiveLock
2024-08-29 21:44:32.634 EEST [668327] LOG: RecoveryLockHash contains
entry for lock no longer recorded by lock manager: xid 996 database 5
relation 154045
TRAP: failed Assert("false"), File:
"../src/backend/storage/ipc/standby.c", Line: 1053, PID: 668327
postgres: startup recovering
0000000100000002000000FD(ExceptionalCondition+0x6e)[0x556a4588396e]
postgres: startup recovering
0000000100000002000000FD(+0x44156e)[0x556a4571356e]
postgres: startup recovering
0000000100000002000000FD(StandbyReleaseAllLocks+0x78)[0x556a45712738]
postgres: startup recovering
0000000100000002000000FD(ShutdownRecoveryTransactionEnvironment+0x15)[0x556a45712685]
postgres: startup recovering
0000000100000002000000FD(shmem_exit+0x111)[0x556a457062e1]
postgres: startup recovering
0000000100000002000000FD(+0x434132)[0x556a45706132]
postgres: startup recovering
0000000100000002000000FD(proc_exit+0x59)[0x556a45706079]
postgres: startup recovering
0000000100000002000000FD(errfinish+0x278)[0x556a45884708]
postgres: startup recovering
0000000100000002000000FD(LockAcquireExtended+0xa46)[0x556a45719386]
postgres: startup recovering
0000000100000002000000FD(StandbyAcquireAccessExclusiveLock+0x11d)[0x556a4571330d]
postgres: startup recovering
0000000100000002000000FD(standby_redo+0x70)[0x556a45713690]
postgres: startup recovering
0000000100000002000000FD(PerformWalRecovery+0x7b3)[0x556a4547d313]
postgres: startup recovering
0000000100000002000000FD(StartupXLOG+0xac3)[0x556a4546dae3]
postgres: startup recovering
0000000100000002000000FD(StartupProcessMain+0xe8)[0x556a45693558]
postgres: startup recovering
0000000100000002000000FD(+0x3ba95d)[0x556a4568c95d]
postgres: startup recovering
0000000100000002000000FD(+0x3bce41)[0x556a4568ee41]
postgres: startup recovering
0000000100000002000000FD(PostmasterMain+0x116e)[0x556a4568eaae]
postgres: startup recovering
0000000100000002000000FD(+0x2f960e)[0x556a455cb60e]
/lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f10ef042c8a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f10ef042d45]
postgres: startup recovering
0000000100000002000000FD(_start+0x21)[0x556a453af011]
2024-08-29 21:44:32.641 EEST [668324] LOG: startup process (PID 668327)
was terminated by signal 6: Aborted
2024-08-29 21:44:32.641 EEST [668324] LOG: terminating any other active
server processes
2024-08-29 21:44:32.654 EEST [668324] LOG: shutting down due to startup
process failure
2024-08-29 21:44:32.729 EEST [668324] LOG: database system is shut down

Granted, if you restart the server, it will probably succeed because
restarting the server will kill all the other queries that were holding
locks. But yuck. With assertions disabled, it looks a little less scary,
but not nice anyway.

So how to improve this? I see a few options:

a) Downgrade the error at startup to a warning, and allow starting the
standby with smaller settings in standby. At least with a smaller
max_locks_per_transactions. The other settings also affect the size of
known-assigned XIDs array, but if the CSN snapshots get committed, that
will get fixed. In most cases there is enough lock memory anyway, and it
will be fine. Just fix the assertion failure so that the error message
is a little nicer.

b) If you run out of lock space, kill running queries, and prevent new
ones from starting. Track the locks in startup process' private memory
until there is enough space in the lock manager, and then re-open for
queries. In essence, go from hot standby mode to warm standby, until
it's possible to go back to hot standby mode again.

Thoughts, better ideas?

[1] https://commitfest.postgresql.org/49/4912/

--
Heikki Linnakangas
Neon (https://neon.tech)

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-08-29 19:25:11 Re: Pgstattuple on Sequences: Seeking Community Feedback on Potential Patch
Previous Message Pavel Stehule 2024-08-29 18:50:33 Re: [PATCH] Add CANONICAL option to xmlserialize