From: John H <johnhyvr(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Allow logical failover slots to wait on synchronous replication
Date: 2024-06-10 22:51:05
Message-ID: CA+-JvFtq6f7+wAwSdud-x0yMTeMejUhpkyid1Xa_VNpRd_-oPw@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

Building on bf279ddd1c, this patch introduces a GUC
'standby_slot_names_from_syncrep', which allows logical failover slots
to wait until changes have been synchronously replicated before the
decoded changes are sent to logical subscribers.

The existing 'standby_slot_names' isn't great for users running
clusters with quorum-based synchronous replicas. For instance, if the
user has synchronous_standby_names = 'ANY 3 (A,B,C,D,E)', it's tedious
to keep reconfiguring standby_slot_names to name whichever three sync
replicas are currently the most up to date whenever other sync replicas
start lagging. If both GUCs are set, 'standby_slot_names' takes
precedence.
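
For illustration, a quorum setup like the one above could then be
configured as follows (the standby and slot names here are made up):

# postgresql.conf on the primary
synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'

# New GUC from this patch: logical failover slots wait on the
# synchronous quorum instead of a fixed list of physical slots.
standby_slot_names_from_syncrep = 'true'

# If this were also set, it would take precedence over the new GUC.
#standby_slot_names = 'slot_a, slot_b, slot_c'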

I did some very brief pgbench runs to compare the latency. The client
instance ran pgbench and 10 logical clients, while the Postgres box
hosted the writer and 5 synchronous replicas.
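
For reference, each run below corresponds roughly to the following
invocation (the database name is illustrative; the scale factor is
taken from the second run's output):

pgbench -i -s 200 postgres
pgbench -c 8 -j 8 -T 1800 -M simple postgres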

There's a hit to TPS, which I'm thinking is due to more contention on
SyncRepLock, and it scales with the number of logical walsenders. I'm
guessing we could avoid this if we introduced another set of
lsn[NUM_SYNC_REP_WAIT_MODE] and had the logical walsenders check and
wait on that instead, but I wasn't sure if that's the right approach.
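
A rough sketch of one way that could look (the field and function
names, and the use of atomics, are only illustrative and not part of
the attached patch): mirror the quorum-confirmed LSNs into a second
array that logical walsenders can read without taking SyncRepLock.

/* Hypothetical sketch only -- names are illustrative, not from the patch. */

/* In WalSndCtlData (walsender_private.h), next to the existing
 * lsn[NUM_SYNC_REP_WAIT_MODE] that is protected by SyncRepLock.
 * Would need pg_atomic_init_u64() during shared memory init. */
pg_atomic_uint64 confirmed_lsn_for_logical[NUM_SYNC_REP_WAIT_MODE];

/* In SyncRepReleaseWaiters() (syncrep.c), while SyncRepLock is already
 * held exclusively to advance WalSndCtl->lsn[], mirror the values: */
for (int mode = 0; mode < NUM_SYNC_REP_WAIT_MODE; mode++)
    pg_atomic_write_u64(&WalSndCtl->confirmed_lsn_for_logical[mode],
                        (uint64) WalSndCtl->lsn[mode]);

/* In the logical walsender's wait loop, check the mirrored value
 * without acquiring SyncRepLock at all: */
static bool
SyncRepQuorumConfirmedFlush(XLogRecPtr wait_lsn)
{
    uint64 flushed = pg_atomic_read_u64(
        &WalSndCtl->confirmed_lsn_for_logical[SYNC_REP_WAIT_FLUSH]);

    return (XLogRecPtr) flushed >= wait_lsn;
}

The logical walsenders would still need a wakeup (e.g. having the
syncrep release path set their latches), so this is only meant to show
the shape of the idea.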

pgbench numbers:

// Empty standby_slot_names_from_syncrep
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 1800 s
number of transactions actually processed: 1720173
number of failed transactions: 0 (0.000%)
latency average = 8.371 ms
initial connection time = 7.963 ms
tps = 955.651025 (without initial connection time)

// standby_slot_names_from_syncrep = 'true'
scaling factor: 200
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 1800 s
number of transactions actually processed: 1630105
number of failed transactions: 0 (0.000%)
latency average = 8.834 ms
initial connection time = 7.670 ms
tps = 905.610230 (without initial connection time)

Thanks,

--
John Hsu - Amazon Web Services

Attachment: 0001-Introduce-a-new-guc-standby_slot_names_from_syncrep.patch (application/octet-stream, 21.8 KB)
