From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Allow async standbys wait for sync replication (was: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers)
Date: 2022-02-25 15:01:37
Message-ID: CALj2ACVUa8WddVDS20QmVKNwTbeOQqy4zy59NPzh8NnLipYZGw@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jan 6, 2022 at 1:29 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram(at)gmail(dot)com> wrote:
>
> Consider a cluster formation where we have a Primary (P), a Sync Replica (S1), and multiple async replicas for disaster recovery and read scaling (within the region and outside the region). In this setup, S1 is the preferred failover target in the event of a primary failure. When a transaction is committed on the primary, it is not acknowledged to the client until the primary gets an acknowledgment from the sync standby that the WAL is flushed to disk (assume the synchronous_commit configuration is remote_flush). However, the walsenders on the primary that correspond to the async replicas don't wait for that flush acknowledgment and send the WAL to the async standbys (and to any logical replication/decoding clients) right away. So it is possible for the async replicas and logical clients to get ahead of the sync replica. If a failover is initiated in such a scenario, to bring the formation into a healthy state we have to either
>
> run pg_rewind on the async replicas so that they can reconnect with the new primary, or
> collect the latest WAL across the replicas and feed it to the standby.
>
> Both of these operations are involved, error-prone, and can cause multiple minutes of downtime if done manually. In addition, there is a window where the async replicas can show data that was neither acknowledged to the client nor committed on the sync standby. Logical clients, if they are ahead, may need to reseed the data, as there is no easy rewind option for them.
>
> I would like to propose a GUC send_Wal_after_quorum_committed which, when set to ON, makes the walsenders corresponding to async standbys and the logical replication workers wait until the LSN is quorum committed on the primary before sending it to the standbys. This not only simplifies the post-failover steps but also avoids unnecessary downtime for the async replicas. Thoughts?
Thanks Satya and others for the inputs. Here's the v1 patch that
basically allows async walsenders to wait until the sync standbys
report their flush LSN back to the primary. Please let me know your
thoughts.
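In case it helps with review and testing: the primary's pg_stat_replication
view shows each walsender's position and sync_state, so whether an async
standby has been sent WAL beyond what the sync standbys have flushed can be
checked on the primary with something like the below (the view and its
columns are standard; how the standbys are named depends on your setup):
./psql -d postgres -c "SELECT application_name, sync_state, sent_lsn, flush_lsn FROM pg_stat_replication"
Without the patch, the async standby's sent_lsn can run ahead of the sync
standbys' flush_lsn; with the patch, the async walsenders should hold back
until the sync standbys report their flush LSN.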
I've done pgbench testing to see if the patch causes any problems. I
ran the tests twice; there isn't much difference in the transactions
per second (tps), although there is a delay in the async standby
receiving the WAL, which is, after all, the feature we are pursuing.
Results are at [1] below.
[1]
HEAD or WITHOUT PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 247.395 ms
latency stddev = 74.409 ms
initial connection time = 13.622 ms
tps = 39.713114 (without initial connection time)
PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 251.757 ms
latency stddev = 72.846 ms
initial connection time = 13.025 ms
tps = 39.315862 (without initial connection time)
TEST SETUP:
primary in region 1
async standby 1 in the same region as the primary (region 1), i.e. close to the primary
sync standby 1 in region 2
sync standby 2 in region 3
an archive location in region 4, different from the primary and standby regions
Note that I intentionally kept the sync standbys in regions far from
the primary, because that makes them receive WAL a bit later by
default, which works well for this testing.
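For completeness, the synchronous part of such a setup boils down to listing
the sync standbys (by the application_name they use in primary_conninfo) in
synchronous_standby_names on the primary; roughly something like the below,
with illustrative standby names and quorum choice rather than the exact ones
used here:
./psql -d postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (sync_standby_1, sync_standby_2)'"
./psql -d postgres -c "SELECT pg_reload_conf()"
The async standby simply isn't listed there, so its walsender doesn't
participate in the sync quorum.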
PGBENCH SETUP:
./psql -d postgres -c "drop database testdb"
./psql -d postgres -c "create database testdb"
./pgbench -i -s 100 testdb
./psql -d testdb -c "\dt"
./psql -d testdb -c "SELECT pg_size_pretty(pg_database_size('testdb'))"
./pgbench -c 10 -t 500 -P 10 testdb
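To watch the delay on the async standby while a run is in progress, the LSNs
on the two nodes can be compared, and the lag columns in pg_stat_replication
on the primary tell the same story; for example (connection details are
illustrative, adjust to your setup):
on the primary:
./psql -d postgres -c "SELECT pg_current_wal_lsn()"
./psql -d postgres -c "SELECT application_name, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication"
on the async standby:
./psql -d postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()"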
Regards,
Bharath Rupireddy.
Attachment: v1-0001-Allow-async-standbys-wait-for-sync-replication.patch (application/octet-stream, 11.6 KB)