Re: Allow logical failover slots to wait on synchronous replication

From: John H <johnhyvr(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Allow logical failover slots to wait on synchronous replication
Date: 2024-08-26 19:28:06
Message-ID: CA+-JvFtb_4LuObXxY24V98jwLwHmz0ZA7R2h1xrO8P0AQz+eeg@mail.gmail.com
Lists: pgsql-hackers

Hi Shveta, Amit,

> > > > ... We should try to
> > > > find out if there is a performance benefit with the use of
> > > > synchronous_standby_names in the normal configurations like the one
> > > > you used in the above tests to prove the value of this patch.

I don't expect a performance benefit; if anything, I would expect it to perform
slightly worse because of the contention on SyncRepLock. The main value of the
patch, for me, is that it makes it easy for administrators to set the parameter
and avoids having to re-toggle configuration if they want very up-to-date
logical clients when one of the replicas they previously specified in
'synchronized_standby_slots' becomes unavailable in a synchronous
configuration setup.

> > > I didn't fully understand the parameters mentioned above, specifically
> > > what 'latency stddev' and 'latency average' represent

If I understand correctly, latency average is the mean per-transaction latency
from commit, while latency stddev is the standard deviation across those same
transactions.
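
For reference, the two numbers pgbench reports can be reproduced from per-transaction samples like this (the latency values below are illustrative, not taken from the benchmark):

```python
import statistics

# Hypothetical per-transaction commit latencies in milliseconds,
# of the kind pgbench records for each transaction.
latencies_ms = [9.8, 10.2, 10.7, 11.1, 35.0, 9.9, 10.4, 10.6, 10.3, 10.5]

# "latency average" is the arithmetic mean of the per-transaction latencies.
latency_average = statistics.mean(latencies_ms)

# "latency stddev" is the standard deviation of the same samples; note how a
# single slow outlier (35.0 ms) inflates stddev far more than the average.
latency_stddev = statistics.pstdev(latencies_ms)

print(f"latency average = {latency_average:.3f} ms")
print(f"latency stddev = {latency_stddev:.3f} ms")
```

This is also why a contention fix tends to show up in stddev before it shows up in the average: it mostly trims the tail.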

> > Yes, I also expect the patch should perform better in such a scenario
> > but it is better to test it. Also, irrespective of that, we should
> > investigate why the reported case is slower for
> > synchronous_standby_names and see if we can improve it.

We could test it, but I'm not sure how interesting it is: depending on how far
the chosen slot in 'synchronized_standby_slots' lags behind, we can easily
show that this patch will perform better.

For instance, in Shveta's suggestion of

> > > We can perform this test with both of the below settings and say make
> > > D and E slow in sending responses:
> > > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.

if the server associated with E_slot is down or undergoing some sort of
maintenance, then all logical consumers would start lagging until the server
is back up. I could also mimic a network lag of 20 seconds, and it's
guaranteed that this patch would perform better.
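
For concreteness, the two configurations being compared look roughly like the following postgresql.conf fragments (standby and slot names are illustrative, matching the A..E naming in the quoted test plan):

```
# Option 1: quorum commit -- any 3 of the 5 standbys must acknowledge,
# so two slow or down standbys do not block commits or logical clients.
synchronous_standby_names = 'ANY 3 (A, B, C, D, E)'

# Option 2: logical walsenders wait on every listed physical slot,
# so a single slow or down standby (e.g. E) stalls all logical clients.
synchronized_standby_slots = 'A_slot, B_slot, C_slot, D_slot, E_slot'
```

The asymmetry is the point: the quorum form degrades gracefully, while the slot-list form is only as fast as its slowest listed slot.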

I re-ran the benchmarks with a longer run time of 3 hours. I also noticed I
was being throttled on storage in my previous benchmarks, so I moved to a new
setup. In addition, I benchmarked a new test case with a shared cache between
all the walsenders, which lets each walsender check the cached value before
obtaining SyncRepLock and should reduce contention on that lock; the patch for
it is attached.

Database: Writer on its own disk, 5 RRs on the other disk together
Client: 10 logical clients, pgbench running from here as well

'pgbench -c 32 -j 4 -T 10800 -U "ec2-user" -d postgres -r -P 1'

# Test failover_slots with synchronized_standby_slots = 'rr_1, rr_2,
rr_3, rr_4, rr_5'
latency average = 10.683 ms
latency stddev = 11.851 ms
initial connection time = 145.876 ms
tps = 2994.595673 (without initial connection time)

# Test failover_slots waiting on sync_rep no new shared cache
latency average = 10.684 ms
latency stddev = 12.247 ms
initial connection time = 142.561 ms
tps = 2994.160136 (without initial connection time)

# Test failover slots with additional shared cache
latency average = 10.674 ms
latency stddev = 11.917 ms
initial connection time = 142.486 ms
tps = 2997.315874 (without initial connection time)

The tps improvement between no cache and the shared cache seems marginal, but
we do see a slight improvement in stddev, which makes sense from a contention
perspective. I think the cache would show a lot more improvement if we had,
say, 1000 logical slots all trying to obtain SyncRepLock to update their
values.
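
To make the cache idea concrete, here is a toy sketch of the fast-path/slow-path pattern (all names here are hypothetical illustrations, not the patch's actual symbols; the real code operates on PostgreSQL's SyncRepLock and LSN fields in C):

```python
import threading

class SyncRepState:
    """Toy model of shared sync-rep state polled by many walsenders."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for SyncRepLock
        self._synced_lsn = 0           # authoritative value, lock-protected
        self.cached_lsn = 0            # published snapshot for lock-free checks

    def advance(self, lsn):
        # Writer path: update the authoritative value under the lock,
        # then publish the new value to the cache.
        with self._lock:
            if lsn > self._synced_lsn:
                self._synced_lsn = lsn
                self.cached_lsn = lsn

    def wait_satisfied(self, wanted_lsn):
        # Fast path: if the cached snapshot already covers wanted_lsn,
        # skip the lock entirely -- this is the contention being avoided.
        if self.cached_lsn >= wanted_lsn:
            return True
        # Slow path: re-check the authoritative value under the lock.
        with self._lock:
            return self._synced_lsn >= wanted_lsn

state = SyncRepState()
state.advance(100)
print(state.wait_satisfied(50))   # satisfied via the lock-free fast path
print(state.wait_satisfied(200))  # not yet synced; had to take the lock
```

With 1000 slots polling, most checks should hit the fast path, so the lock is only touched when a waiter is genuinely behind the cached value.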

I've attached the patch, but I don't feel particularly strongly about the new
shared LSN values.

Thanks,

--
John Hsu - Amazon Web Services

Attachment Content-Type Size
0003-Wait-on-synchronous-replication-by-default-for-logic.patch application/octet-stream 24.5 KB
