Re: [HACKERS] make async slave to wait for lsn to be replayed

From: Kartyshov Ivan <i(dot)kartyshov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Date: 2024-03-22 19:21:20
Message-ID: 8b67e6514c9ac3562ed675393208506a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:
> On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> <i(dot)kartyshov(at)postgrespro(dot)ru> wrote:
>> > 4.2 With an unreasonably high future LSN, BEGIN command waits
>> > unboundedly, shouldn't we check if the specified LSN is more than
>> > pg_last_wal_receive_lsn() error out?
>
> I think limiting wait lsn by current received lsn would destroy the
> whole value of this feature. The value is to wait till given LSN is
> replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

> But I don't see a problem here. On the replica, it's out of our
> control to check which lsn is good and which is not. We can't check
> whether the lsn, which is in future for the replica, is already issued
> by primary.
>
> For the case of wrong lsn, which could cause potentially infinite
> wait, there is the timeout and the manual query cancel.

Fully agree with this take.

>> > 4.3 With an unreasonably high wait time, BEGIN command waits
>> > unboundedly, shouldn't we restrict the wait time to some max
> value,
>> > say a day or so?
>> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
>> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000;
>>
>> Good idea, I put it 1 day. But this limit we should to discuss.
>
> Do you think that specifying timeout in milliseconds is suitable? I
> would prefer to switch to seconds (with ability to specify fraction of
> second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

>> > https://github.com/macdice/redo-bench or similar tools?
>
> Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

> I don't see this change in the patch. Normally if a process gets a
> signal, that causes WaitLatch() to exit immediately. It also exists
> immediately on query cancel. IIRC, this 1 minute timeout is needed to
> handle some extreme cases when an interrupt is missing. Other places
> have it equal to 1 minute. I don't see why we should have it
> different.

Ok, I`ll rollback my changes.

>> 4) added and expanded sections in the documentation
>
> I don't see this in the patch. I see only a short description in
> func.sgml, which is definitely not sufficient. We need at least
> everything we have in the docs before to be adjusted with the current
> approach of procedure.

I didn't find another section where to add the description of
pg_wait_lsn().
So I extend description on the bottom of the table.

>> 5) add default variant of timeout
>> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
>> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)
>
> Does zero here mean no timeout? I think this should be documented.
> Also, I would prefer to see the timeout by default. Probably one
> minute would be good for default.

Lets discuss this point. Loop in function WaitForLSN is made that way,
if we choose delay=0, only then we can wait infinitely to wait LSN
without timeout. So default must be 0.

Please take one more look on the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachment Content-Type Size
v14_0001-Procedure-wait-lsn.patch text/x-diff 18.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-03-22 19:31:22 Re: Cannot find a working 64-bit integer type on Illumos
Previous Message Robert Haas 2024-03-22 19:20:41 Re: documentation structure