Re: Synchronizing slots from primary to standby

From: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-07 14:28:11
Message-ID: 5a858c63-cd59-4622-a9ca-eebbb7bd59cc@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 11/7/23 11:55 AM, Amit Kapila wrote:
> On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>>
>> On 10/31/23 10:37 AM, shveta malik wrote:
>>> On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand
>>>>
>>>> Agree with your test case, but in my case I was not using pub/sub.
>>>>
>>>> I was not clear, so when I said:
>>>>
>>>>>> - create logical_slot1 on the primary (and don't start using it)
>>>>
>>>> I meant don't start decoding from it (like using pg_recvlogical() or
>>>> pg_logical_slot_get_changes()).
>>>>
>>>> By using pub/sub the "don't start using it" is not satisfied.
>>>>
>>>> My test case is:
>>>>
>>>> "
>>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true);
>>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true);
>>>> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
>>>> "
>>>>
>>>
>>> Okay, I am able to reproduce it now. Thanks for clarification. I have
>>> tried to change the algorithm as per suggestion by Amit in [1]
>>>
>>> [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com
>>
>> Thanks!
>>
>>>
>>> This is not full proof solution but optimization over first one. Now
>>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots
>>> are available to be created). In first attempt, we do not wait
>>> indefinitely on inactive slots, we wait only for a fixed amount of
>>> time and if remote-slot is still behind, then we add that to the
>>> pending list and move to the next slot. Once we are done with first
>>> attempt, in second attempt, we go for the pending ones and now we wait
>>> on each of them until the primary catches up.
>>
>> Aren't we "just" postponing the "issue"? I mean if there is really no activity
>> on, say, the first created slot, then once we move to the second attempt then any newly
>> created slot from that time would wait to be synced forever, no?
>>
>
> We have to wait at some point in time for such inactive slots and the
> same is true even for manually created slots on standby. Do you have
> any better ideas to deal with it?
>

What about:

- get rid of the second attempt and the pending_slot_list
- keep the wait_count and PrimaryCatchupWaitAttempt logic

so basically, get rid of:

/*
* Now sync the pending slots which were failed to be created in first
* attempt.
*/
foreach(cell, pending_slot_list)
{
RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);

/* Wait until the primary server catches up */
PrimaryCatchupWaitAttempt = 0;

synchronize_one_slot(wrconn, remote_slot, NULL);
}

and the pending_slot_list list.

That way, for each slot that have not been created and synced yet:

- it will be created on the standby
- we will wait up to PrimaryCatchupWaitAttempt attempts
- the slot will be synced or removed on/from the standby

That way an inactive slot on the primary would not "block"
any other slots on the standby.

By "created" here I mean calling ReplicationSlotCreate() (not to be confused
with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); "
which is confusing as mentioned up-thread).

The problem I can see with this proposal is that the "sync" window waiting
for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt
attempts (as the slot will be dropped/recreated).

If we think this window is too short we could:

- increase it
or
- don't drop the slot once created (even if there is no activity
on the primary during PrimaryCatchupWaitAttempt attempts) so that
the next loop of attempts will compare with "older" LSN/xmin (as compare to
dropping and re-creating the slot). That way the window would be since the
initial slot creation.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2023-11-07 14:33:55 Re: Infinite Interval
Previous Message Jonathan S. Katz 2023-11-07 14:02:03 Re: 2023-11-09 release announcement draft