Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Roman Eskin <r(dot)eskin(at)arenadata(dot)io>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done
Date: 2025-01-28 09:51:29
Message-ID: A950518B-4116-492B-8773-C9A5CE1620AF@yandex-team.ru
Lists: pgsql-hackers

> On 21 Jan 2025, at 16:47, Roman Eskin <r(dot)eskin(at)arenadata(dot)io> wrote:
>
>>
>> Persisting recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems every extra _timeout_ is a source of complexity, uncertainty and despair.
>
> The approach is not about persisting the signal files for some timeout. Currently the files are removed in StartupXLOG() before writeTimeLineHistory() and PerformRecoveryXLogAction() are called. The suggestion is to move the file removal after PerformRecoveryXLogAction() inside StartupXLOG().

Sending a node into a repeated promote-fail cycle without resolving the root cause seems like an even less appealing idea.
If something prevented promotion, why should we retry by this particular method?

Even in the case of a transient failure like the one you described (power loss), retrying promotion after the node comes back online does not sound like a very good idea. The user will get an unexpected split-brain.
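
To make the ordering under discussion concrete, here is a minimal toy model in C. It is not PostgreSQL source: the enum values only stand in for the steps named above (signal file removal, writeTimeLineHistory(), PerformRecoveryXLogAction()), and the names Step, current_order, proposed_order and signal_file_survives are hypothetical, introduced for illustration. It answers one question only: if the process dies after a given number of steps, is the signal file still on disk?

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not PostgreSQL code: stand-ins for the steps named in the thread. */
typedef enum
{
    STEP_REMOVE_SIGNAL_FILES,     /* removal of the recovery signal files */
    STEP_WRITE_TIMELINE_HISTORY,  /* stands in for writeTimeLineHistory() */
    STEP_END_OF_RECOVERY_RECORD   /* stands in for PerformRecoveryXLogAction() */
} Step;

#define NSTEPS 3

/* Order as described in the quoted text: files are removed first. */
const Step current_order[NSTEPS] = {
    STEP_REMOVE_SIGNAL_FILES,
    STEP_WRITE_TIMELINE_HISTORY,
    STEP_END_OF_RECOVERY_RECORD
};

/* Suggested order: files are removed after the end-of-recovery step. */
const Step proposed_order[NSTEPS] = {
    STEP_WRITE_TIMELINE_HISTORY,
    STEP_END_OF_RECOVERY_RECORD,
    STEP_REMOVE_SIGNAL_FILES
};

/*
 * Run the first `completed` steps of `order`, then simulate a crash.
 * Returns true if the signal file would still exist at that point.
 */
bool
signal_file_survives(const Step *order, int completed)
{
    bool signal_file_present = true;

    assert(completed >= 0 && completed <= NSTEPS);

    for (int i = 0; i < completed; i++)
    {
        if (order[i] == STEP_REMOVE_SIGNAL_FILES)
            signal_file_present = false;
    }
    return signal_file_present;
}
```

In this model, signal_file_survives(current_order, 2) is false: a crash right before the end-of-recovery step leaves no signal file behind. With proposed_order the file survives any crash before the final step, which is exactly the case where the node would come back and attempt promotion again. What the restarted node then does is outside the model.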

Best regards, Andrey Borodin.