Re: Synchronous commit behavior during network outage

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Synchronous commit behavior during network outage
Date: 2021-06-30 12:28:28
Message-ID: 8848B234-F534-44BE-9EE8-43BC6D28B297@yandex-team.ru
Lists: pgsql-hackers

> On 29 June 2021, at 23:35, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Tue, 2021-06-29 at 11:48 +0500, Andrey Borodin wrote:
>>> On 29 June 2021, at 03:56, Jeff Davis <pgsql(at)j-davis(dot)com>
>>> wrote:
>>>
>>> The patch may be somewhat controversial, so I'll wait for feedback
>>> before documenting it properly.
>>
>> The patch seems similar to [0]. But I like your wording :)
>> I'd be happy if we go with any version of this idea.
>
> Thank you, somehow I missed that one, we should combine the CF entries.
>
> My patch also covers the backend termination case. Is there a reason
> you left that case out?
Yes, backend termination is used by the HA tool before rewinding the node. Initially I treated termination as PANIC and got a ton of core dumps during failover drills.

There is one more caveat we need to fix: we should prevent instant recovery from happening. The HA tool must know that our process was restarted.
Consider the following scenario:
1. Node A is the primary, with sync rep.
2. A is cut off by a network partition; somewhere else, node B is promoted.
3. All backends of A are stuck waiting for sync rep until the HA tool discovers that A is the failed node.
4. One backend crashes: a segfault in some buggy extension, OOM, or whatever.
5. The Postgres server performs crash recovery without a postmaster restart, making local-but-not-replicated data visible.

We should prevent step 5 as well, just as we prevent cancels. The HA tool will then discover the postmaster failure and recheck in the coordination system whether it can bring Postgres up locally.
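
To illustrate the recheck described above, here is a minimal Python sketch of the decision an HA tool might make once it sees that the postmaster has exited. Everything in it is hypothetical: the names (ClusterView, Action, on_postmaster_exit) and the idea of a single leader key in the coordination system are assumptions for illustration only, not the API of any real HA tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    START_AS_PRIMARY = "start_as_primary"
    REWIND_AND_FOLLOW = "rewind_and_follow"
    STAY_DOWN = "stay_down"


@dataclass
class ClusterView:
    """Hypothetical snapshot of what the coordination system reports."""
    reachable: bool          # could we reach the coordination system at all?
    leader: Optional[str]    # node currently recorded as primary, if any


def on_postmaster_exit(node: str, view: ClusterView) -> Action:
    """Decide what to do with the local Postgres after the postmaster exited.

    The point of the sketch: Postgres is never started locally as a primary
    unless the coordination system confirms this node is still the leader.
    """
    if not view.reachable:
        # Possibly partitioned: starting up could expose
        # local-but-not-replicated data, so stay down.
        return Action.STAY_DOWN
    if view.leader == node:
        # Still the recorded primary: safe to bring Postgres up locally.
        return Action.START_AS_PRIMARY
    if view.leader is None:
        # No leader recorded: be conservative and wait for an election.
        return Action.STAY_DOWN
    # Another node was promoted: this node must be rewound first.
    return Action.REWIND_AND_FOLLOW
```

In the scenario above, node A would get REWIND_AND_FOLLOW once it can see that B is the leader, and STAY_DOWN while it is still partitioned. Either way the local-but-not-replicated data never becomes visible.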

Thanks!

Best regards, Andrey Borodin.
