From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Synchronous commit behavior during network outage |
Date: | 2021-06-30 12:28:28 |
Message-ID: | 8848B234-F534-44BE-9EE8-43BC6D28B297@yandex-team.ru |
Lists: | pgsql-hackers |
> On 29 June 2021, at 23:35, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Tue, 2021-06-29 at 11:48 +0500, Andrey Borodin wrote:
>>> On 29 June 2021, at 03:56, Jeff Davis <pgsql(at)j-davis(dot)com>
>>> wrote:
>>>
>>> The patch may be somewhat controversial, so I'll wait for feedback
>>> before documenting it properly.
>>
>> The patch seems similar to [0]. But I like your wording :)
>> I'd be happy if we go with any version of these ideas.
>
> Thank you, somehow I missed that one, we should combine the CF entries.
>
> My patch also covers the backend termination case. Is there a reason
> you left that case out?
Yes: backend termination is used by the HA tool before rewinding the node. Initially I handled termination as a PANIC and got a ton of core dumps during failover drills.
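To make the difference between the two approaches concrete, here is a small, hypothetical decision table for a backend that is blocked waiting for the synchronous standby to acknowledge its commit. The names `Interrupt`, `Action`, `on_interrupt` and `POLICIES` are invented for illustration and are not code from either patch; "cancel_only" loosely mirrors the approach described above (cancels are ignored, termination still works so the HA tool can stop backends before a rewind), and "cancel_and_terminate" loosely mirrors a patch that also covers termination. The "stock" row assumes current behavior, where either interrupt ends the wait with a WARNING and the local-only commit becomes visible.

```python
from enum import Enum, auto


class Interrupt(Enum):
    CANCEL = auto()     # e.g. pg_cancel_backend() or a client-side query cancel
    TERMINATE = auto()  # e.g. pg_terminate_backend()


class Action(Enum):
    KEEP_WAITING = auto()  # stay blocked until the standby acknowledges the commit
    STOP_WAITING = auto()  # give up waiting; the local-only commit becomes visible


def on_interrupt(interrupt, ignore_cancel, ignore_terminate):
    """Hypothetical policy for a backend blocked in a synchronous replication wait."""
    if interrupt is Interrupt.CANCEL and ignore_cancel:
        return Action.KEEP_WAITING
    if interrupt is Interrupt.TERMINATE and ignore_terminate:
        return Action.KEEP_WAITING
    return Action.STOP_WAITING


# Hypothetical policy names mapped to (ignore_cancel, ignore_terminate).
POLICIES = {
    "stock": (False, False),               # both interrupts release the wait
    "cancel_only": (True, False),          # termination still works for the HA tool
    "cancel_and_terminate": (True, True),  # neither interrupt exposes unreplicated data
}

if __name__ == "__main__":
    # Print the decision for every interrupt under every policy.
    for name, flags in POLICIES.items():
        decisions = {i.name: on_interrupt(i, *flags).name for i in Interrupt}
        print(name, decisions)
```

The trade-off sits entirely in the TERMINATE row: ignoring it closes one more path to exposing unreplicated data, but takes away the tool the HA system relies on to stop backends before rewinding.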
There is one more caveat we need to fix: we should prevent instant crash recovery from happening. The HA tool must know that our process was restarted.
Consider the following scenario:
1. Node A is the primary, with synchronous replication.
2. A is cut off by a network partition; meanwhile, node B is promoted elsewhere.
3. All backends on A are stuck waiting for sync rep until the HA tool discovers that A is the failed node.
4. One backend crashes: a segfault in some buggy extension, an OOM kill, or whatever.
5. The Postgres server performs in-place crash recovery, without the postmaster exiting, making local-but-not-replicated data visible.
We should prevent step 5 as well, just as we prevent cancels. The HA tool will then detect the postmaster failure and recheck in the coordination system whether it may bring Postgres up locally.
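The scenario can be sketched as a toy model in Python. Everything here (`Node`, `CrashPolicy`, `ha_tool_decision`, `run_scenario`) is hypothetical and invented for illustration; it is not PostgreSQL code and ignores WAL, snapshots and timing. It assumes, as in PostgreSQL's commit path, that a transaction still waiting for the sync standby is not yet visible to other sessions, and that in-place crash recovery makes every locally flushed commit visible.

```python
from dataclasses import dataclass, field
from enum import Enum


class CrashPolicy(Enum):
    RECOVER_IN_PLACE = "recover_in_place"  # reset backends and replay WAL; postmaster stays up
    EXIT = "exit"                          # postmaster exits; the HA tool decides what happens next


@dataclass
class Node:
    local_commits: set = field(default_factory=set)  # commit records flushed to local WAL
    replicated: set = field(default_factory=set)     # commits acknowledged by the sync standby
    visible: set = field(default_factory=set)        # commits other sessions can see
    running: bool = True

    def commit(self, xid, standby_reachable):
        self.local_commits.add(xid)
        if standby_reachable:
            self.replicated.add(xid)
            self.visible.add(xid)
        # Otherwise the committing backend stays blocked in the sync rep wait,
        # and the transaction is not yet visible to other sessions.

    def backend_crash(self, policy):
        if policy is CrashPolicy.RECOVER_IN_PLACE:
            # Crash recovery replays every locally flushed commit,
            # whether or not the standby ever received it.
            self.visible = set(self.local_commits)
        else:
            self.running = False


def ha_tool_decision(node, promoted_elsewhere):
    """Hypothetical HA-tool logic driven by postmaster liveness."""
    if node.running:
        return "keep_running"  # postmaster is still up, so the node is not rechecked yet
    if promoted_elsewhere:
        return "rewind"        # another node is primary; rejoin as a standby
    return "start_locally"


def run_scenario(policy):
    a = Node()
    a.commit(1, standby_reachable=True)   # step 1: normal synchronous commit
    a.commit(2, standby_reachable=False)  # steps 2-3: partition, backend keeps waiting
    a.backend_crash(policy)               # steps 4-5: crash, then the chosen policy
    return a
```

Under `RECOVER_IN_PLACE`, transaction 2 ends up visible without ever reaching the standby, while the postmaster looks healthy. Under `EXIT`, nothing unreplicated becomes visible, and the HA tool gets a clear signal to consult the coordination system before starting anything.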
Thanks!
Best regards, Andrey Borodin.