From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Callahan, Drew" <callaan(at)amazon(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Potential data loss due to race condition during logical replication slot creation
Date: 2024-02-02 00:53:33
Message-ID: CAD21AoC8cKeRyQWtF0GCXtf=B_ry7EvDvnH1ZRuEm9Ld79er9Q@mail.gmail.com
Lists: pgsql-bugs
Hi,
Thank you for the report!
On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan(at)amazon(dot)com> wrote:
>
> Hello,
>
> We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of slot creation being only partially replicated. We found the cause to be the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of the issue below:
>
> Session 1:
>
> SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
>
> CREATE TABLE test (a int);
> BEGIN;
> INSERT INTO test VALUES (1);
>
> Session 2:
>
> SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
>
> <query hangs>
>
> Session 3:
>
> CHECKPOINT;
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> <should return nothing of interest>
>
> Session 1:
>
> INSERT INTO test VALUES (2);
> COMMIT;
>
> <Session 2 query no longer hangs and successfully creates slot2>
>
> Session 2:
>
> select pg_logical_slot_get_changes('slot1', NULL, NULL);
>
> select pg_logical_slot_get_changes('slot2', NULL, NULL);
>
> <expected: no rows of the txn are returned for slot2>
> <actual: The 2nd row of the txn is returned for slot2>
>
I've confirmed this happens on HEAD and all supported back branches,
and I think it's a (long-standing) bug.
I think we can prepare an isolation test of the above steps and
include it in contrib/test_decoding/specs.
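A rough sketch of such a spec (the file name, step names, and
permutation are mine, loosely following the style of the existing
specs in contrib/test_decoding/specs; the isolation tester handles the
blocked s1_init step as a waiting step):

setup
{
    DROP TABLE IF EXISTS tbl;
    CREATE TABLE tbl (a int);
}

teardown
{
    DROP TABLE tbl;
    SELECT 'stop' FROM pg_drop_replication_slot('slot1');
    SELECT 'stop' FROM pg_drop_replication_slot('slot2');
}

session "s0"
step "s0_init"    { SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'test_decoding'); }
step "s0_begin"   { BEGIN; }
step "s0_insert1" { INSERT INTO tbl VALUES (1); }
step "s0_insert2" { INSERT INTO tbl VALUES (2); }
step "s0_commit"  { COMMIT; }

session "s1"
step "s1_init"      { SELECT 'init' FROM pg_create_logical_replication_slot('slot2', 'test_decoding'); }
step "s1_get_slot2" { SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL, 'skip-empty-xacts', '1'); }

session "s2"
step "s2_checkpoint" { CHECKPOINT; }
step "s2_get_slot1"  { SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1'); }

# s1_init blocks until s0 commits; meanwhile the CHECKPOINT plus
# decoding slot1 serializes a snapshot that slot2's creation must not
# restore.  Expected: s1_get_slot2 returns no rows of s0's transaction.
permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_slot1" "s0_insert2" "s0_commit" "s1_get_slot2"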
> Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a checkpoint to get the current state of the running transactions on the system. This create process will then wait for all of the transactions within that running xact record to complete before being able to transition to the next snapbuild state. During this time period, if another running xact record is written and then a different logical replication process decodes this running xact record, a globally accessible snapshot will be persisted to disk.
>
> Once all of the transactions from the initial running xact have finished, the process performing the slot creation will become unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore that snapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed flush LSN of the slot to the start of the next record after that running xact.
My analysis is the same. After the slot creation, the slot's LSNs end
up ordered like:

(tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's further
changes) < slot's confirmed_flush_lsn < (tx-commit)
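FWIW the bad ordering is directly visible in pg_replication_slots
(a standard view, nothing hypothetical in this query):

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name IN ('slot1', 'slot2');

After the repro above, slot2's restart_lsn points into the middle of
the still-open transaction while its confirmed_flush_lsn points before
that transaction's commit record.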
> We now have a logical replication slot whose restart LSN is after some of the changes of transactions that will commit after our confirmed flush LSN, in any case where the two running xact records are not fully disjoint. Once those transactions commit, we will then partially stream their changes.
>
> The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it is building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.
>
> With this context, the snapshot builder can always skip performing a snapshot restore in order to become fully consistent. Since this issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent the issue by disallowing this behavior.
I basically agree with the idea of skipping the snapshot restore
while finding the initial start point. But I'd like to hear other
opinions too.
Another simple way to fix this would be to always set
need_full_snapshot to true during slot creation. But that would make
the snapshot builder also track non-catalog-change transactions while
finding the initial start point, which is not necessary.
As for the proposed patch, it uses a global variable, InCreate, and a
memory context callback to reset it. I believe this is to avoid
breaking the ABI, since this bug exists on the back branches too, but
I think it's better to first consider how to fix it on HEAD and then
how to backpatch it. We don't necessarily need to have the same fix
for all branches.
IIUC we need to skip snapshot restore only when we're finding the
initial start point. So probably we don't need to set InCreate when
calling create_logical_replication_slot() with find_startpoint =
false.
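To sketch what I mean for HEAD (SnapBuild, SnapBuildFindSnapshot, and
SnapBuildRestore are the real things in
src/backend/replication/logical/snapbuild.c; the in_slot_creation
field name is made up here, and this is only an illustration, not the
proposed patch):

struct SnapBuild
{
    /* ... existing fields ... */

    /*
     * True while finding the initial start point for a slot being
     * created.  In that case we must reach consistency on our own: a
     * snapshot serialized by another slot says nothing about the
     * transactions that were already running when our restart_lsn was
     * chosen.  (Hypothetical field, for illustration only.)
     */
    bool        in_slot_creation;
};

static bool
SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn,
                      xl_running_xacts *running)
{
    /* ... */

    /*
     * Don't adopt a serialized snapshot while finding the initial
     * start point; doing so can leave restart_lsn inside a running
     * transaction but confirmed_flush_lsn before its commit record,
     * so the transaction would later be streamed only partially.
     */
    if (!builder->in_slot_creation && SnapBuildRestore(builder, lsn))
        return false;

    /* ... the rest of the state machine as today ... */
}

And create_logical_replication_slot() with find_startpoint = false
would simply leave the flag unset, keeping the current restore
behavior there.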
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com