RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Callahan, Drew" <callaan(at)amazon(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Potential data loss due to race condition during logical replication slot creation
Date: 2024-03-27 10:37:02
Message-ID: TYCPR01MB12077A67B15F682BC4DC835E4F5342@TYCPR01MB12077.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Dear Sawada-san,

>
> With the PoC patch, we check ondisk.builder.is_there_running_xact in
> SnapBuildRestore(),

Yes, the PoC requires reading the snapshot state from the serialized file.

> but can we just check running->xcnt in
> SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is,
> if builder->initial_xmin_horizon is valid (or
> builder->finding_start_point is true) and running->xcnt > 0, we skip
> the snapshot restore.

IIUC, this approach does not require modifying the API, which may be an advantage.
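
As I understand the suggestion, the check would look roughly like the sketch
below. This is standalone C with simplified stand-in types: SketchBuilder,
SketchRunningXacts and should_skip_restore() are hypothetical names used only
for illustration, not the real SnapBuild or xl_running_xacts definitions, and
only the fields mentioned above are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical stand-in for the snapshot builder state. */
typedef struct SketchBuilder
{
    TransactionId initial_xmin_horizon; /* valid only while creating a slot */
    bool          finding_start_point;  /* flag proposed in the PoC discussion */
} SketchBuilder;

/* Hypothetical stand-in for the running-xacts record. */
typedef struct SketchRunningXacts
{
    int xcnt;                           /* number of running transactions */
} SketchRunningXacts;

/*
 * Return true if the serialized snapshot should NOT be restored: we are
 * still searching for the initial start point and some transaction is
 * running according to the record.
 */
static bool
should_skip_restore(const SketchBuilder *builder,
                    const SketchRunningXacts *running)
{
    bool creating_slot =
        builder->initial_xmin_horizon != InvalidTransactionId ||
        builder->finding_start_point;

    return creating_slot && running->xcnt > 0;
}
```

The intent is that a caller such as SnapBuildFindSnapshot() would consult a
check like this before calling SnapBuildRestore(); where exactly it would be
placed is my assumption.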

> However, I think there are still cases where we
> unnecessarily skip snapshot restores
>
> Probably, what we would like to avoid is, we compute
> initial_xmin_horizon and start to find the initial start point while
> there is a concurrently running transaction, and then jump to the
> consistent state by restoring the consistent snapshot before the
> concurrent transaction commits.

Yeah, information from before the concurrent transactions commit should not be
used. I think that's why SnapBuildWaitSnapshot() waits until the listed
transactions have finished.

> So we can ignore snapshot restores if
> (oldest XID among transactions running at the time of
> CreateInitDecodingContext()) >= (OldestRunningXID in
> xl_running_xacts).
>
> I've drafted this idea in the attached patch just for discussion.

Thanks for sharing the patch. I confirmed that, at least, all the tests and the
workload you pointed out in [1] pass. I will post here if I find other issues.
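
For readers following along, the condition quoted above can be written as a
small comparison. The sketch below is not taken from the patch: the function
names are hypothetical, and xid_follows_or_equals() is a simplified
wraparound-aware comparison that assumes both inputs are normal XIDs (special
XIDs are not handled).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified "a >= b" for 32-bit XIDs using modulo-2^32 arithmetic.
 * Assumes both values are normal XIDs; special XIDs are not handled.
 */
static bool
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Return true if the snapshot restore should be ignored:
 *   oldest_xid_at_init : oldest XID among transactions running at the
 *                        time of CreateInitDecodingContext()
 *   oldest_running_xid : OldestRunningXID in the xl_running_xacts record
 */
static bool
skip_restore_by_oldest_xid(TransactionId oldest_xid_at_init,
                           TransactionId oldest_running_xid)
{
    return xid_follows_or_equals(oldest_xid_at_init, oldest_running_xid);
}
```

With this form, a record whose OldestRunningXID has advanced past the XID
remembered at initialization no longer causes the restore to be skipped.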

[1]: https://www.postgresql.org/message-id/CAD21AoDzLY9vRpo%2Bxb2qPtfn46ikiULPXDpT94sPyFH4GE8bYg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
