Re: long-standing data loss bug in initial sync of logical replication

From: Nitin Motiani <nitinmotiani(at)google(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-07-15 19:18:19
Message-ID: CAH5HC95cbeNSwMEgofojQRV2dXr6ERK_xk6VLj6PhN9YU_FFqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 15, 2024 at 11:42 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Mon, 15 Jul 2024 at 15:31, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jul 11, 2024 at 6:19 PM Nitin Motiani <nitinmotiani(at)google(dot)com> wrote:
> > > I looked further into the scenario of adding the tables in schema to
> > > the publication. Since in that case, the entry is added to
> > > pg_publication_namespace instead of pg_publication_rel, the codepaths
> > > for 'add table' and 'add tables in schema' are different. And in the
> > > 'add tables in schema' scenario, the OpenTableList function is not
> > > called to get the relation ids. Therefore even with the proposed
> > > patch, the data loss issue still persists in that case.
> > >
> > > To validate this idea, I tried locking all the affected tables in the
> > > schema just before the invalidation for those relations (in
> > > ShareRowExclusiveLock mode).
> > >
> >
> > This sounds like a reasonable approach to fix the issue. However, we
> > should check SET publication_object as well, especially the drop part
> > in it. It should not happen that we miss sending the data for ADD but
> > for DROP, we send data when we shouldn't have sent it.
>
> There were few other scenarios, similar to the one you mentioned,
> where the issue occurred. For example: a) When specifying a subset of
> existing tables in the ALTER PUBLICATION ... SET TABLE command, the
> tables that were supposed to be removed from the publication were not
> locked in ShareRowExclusiveLock mode. b) The ALTER PUBLICATION ...
> DROP TABLES IN SCHEMA command did not lock the relations that will be
> removed from the publication in ShareRowExclusiveLock mode. Both of
> these scenarios resulted in data inconsistency due to inadequate
> locking. The attached patch addresses these issues.
>

Hi,

A couple of questions on the latest patch :

1. I see there is this logic in PublicationDropSchemas to first check
if there is a valid entry for the schema in pg_publication_namespace

psid = GetSysCacheOid2(PUBLICATIONNAMESPACEMAP,

Anum_pg_publication_namespace_oid,

ObjectIdGetDatum(schemaid),

ObjectIdGetDatum(pubid));
if (!OidIsValid(psid))
{
if (missing_ok)
continue;

ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg("tables from schema
\"%s\" are not part of the publication",

get_namespace_name(schemaid))));
}

Your proposed change locks the schemaRels before this code block.
Would it be better to lock the schemaRels after the error check? So
that just in case, the publication on the schema is not valid anymore,
the lock is not held unnecessarily on all its tables.

2. The function publication_add_schema explicitly invalidates cache by
calling InvalidatePublicationRels(schemaRels). That is not present in
the current PublicationDropSchemas code. Is that something which
should be added in the drop scenario also? Please let me know if there
is some context that I'm missing regarding why this was not added
originally for the drop scenario.

Thanks & Regards,
Nitin Motiani
Google

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-07-15 19:28:29 Re: Compress ReorderBuffer spill files using LZ4
Previous Message Tom Lane 2024-07-15 19:15:44 Re: [PATCH v1] Fix parsing of a complex type that has an array of complex types