Re: Data is copied twice when specifying both child and parent table in publication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Data is copied twice when specifying both child and parent table in publication
Date: 2021-10-20 10:19:46
Message-ID: CAA4eK1+Y0cP+xgZuHHxvsO=hQ+Zrp4GbBRTKH6kCGq3=FfVAHA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 20, 2021 at 3:03 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
>
> On Wed, Oct 20, 2021 at 7:59 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > > > Actually, at least with the scenario I gave steps for, after looking
> > > > at it again and debugging, I think that the behavior is understandable
> > > > and not a bug.
> > > > The reason is that the INSERTed data is first published through the
> > > > partitions, since initially there is no partitioned table in the
> > > > publication (so publish_via_partition_root=true doesn't have any
> > > > effect). But then adding the partitioned table to the publication and
> > > > refreshing the publication in the subscriber, the data is then
> > > > published "using the identity and schema of the partitioned table" due
> > > > to publish_via_partition_root=true. Note that the corresponding table
> > > > in the subscriber may well be a non-partitioned table (or the
> > > > partitions arranged differently) so the data does need to be
> > > > replicated again.
> > >
> >
> > Even if the partitions are arranged differently why would the user
> > expect the same data to be replicated twice?
> >
>
> It's the same data, but published in different ways because of changes
> the user made to the publication.
> I am not talking in general, I am specifically referring to the
> scenario I gave steps for.
> In the example scenario I gave, initially when the subscription was
> made, the publication just explicitly included the partitions, but
> publish_via_partition_root was true. So in this case it publishes
> through the individual partitions (as no partitioned table is present
> in the publication). Then, on the publisher side, the partitioned table
> was added to the publication, and ALTER SUBSCRIPTION ... REFRESH
> PUBLICATION was run on the subscriber side. Now that the
> partitioned table is present in the publication and
> publish_via_partition_root is true, it is "published using the
> identity and schema of the partitioned table rather than that of the
> individual partitions that are actually changed". So the data is
> replicated again.
>

I don't see why the data needs to be replicated again even in that case.
Can you see any such duplicate data replicated for non-partitioned
tables?

> This scenario didn't use initial table data, so initial table sync
> didn't come into play.
>

It is equivalent to an initial sync, because the tablesync worker
would copy all of the data again in this case unless we pass
copy_data = false during the refresh.
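
For reference, the sequence under discussion can be sketched roughly as
follows. This is only an illustration of the commands involved, not the
exact steps given earlier in the thread and not a verified reproduction:
the object names (tbl, tbl_p1, pub, sub) and the connection string are
hypothetical, and the subscriber is assumed to already have matching
tables.

```sql
-- Publisher. All names here are hypothetical, chosen for illustration.
CREATE TABLE tbl (a int) PARTITION BY RANGE (a);
CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES FROM (0) TO (100);

-- Only the partition is published at first, so changes are published
-- through the partition even though publish_via_partition_root is true.
CREATE PUBLICATION pub FOR TABLE tbl_p1
    WITH (publish_via_partition_root = true);

-- Subscriber (assumes tables named tbl and tbl_p1 already exist there).
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher dbname=postgres'  -- assumed connection string
    PUBLICATION pub;

-- Publisher: insert a row, then add the partitioned table.
INSERT INTO tbl VALUES (1);
ALTER PUBLICATION pub ADD TABLE tbl;

-- Subscriber: with the default copy_data = true, the refresh starts a
-- table sync for the newly added table, which copies its existing rows.
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;

-- Alternative to the command above: refresh without copying existing rows.
-- ALTER SUBSCRIPTION sub REFRESH PUBLICATION WITH (copy_data = false);
```

The last two statements are alternatives; only the copy_data = false
form avoids the second copy by the tablesync worker.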

--
With Regards,
Amit Kapila.
