Quick Links

RE: Data is copied twice when specifying both child and parent table in publication

From:	"wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Cc:	"houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject:	RE: Data is copied twice when specifying both child and parent table in publication
Date:	2022-04-24 06:16:04
Message-ID:	OS3PR01MB6275CC45E9BC9297261865559EF99@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thur, Apr 21, 2022 at 5:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
Thanks for your comments.

> On Tue, Apr 19, 2022 at 2:23 PM shiy(dot)fnst(at)fujitsu(dot)com <shiy(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Tue, Apr 19, 2022 3:05 PM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > One suggestion is that can we simplify the code by moving the logic
> > > of checking the ancestor into the SQL ?. For example, we could
> > > filter the outpout of pg_publication_tables by adding A WHERE clause
> > > which checks whether the table is a partition and if its ancestor is
> > > also in the output. I think we can also filter the needless partition in this
> approach.
> > >
> >
> > I agreed with you and I tried to fix this problem in a simpler way.
> > What we want is to exclude the partitioned table whose ancestor is
> > also need to be replicated, so how about implementing that by using
> > the following SQL when getting the table list from publisher?
> >
> > SELECT DISTINCT ns.nspname, c.relname
> > FROM pg_catalog.pg_publication_tables t JOIN pg_catalog.pg_namespace
> > ns ON ns.nspname = t.schemaname JOIN pg_catalog.pg_class c ON
> > c.relname = t.tablename AND c.relnamespace = ns.oid WHERE t.pubname IN
> > ('p0','p2') AND (c.relispartition IS FALSE OR NOT EXISTS (SELECT 1
> > FROM pg_partition_ancestors(c.oid) WHERE relid IN ( SELECT DISTINCT
> > (schemaname||'.'||tablename)::regclass::oid
> > FROM pg_catalog.pg_publication_tables t WHERE t.pubname IN ('p0','p2')
> > ) AND relid != c.oid));
> >
> > Please find the attached patch which used this approach, I also merged
> > the test in Wang's patch into it.
> >
>
> I think this will work but do we need "... relid != c.oid" at the end of the query? If
> so, why? Please use an alias for pg_partition_ancestors to make the statement
> understandable.
I think we need this (relid != c.oid). Because when we use function
pg_partition_ancestors(c.oid), its return value not only has ancestors, but
also the input table. That is to say, when we use the table (c.oid) of the
outer query to filter in the sub-query, the table of the outer query will also
appear in the result of the sub-query.
So, I think we need this condition to prevent filtering out itself.

> Now, this solution will work but I find this query a bit complex and will add some
> overhead as we are calling pg_publication_tables multiple times. So, I was
> wondering if we can have a new function pg_get_publication_tables which
> takes multiple publications as input and return the list of qualified tables? I think
> for back branches we need something on the lines of what you have proposed
> but for HEAD we can have a better solution.
Yes, it sounds reasonable to me. Now, to fix this bug:
In the patch for HEAD, add a new function pg_get_publications_tables to get
tables info from a publications array.
In the patch for back-branch (now just share the patch for REL14), modify the
SQL to get tables info.

> IIRC, the column list and row filter also have some issues exactly due to this
> reason, so, I would like those cases to be also mentioned here and probably
> include the tests for them in the patch for HEAD.
Improve the test case about the column list and row filter to cover this bug.

Attach the new patches.[suggestions by Amit-San]
The patch for HEAD:
1. Add a new function to get tables info by a publications array.
The patch for REL14:
1. Use an alias to make the statement understandable. BTW, I adjusted the alignment.
2. Improve the test cast about the column list and row filter to cover this bug.

Regards,
Wang wei

Attachment	Content-Type	Size
HEAD_v1-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch	application/octet-stream	12.1 KB
REL14_v3-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch	application/octet-stream	5.4 KB

In response to

Re: Data is copied twice when specifying both child and parent table in publication at 2022-04-21 09:41:00 from Amit Kapila

Responses

RE: Data is copied twice when specifying both child and parent table in publication at 2022-04-25 01:23:42 from wangw.fnst@fujitsu.com
RE: Data is copied twice when specifying both child and parent table in publication at 2022-04-28 01:22:13 from shiy.fnst@fujitsu.com

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrew Dunstan	2022-04-24 17:09:08	Re: pgsql: Allow db.schema.table patterns, but complain about random garbag
Previous Message	David Rowley	2022-04-24 05:16:25	Re: Use generation context to speed up tuplesorts