Re: logical replication - still unstable after all these months

From:	Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc:	Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject:	Re: logical replication - still unstable after all these months
Date:	2017-05-29 11:14:21
Message-ID:	ded92771-9ce9-8a82-6ed5-1b5c834ce77d@2ndquadrant.com
Lists:	pgsql-hackers

On 29/05/17 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
> <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
>
> The framework ran 600 tests last night, and I see 3 'NOK' results,
> i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given
> the way the test decides on failure (gets tired of waiting for the
> table md5's to match) - it begs the question 'What if it had waited
> a bit longer'? However from what I can see in all cases:
>
> - the rowcounts were the same in master and replica
> - the md5 of pgbench_accounts was different
>
>
> All four tables should be wrong if there is still a transaction it is
> waiting for, as all the changes happen in a single transaction.

Not necessarily. If the bug is in the sync worker or in the sync-to-apply
handover code (one of the more complicated parts of logical replication,
so it's a prime candidate), then it can easily affect just one table.

> I also got a failure, after 87 iterations of a similar test case. It
> waited for hours, as mine requires manual intervention to stop waiting.
> On the subscriber, one account still had a zero balance, while the
> history table on the subscriber agreed with both history and accounts on
> the publisher and the account should not have been zero, so transaction
> atomicity definitely got busted.

I am glad others are able to reproduce this; my machine is still at 0
failures after 800 cycles.

>
> What would you want to look at? Would saving the WAL from the master be
> helpful?
>

Useful info would be: logs from the provider (mainly the logical decoding
lines that mention LSNs), logs from the subscriber (the lines about when
sync workers finished), the contents of pg_subscription_rel (with srrelid
cast to regclass so we know which table is which), and pg_waldump output
around the LSNs found in the logs and in pg_subscription_rel (plus a few
lines before and after for context). It's enough to only care about LSNs
for the table(s) that are out of sync.
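
For anyone collecting this, here is a rough Python sketch of the two
mechanical parts: the pg_subscription_rel query with srrelid cast to
regclass, and a pg_waldump command line covering a window around one LSN.
The helper names, the "pg_wal" directory argument and the byte-sized
window are assumptions made for illustration; "a few lines before and
after" really means WAL records, so the window may need widening by hand.

```python
import shlex

# Query for the subscriber: per-table sync state and LSN, with the table
# OID rendered as a name via the regclass cast.
SUBSCRIPTION_REL_QUERY = (
    "SELECT srrelid::regclass AS table_name, srsubstate, srsublsn "
    "FROM pg_subscription_rel;"
)


def parse_lsn(text):
    """Convert an LSN such as '0/16B3748' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value):
    """Convert an integer byte position back to the 'X/Y' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def waldump_command(wal_dir, lsn, before=8192, after=8192):
    """Build a pg_waldump command covering a byte window around `lsn`.

    `before` and `after` are hypothetical window sizes in bytes, not
    record counts; adjust them until enough context is visible.
    """
    center = parse_lsn(lsn)
    start = max(center - before, 0)
    end = center + after
    args = [
        "pg_waldump",
        "-p", wal_dir,            # directory holding the WAL segments
        "-s", format_lsn(start),  # start of the window
        "-e", format_lsn(end),    # end of the window
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(SUBSCRIPTION_REL_QUERY)
    print(waldump_command("pg_wal", "0/16B3748"))
```

Running the query through psql on the subscriber and the printed command
on the provider, once per out-of-sync table, should produce the output
asked for above.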

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Re: logical replication - still unstable after all these months at 2017-05-29 01:33:51 from Jeff Janes

Responses

Re: logical replication - still unstable after all these months at 2017-05-29 21:06:48 from Mark Kirkwood
