From: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> |
Cc: | Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org |
Subject: | Re: logical replication - still unstable after all these months |
Date: | 2017-05-29 11:14:21 |
Message-ID: | ded92771-9ce9-8a82-6ed5-1b5c834ce77d@2ndquadrant.com |
Lists: | pgsql-hackers |
On 29/05/17 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood
> <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
> wrote:
>
> The framework ran 600 tests last night, and I see 3 'NOK' results,
> i.e 3 failed test runs (all scale 25 and 8 pgbench clients). Given
> the way the test decides on failure (gets tired of waiting for the
> table md5's to match) - it begs the question 'What if it had waited
> a bit longer'? However from what I can see in all cases:
>
> - the rowcounts were the same in master and replica
> - the md5 of pgbench_accounts was different
>
>
> All four tables should be wrong if there is still a transaction it is
> waiting for, as all the changes happen in a single transaction.
Not necessarily. If the bug is in the sync worker or in the sync-to-apply
handover code (one of the more complicated parts of logical replication,
so it's a prime candidate), then it can easily affect just one table.
> I also got a failure, after 87 iterations of a similar test case. It
> waited for hours, as mine requires manual intervention to stop waiting.
> On the subscriber, one account still had a zero balance, while the
> history table on the subscriber agreed with both history and accounts on
> the publisher and the account should not have been zero, so definitely a
> transaction atomicity got busted.
I am glad others are able to reproduce this; my machine is still at 0
failures after 800 cycles.
>
> What would you want to look at? Would saving the WAL from the master be
> helpful?
>
Useful info would be: logs from the provider (mainly the logical decoding
lines that mention LSNs), logs from the subscriber (the lines about when
sync workers finished), the contents of pg_subscription_rel (with srrelid
cast to regclass so we know which table is which), and pg_waldump output
around the LSNs found in the logs and in pg_subscription_rel (plus a few
lines before and after for context). It's enough to look only at the
LSNs for the table(s) that are out of sync.
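
As a rough sketch of how that information could be gathered, the Python
below holds a catalog query that casts srrelid to regclass and builds a
pg_waldump command line around a given LSN. The helper names, the WAL
directory argument, and the byte margin are assumptions made for
illustration only; nothing here comes from the test framework discussed
in this thread.

```python
import shlex

# Per-table sync state on the subscriber. srrelid is cast to regclass so
# each row shows a table name. Columns are those of pg_subscription_rel
# as of PostgreSQL 10.
SUBSCRIPTION_REL_QUERY = """
SELECT srsubid,
       srrelid::regclass AS table_name,
       srsubstate,
       srsublsn
FROM pg_subscription_rel
ORDER BY srrelid;
"""


def parse_lsn(text):
    """Convert an LSN such as '0/16B3748' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value):
    """Convert a 64-bit integer back to the 'X/X' LSN notation."""
    return "%X/%X" % (value >> 32, value & 0xFFFFFFFF)


def waldump_command(lsn, wal_dir, margin=8192):
    """Build a pg_waldump call covering `margin` bytes either side of lsn.

    Hypothetical helper: the default margin is an arbitrary value chosen
    for illustration, not a recommendation.
    """
    center = parse_lsn(lsn)
    start = format_lsn(max(center - margin, 0))
    end = format_lsn(center + margin)
    args = ["pg_waldump", "--path", wal_dir, "--start", start, "--end", end]
    return " ".join(shlex.quote(arg) for arg in args)
```

The query would be run on the subscriber, and waldump_command() called
once per LSN of interest for the out-of-sync table(s), pointing wal_dir
at the saved WAL from the publisher. The computed start is unlikely to
fall on a record boundary, in which case pg_waldump skips forward to the
first valid record after it.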
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services