Re: logical replication - still unstable after all these months

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: logical replication - still unstable after all these months
Date: 2017-06-01 22:46:22
Message-ID: 389d0619-b35d-b349-5303-c82723dfdf84@catalyst.net.nz
Lists: pgsql-hackers

On 31/05/17 21:16, Petr Jelinek wrote:

> On 29/05/17 23:06, Mark Kirkwood wrote:
>> On 29/05/17 23:14, Petr Jelinek wrote:
>>
>>> On 29/05/17 03:33, Jeff Janes wrote:
>>>
>>>> What would you want to look at? Would saving the WAL from the master be
>>>> helpful?
>>>>
>>> Useful info would be: logs from the provider (mainly the logical decoding
>>> logs that mention LSNs), logs from the subscriber (the lines about when sync
>>> workers finished), the contents of pg_subscription_rel (with srrelid
>>> cast to regclass so we know which table is which), and pg_waldump
>>> output around the LSNs found in the logs and in pg_subscription_rel
>>> (plus a few lines before and after for context). It's enough to only
>>> care about LSNs for the table(s) that are out of sync.
>>>
>> I have a run that aborted with failure (accounts table md5 mismatch).
>> Petr - would you like to have access to the machine? If so, send me your
>> public key and I'll set it up.
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> The busy loop in apply stops at the point when the worker's shmem state
> indicates that table synchronization has finished, but that state might not
> yet be visible in the next transaction if it takes a long time to flush the
> final commit to disk. So we might ignore a couple of transactions for the
> given table in the main apply because we think it's still being
> synchronized. This also explains why I could not reproduce it on my testing
> machine (a fast SSD array where flushes were always fast) and why it
> happens relatively rarely: it's one specific commit during the whole
> synchronization process that needs to be slow.
>
> So, as a solution, I changed the busy loop in apply to wait for the
> in-catalog status rather than the in-memory status, to make sure things are
> really there and flushed.
>
> While working on this I realized that the handover itself is a bit more
> complex than necessary (especially for debugging and for other people
> trying to understand it), so I made some small changes as part of this
> patch so that the sequence of states a table goes through during the
> synchronization process is always the same. This might cause one
> unnecessary update per table synchronization in some cases, but that seems
> like a small enough price to pay for clearer logic. It also fixes another
> potential bug that I identified, where we might write the wrong state to
> the catalog if the main apply crashed while the sync worker was waiting
> for a status update.
>
> I've been running tests on this overnight on another machine where I was
> able to reproduce the original issue within a few runs (once I found what
> causes it), and so far it looks good.
>
>
>

I'm seeing a new failure with the patch applied - this time the history
table has missing rows. Petr, I'll put back your access :-)

regards

Mark
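
The race Petr describes in the quoted text above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code: the class SyncHandover, the function run_apply, the state names "syncing"/"done", and the notion of "ticks" are all hypothetical and exist purely for illustration. The model assumes the sync worker marks itself finished in memory immediately, while the catalog state becomes visible only after a simulated flush delay, and that each applied transaction consults the catalog state to decide whether to skip the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyncHandover:
    """Toy model (hypothetical) of one table's sync state seen from two places."""
    shmem_state: str = "syncing"    # in-memory state set by the sync worker
    catalog_state: str = "syncing"  # state visible to newly started transactions
    flush_ticks_left: int = 0       # simulated time until the final commit is flushed

    def worker_finishes(self, flush_ticks: int) -> None:
        # The in-memory state flips at once; the catalog state only becomes
        # visible after the final commit has been flushed.
        self.shmem_state = "done"
        self.flush_ticks_left = flush_ticks
        if flush_ticks == 0:
            self.catalog_state = "done"

    def tick(self) -> None:
        # Advance simulated time by one step.
        if self.shmem_state == "done" and self.catalog_state != "done":
            self.flush_ticks_left -= 1
            if self.flush_ticks_left <= 0:
                self.catalog_state = "done"


def run_apply(wait_on: str, flush_ticks: int, n_transactions: int) -> List[int]:
    """Return the transaction numbers the main apply skipped for the table.

    wait_on is either "shmem_state" (the behaviour described as buggy) or
    "catalog_state" (the behaviour described as the fix).
    """
    handover = SyncHandover()
    handover.worker_finishes(flush_ticks)

    # Busy loop: wait until the chosen state reports that sync is done.
    while getattr(handover, wait_on) != "done":
        handover.tick()

    skipped = []
    for txn in range(n_transactions):
        # Each transaction decides from the catalog whether the table is
        # still being synchronized; if so, its changes are ignored.
        if handover.catalog_state != "done":
            skipped.append(txn)
        handover.tick()
    return skipped


if __name__ == "__main__":
    print("wait on shmem, slow flush:  ", run_apply("shmem_state", 2, 5))
    print("wait on shmem, fast flush:  ", run_apply("shmem_state", 0, 5))
    print("wait on catalog, slow flush:", run_apply("catalog_state", 2, 5))
```

In this model, waiting on the in-memory state loses transactions only when the flush is slow, which matches the observation that the problem did not show up on a machine with fast flushes. Waiting on the catalog state loses none, at the cost of a longer wait.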
