Re: Single transaction in the tablesync worker?

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>
Subject: Re: Single transaction in the tablesync worker?
Date: 2020-12-21 09:35:21
Message-ID: CAHut+Pv2dWz+SkOUm-ERzOoY1PMRv0mejsT98O6oNCraKVH6xg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync never yet got to
SYNCDONE state (maybe crashed) then finish_sync_worker may not be
called.
So I think a rogue tablesync slot might still exist during DropSubscription.

>
> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.
(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far. All TAP tests pass, but more testing needed
for the origin stuff

>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far tested like this - I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row to cause a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and re-launches. Then I can remove the extra row so
the PK violation does not happen, so the (re-launched) tablesync can
complete and finish normally. The Apply worker then takes over.

I have attached some captured/annotated logging of my test scenario
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on)

---
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment Content-Type Size
Test-20201218-tablesync-finds-COPYDONE-C.txt text/plain 26.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2020-12-21 09:39:04 Re: [PATCH] Logical decoding of TRUNCATE
Previous Message Peter Smith 2020-12-21 09:15:12 Re: Single transaction in the tablesync worker?