Re: logical replication and PANIC during shutdown checkpoint in publisher

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication and PANIC during shutdown checkpoint in publisher
Date: 2017-06-02 01:14:31
Message-ID: 20170602011431.grwjm3e5kdrfkihd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-06-02 10:05:21 +0900, Michael Paquier wrote:
> On Fri, Jun 2, 2017 at 9:29 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2017-06-02 08:38:51 +0900, Michael Paquier wrote:
> >> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler.
> >> > Normally INT is used cancel interrupts, and since walsender is now also
> >> > working as a normal backend, this overlap is bad. Even for plain
> >> > walsender backend this seems bad, because now non-superusers replication
> >> > users will terminate replication connections when they do
> >> > pg_cancel_backend(). For replication=dbname users it's especially bad
> >> > because there can be completely legitimate uses of pg_cancel_backend().
> >>
> >> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN
> >> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up
> >> for a non-am_walsender backend. Am I missing something?
> >
> > Yes, but nothing in those observeration actually addresses my point?
>
> I am still confused by your previous email, which, at least it seems
> to me, implies that logical WAL senders have been working correctly
> with query cancellations. Now SIGINT is just ignored, which means that
> pg_cancel_backend() has never worked for WAL senders until now, and
> this behavior has not changed with 086221c. So there is no new
> breakage introduced by this commit. I get your point to reuse SIGINT
> for query cancellations though, but that's a new feature.

The issue is that the commit made a non-existant feature
(pg_cancel_backend() to walsenders) into a broken one (pg_cancel_backend
terminates walsenders). Additionally v10 added something new
(walsenders executing SQL), and that will need at least some signal
handling fixes - hard to do if e.g. SIGINT is reused for something else.

> > a) upon shutdown checkpointer (so we can use procsignal), not
> > postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so
> > we don't have to use up a normal signal handler)
>
> You'll need a second one that wakes up the latch of the WAL senders to
> send more WAL records.

Don't think so, procsignal_sigusr1_handler serves quite well for that.
There's nearby discussion that we need to do so anyway, to fix recovery
conflict interrupts, parallelism interrupts and such.

> > b) Upon reception walsenders immediately exit if !replication_active,
> > otherwise sets got_STOPPING
>
> Okay, that's what happens now anyway, any new replication command
> received results in an error. I actually prefer the way of doing in
> HEAD, which at least reports an error.

Err, no. What happens right now is that plainly nothing is done if a
connection is idle or busy executing things. Only if a new command is
sent we error out - that makes very little sense.

> > c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as
> > currently done, moving to WALSNDSTATE_STOPPING. Not yet quite sure
> > how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop().
>
> Wouldn't it make sense to have the logical receivers be able to
> receive WAL up to the end of checkpoint record?

Yea, that's what I'm doing. For that we really only need to change the
check in WalSndWaitForWal() check of got_SIGINT to got_STOPPING, and add
a XLogSendLogical() check in the WalSndCaughtUp if() that sets
got_SIGUSR2 *without* setting WALSNDSTATE_STOPPING (otherwise we'd
possibly continue to trigger wal records until the send buffer is
emptied).

- Andres

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-06-02 01:20:40 Re: "create publication..all tables" ignore 'partition not supported' error
Previous Message Michael Paquier 2017-06-02 01:13:54 Re: [JDBC] Channel binding support for SCRAM-SHA-256