Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2022-01-15 00:51:01
Message-ID: 20220115005101.xkfe232v3rgcj3sv@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-01-15 13:19:42 +1300, Thomas Munro wrote:
> On Sat, Jan 15, 2022 at 11:44 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > The patch Alexander tested most recently uses a tri-state eof flag [...]
> >
> > What about instead giving WalReceiverConn an internal WaitEventSet, and using
> > that consistently? I've attached a draft for that.
> >
> > Alexander, could you test with that patch applied?
>
> Isn't your patch nearly identical to one that I already posted, that
> Alexander tested and reported success with here?

Sorry, somehow I missed that across the many patches in the thread... And yes,
it does look remarkably similar.

> https://www.postgresql.org/message-id/5d507424-13ce-d19f-2f5d-ab4c6a987316%40gmail.com
>
> I can believe that fixes walreceiver

One thing that still bothers me around this is that we didn't detect the
problem of the dead walreceiver connection, even after missing the
FD_CLOSE. There's plenty other ways that a connection can get stalled for
prolonged was, that we'd not get notified about either. That's why there's
wal_receiver_timeout, after all.

But from what I can see wal_receiver_timeout doesn't work even halfway
reliably, because of all the calls down to libpqrcv_PQgetResult, where we just
block indefinitely?

This actually seems like a significant issue to me, and not just on windows.

> (if we're sure that there isn't a libpq-changes-the-socket problem)

I just don't see what that problem could be, once connection is
established. The only way a the socket fd could change is a reconnect, which
doesn't happen automatically.

I actually was searching the archives for threads on it, but I didn't find
much besides the references around [1]. And I didn't see a concrete risk
explained there?

> but AFAICS the same problem exists for postgres_fdw and async append.

Perhaps - but I suspect it'll matter far less with them than with walreceiver.

> That's why I moved to trying to
> fix the multiple-WES thing (though of course I agree we should be
> using long lived WESes wherever possible

That approach seems like a very very leaky bandaid, with a decent potential
for unintended consequences. Perhaps there's nothing better that we can do,
but I'd rather try to fix the problem closer to the root...

> I just didn't think that seemed back-patchable, so it's more of a feature
> patch for the future).

Hm, it doesn't seem crazy invasive to me. But I guess we might be looking at a
revert of the shutdown changes for now anyway? In that case we should be
fixing this anyway, but we might be able to afford doing it in master only?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BzCNJZBXcURPdQvdY-tjyD0y7Li2wZEC6XChyUej1S5w%40mail.gmail.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2022-01-15 02:31:23 Re: Add 64-bit XIDs into PostgreSQL 15
Previous Message James Coleman 2022-01-15 00:42:27 Add last commit LSN to pg_last_committed_xact()