Re: Race conditions in 019_replslot_limit.pl

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-17 03:11:30
Message-ID: 3566584.1645067490@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-02-16 20:22:23 -0500, Tom Lane wrote:
>> There's no disconnection log entry for either, which I suppose means
>> that somebody didn't bother logging disconnection for walsenders ...

> The thing is, we actually *do* log disconnection for walsenders:

Ah, my mistake, now I do see a disconnection entry for the other walsender
launched by the basebackup.

> Starting a node in recovery and having it connect to the primary seems like a
> mighty long time for a process to exit, unless it's stuck behind something.

Fair point. Also, 019_replslot_limit.pl hasn't been changed in any
material way in months, but *something's* changed recently, because
this just started. I scraped the buildfarm for instances of
"Failed test 'have walsender pid" going back 6 months, and what I find is

   sysname    | branch |      snapshot       |     stage     |                      l
--------------+--------+---------------------+---------------+---------------------------------------------
 desmoxytes   | HEAD   | 2022-02-15 04:42:05 | recoveryCheck | # Failed test 'have walsender pid 1685516
 idiacanthus  | HEAD   | 2022-02-15 07:24:05 | recoveryCheck | # Failed test 'have walsender pid 2758549
 serinus      | HEAD   | 2022-02-15 11:00:08 | recoveryCheck | # Failed test 'have walsender pid 3682154
 desmoxytes   | HEAD   | 2022-02-15 11:04:05 | recoveryCheck | # Failed test 'have walsender pid 3775359
 flaviventris | HEAD   | 2022-02-15 18:03:48 | recoveryCheck | # Failed test 'have walsender pid 1517077
 idiacanthus  | HEAD   | 2022-02-15 22:48:05 | recoveryCheck | # Failed test 'have walsender pid 2494972
 desmoxytes   | HEAD   | 2022-02-15 23:48:04 | recoveryCheck | # Failed test 'have walsender pid 3055399
 desmoxytes   | HEAD   | 2022-02-16 10:48:05 | recoveryCheck | # Failed test 'have walsender pid 1593461
 komodoensis  | HEAD   | 2022-02-16 21:16:04 | recoveryCheck | # Failed test 'have walsender pid 3726703
 serinus      | HEAD   | 2022-02-17 01:18:17 | recoveryCheck | # Failed test 'have walsender pid 208363

So (a) it broke around 48 hours ago, which is already a useful
bit of info, and (b) your animals seem far more susceptible than
anyone else's. Why do you suppose that is?
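
For anyone who wants to re-check (a) and (b) against the rows above, here is
a minimal sketch in Python. It is not the query used to scrape the buildfarm;
the row data is copied by hand from the table, and the names FAILURES,
tally_by_animal and failure_window are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# (animal, snapshot) pairs copied by hand from the table above.
FAILURES = [
    ("desmoxytes", "2022-02-15 04:42:05"),
    ("idiacanthus", "2022-02-15 07:24:05"),
    ("serinus", "2022-02-15 11:00:08"),
    ("desmoxytes", "2022-02-15 11:04:05"),
    ("flaviventris", "2022-02-15 18:03:48"),
    ("idiacanthus", "2022-02-15 22:48:05"),
    ("desmoxytes", "2022-02-15 23:48:04"),
    ("desmoxytes", "2022-02-16 10:48:05"),
    ("komodoensis", "2022-02-16 21:16:04"),
    ("serinus", "2022-02-17 01:18:17"),
]


def tally_by_animal(rows):
    """Count how many failures each buildfarm animal reported."""
    return Counter(name for name, _ in rows)


def failure_window(rows):
    """Return the earliest and latest snapshot timestamps in the rows."""
    stamps = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for _, ts in rows]
    return min(stamps), max(stamps)


counts = tally_by_animal(FAILURES)
first, last = failure_window(FAILURES)

# Per-animal failure counts, most affected first.
for name, n in counts.most_common():
    print(f"{name:<13} {n}")
print("first failure snapshot:", first)
print("last failure snapshot: ", last)
```

The earliest snapshot is what supports the "around 48 hours ago" estimate,
and the per-animal counts show how the failures are concentrated on a
handful of machines.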

regards, tom lane
