Re: Timeout failure in 019_replslot_limit.pl

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeout failure in 019_replslot_limit.pl
Date: 2021-09-07 03:01:06
Message-ID: 20210907.120106.1483239898065111540.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 7 Sep 2021 09:37:10 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> > I scraped the buildfarm logs looking for similar failures, and didn't
> > find any. (019_replslot_limit.pl hasn't failed at all in the farm
> > since the last fix it received, in late July.)
>
> The interesting bits are in 019_replslot_limit_primary3.log. In a
> failed run, I can see that we get immediately a process termination,
> as follows:
> 2021-09-07 07:52:53.402 JST [22890] LOG: terminating process 23082 to release replication slot "rep3"
> 2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL: terminating connection due to administrator command
> 2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> In a successful run, the pattern is different:
> 2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL: terminating connection due to administrator command
> 2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
> 2021-09-07 09:27:39.832 JST [57092] LOG: invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
> 2021-09-07 09:27:39.833 JST [57092] LOG: checkpoint complete: wrote
> 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled;
> write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
> 2021-09-07 09:27:39.833 JST [57092] LOG: checkpoints are occurring too frequently (0 seconds apart)
> 2021-09-07 09:27:39.833 JST [57092] HINT: Consider increasing the configuration parameter "max_wal_size".
> 2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
>
> The slot invalidation is forgotten because we don't complete a
> checkpoint that does the work it should do, no? There is a completed
> checkpoint before we query pg_replication_slots, and the buildfarm
> shows the same thing.

It seems like the "kill 'STOP'" in the script didn't suspend the
processes before advancing WAL. The attached uses 'ps' command to
check that since I didn't come up with the way to do the same in Perl.

I'm still not sure it works as expected, though. (Imagining the case
where the state changes before the process actually stops..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
make_019_to_wait_to_stop.diff text/x-patch 1.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-09-07 03:08:47 Re: [BUG] Failed Assertion in ReorderBufferChangeMemoryUpdate()
Previous Message Michael Paquier 2021-09-07 02:30:59 Re: Postgres perl module namespace