Re: failure in 019_replslot_limit

From:	Alexander Lakhin <exclusion(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject:	Re: failure in 019_replslot_limit
Date:	2024-02-10 03:00:01
Message-ID:	ff7bad44-bc27-7179-e9ed-79cb6866fe03@gmail.com
Lists:	pgsql-hackers

09.02.2024 21:59, Andres Freund wrote:
>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
>> ) and saw that it's not checkpointer, but walsender is hanging:
> How did you reproduce this?

As kestrel didn't produce this failure until recently, I supposed that the
cause is the same as with subscription/031_column_list: longer test
duration. So I ran this test in parallel (with 20-30 jobs) in a slowed-down
VM, so that the duration of one successful test run increased to 100-120
seconds. And I was lucky enough to catch it within 100 iterations. But now
that we know what's happening there, I think I could reproduce it much more
easily, with some sleep(s) added, if that would be of any interest.

> So it's the issue that we wait effectively forever to send a FATAL. I've
> previously proposed that we should not block sending out fatal errors, given
> that allows clients to prevent graceful restarts and a lot of other things.
>

Yes, I had demonstrated one of those unpleasant things previously too:
https://www.postgresql.org/message-id/91c8860a-a866-71a7-a060-3f07af531295%40gmail.com

Best regards,
Alexander

In response to

Re: failure in 019_replslot_limit at 2024-02-09 18:59:15 from Andres Freund
