From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-25 15:19:29
Message-ID: CAPpHfdsXCRg-arg3CFAYR4PXWTc7G1qY=2V7Yap3j83_5qe+Hg@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 18, 2025 at 2:29 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Tue, Feb 18, 2025 at 2:21 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote:
> > > This timeout failure on hachi looks suspicious as well:
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03
> > >
> > > Might be relevant that they are both aarch64?
> >
> > Just logged into the host. The logs of the timed out run are still
> > around, and the last information I can see is from lastcommand.log,
> > which seems to have frozen in time when the timeout has begun its
> > vacuuming work:
> > ok 73 + index_including_gist 353 ms
> > # parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse
> >
> > gokiburi is on the same host, and it is currently frozen in time when
> > trying to fetch a WAL buffer. One of the stack traces:
> > #2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0,
> > cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at
> > latch.c:1571
> > #3 WaitEventSetWait (set=0xd34ce0, timeout=-1,
> > occurred_events=occurred_events(at)entry=0xffffffffadd8,
> > nevents=nevents(at)entry=1, wait_event_info=<optimized out>,
> > wait_event_info(at)entry=134217781) at latch.c:1519
> > #4 0x000000000084e964 in WaitLatch (latch=<optimized out>,
> > wakeEvents=wakeEvents(at)entry=33, timeout=timeout(at)entry=-1,
> > wait_event_info=wait_event_info(at)entry=134217781) at latch.c:538
> > #5 0x000000000085d2f8 in ConditionVariableTimedSleep
> > (cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at
> > condition_variable.c:163
> > #6 0x000000000085d1ec in ConditionVariableSleep
> > (cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98
> > #7 0x000000000055f4f4 in AdvanceXLInsertBuffer
> > (upto=upto(at)entry=112064880, tli=tli(at)entry=1, opportunistic=false) at
> > xlog.c:2224
> > #8 0x0000000000568398 in GetXLogBuffer (ptr=ptr(at)entry=112064880,
> > tli=tli(at)entry=1) at xlog.c:1710
> > #9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80,
> > isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>,
> > EndPos=<optimized out>, tli=1) at xlog.c:1245
> > #10 XLogInsertRecord (rdata=rdata(at)entry=0xcc49b0 <hdr_rdt>,
> > fpw_lsn=fpw_lsn(at)entry=112025520, flags=0 '\000', num_fpi=<optimized
> > out>, num_fpi(at)entry=0, topxid_included=false) at xlog.c:928
> > #11 0x000000000056b870 in XLogInsert (rmid=rmid(at)entry=16 '\020',
> > info=<optimized out>, info(at)entry=16 '\020') at xloginsert.c:523
> > #12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950,
> > state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>,
> > parent=<optimized out>,
> >
> > So, yes, something looks really wrong with this patch. Sounds
> > plausible to me that some other buildfarm animals could be stuck
> > without their owners knowing about it. It's proving to be a good idea
> > to force a timeout value in the configuration file of these animals..
>
> Tom, Michael, thank you for the information.
> This patch will be better tested before next attempt.
It seems that I managed to reproduce the issue on my Raspberry Pi 4.
After running our test suite in a loop for two days, I found one timeout.
I have a hypothesis on why it might happen.  We don't have protection
against two backends in parallel getting ReservedPtr values mapped to the
same XLog buffer.  I've talked to Yura off-list about that.  He pointed
out that XLogWrite() should issue a PANIC in that case, which we didn't
observe.  However, I'm not sure this analysis is complete.
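For illustration, here is a minimal standalone sketch (plain C, not server
code) of why two reserved insert positions can collide on one buffer slot:
the slot is derived as (ptr / XLOG_BLCKSZ) modulo the number of WAL
buffers, the same shape as the XLogRecPtrToBufIdx() macro in xlog.c, so
two positions exactly one buffer cycle apart land on the same slot.  The
constants and the sample position are illustrative only.

/*
 * Standalone illustration (not server code): WAL insert positions map to
 * buffer slots as (ptr / XLOG_BLCKSZ) % number-of-buffers, mirroring the
 * XLogRecPtrToBufIdx() macro in xlog.c.  Two positions exactly one buffer
 * cycle apart therefore land on the same slot.  Constants are illustrative.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define XLOG_BLCKSZ     8192    /* default WAL block size */
#define N_WAL_BUFFERS   512     /* illustrative wal_buffers value */

static uint64_t
buf_idx(uint64_t recptr)
{
    return (recptr / XLOG_BLCKSZ) % N_WAL_BUFFERS;
}

int
main(void)
{
    uint64_t a = 112064880;     /* the "upto" value from the stack trace above */
    uint64_t b = a + (uint64_t) N_WAL_BUFFERS * XLOG_BLCKSZ;

    printf("ptr %" PRIu64 " -> slot %" PRIu64 "\n", a, buf_idx(a));
    printf("ptr %" PRIu64 " -> slot %" PRIu64 "\n", b, buf_idx(b));
    /* Both print the same slot: without an interlock, two backends could
     * try to reuse/initialize that buffer page concurrently. */
    return 0;
}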
One way or another, we need protection against this situation.
The updated patch is attached.  Now, after acquiring ReservedPtr, it
waits until OldPageRqstPtr gets initialized.  Additionally, I had to
implement a more accurate calculation of OldPageRqstPtr.  I'm running
tests with the new patch on my Raspberry Pi in a loop.  Let's see how
it goes.
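To sketch the general idea of that wait (a toy model under my own
assumptions, not the attached patch): a backend that has reserved an
insert position must not reuse the buffer slot until the initialization
covering the page boundary it depends on has actually happened.  The
real code sleeps on a condition variable (visible in the
ConditionVariableSleep frame above) rather than spinning, and the exact
OldPageRqstPtr bookkeeping is more involved than this.

/*
 * Toy model, not the actual patch: one thread advances an atomic
 * "initialized up to" position page by page (a stand-in for the server's
 * InitializedUpTo bookkeeping); a writer that reserved a position waits
 * until initialization covers the page boundary it depends on.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define PAGE_SZ 8192

static _Atomic uint64_t initialized_upto;

static void *
initializer(void *arg)
{
    (void) arg;
    for (uint64_t page = 1; page <= 64; page++)
    {
        /* pretend to set up the page header here */
        atomic_store(&initialized_upto, page * PAGE_SZ);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   tid;
    uint64_t    reserved = 5 * PAGE_SZ + 100;             /* reserved insert position */
    uint64_t    needed   = reserved - reserved % PAGE_SZ; /* boundary we depend on */

    pthread_create(&tid, NULL, initializer, NULL);

    /* Wait until a concurrent initializer has covered the needed boundary. */
    while (atomic_load(&initialized_upto) < needed)
        ;                       /* the real code sleeps on a CV here */

    printf("initialization reached %" PRIu64 ", safe to proceed\n",
           atomic_load(&initialized_upto));
    pthread_join(tid, NULL);
    return 0;
}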
------
Regards,
Alexander Korotkov
Supabase
Attachment: v11-0001-Get-rid-of-WALBufMappingLock.patch (application/octet-stream, 14.6 KB)