Re: Get rid of WALBufMappingLock

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Victor Yegorov <vyegorov(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-02-18 00:21:13
Message-ID: Z7PSeT7AbrjmiCCV@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 17, 2025 at 11:25:05AM -0500, Tom Lane wrote:
> This timeout failure on hachi looks suspicious as well:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2025-02-17%2003%3A05%3A03
>
> Might be relevant that they are both aarch64?

Just logged into the host. The logs of the timed out run are still
around, and the last information I can see is from lastcommand.log,
which seems to have frozen in time when the timeout has begun its
vacuuming work:
ok 73 + index_including_gist 353 ms
# parallel group (16 tests): create_cast errors create_aggregate drop_if_exists infinite_recurse

gokiburi is on the same host, and it is currently frozen in time when
trying to fetch a WAL buffer. One of the stack traces:
#2 0x000000000084ec48 in WaitEventSetWaitBlock (set=0xd34ce0,
cur_timeout=-1, occurred_events=0xffffffffadd8, nevents=1) at
latch.c:1571
#3 WaitEventSetWait (set=0xd34ce0, timeout=-1,
occurred_events=occurred_events(at)entry=0xffffffffadd8,
nevents=nevents(at)entry=1, wait_event_info=<optimized out>,
wait_event_info(at)entry=134217781) at latch.c:1519
#4 0x000000000084e964 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents(at)entry=33, timeout=timeout(at)entry=-1,
wait_event_info=wait_event_info(at)entry=134217781) at latch.c:538
#5 0x000000000085d2f8 in ConditionVariableTimedSleep
(cv=0xffffec0799b0, timeout=-1, wait_event_info=134217781) at
condition_variable.c:163
#6 0x000000000085d1ec in ConditionVariableSleep
(cv=0xfffffffffffffffc, wait_event_info=1) at condition_variable.c:98
#7 0x000000000055f4f4 in AdvanceXLInsertBuffer
(upto=upto(at)entry=112064880, tli=tli(at)entry=1, opportunistic=false) at
xlog.c:2224
#8 0x0000000000568398 in GetXLogBuffer (ptr=ptr(at)entry=112064880,
tli=tli(at)entry=1) at xlog.c:1710
#9 0x000000000055c650 in CopyXLogRecordToWAL (write_len=80,
isLogSwitch=false, rdata=0xcc49b0 <hdr_rdt>, StartPos=<optimized out>,
EndPos=<optimized out>, tli=1) at xlog.c:1245
#10 XLogInsertRecord (rdata=rdata(at)entry=0xcc49b0 <hdr_rdt>,
fpw_lsn=fpw_lsn(at)entry=112025520, flags=0 '\000', num_fpi=<optimized
out>, num_fpi(at)entry=0, topxid_included=false) at xlog.c:928
#11 0x000000000056b870 in XLogInsert (rmid=rmid(at)entry=16 '\020',
info=<optimized out>, info(at)entry=16 '\020') at xloginsert.c:523
#12 0x0000000000537acc in addLeafTuple (index=0xffffebf32950,
state=0xffffffffd5e0, leafTuple=0xe43870, current=<optimized out>,
parent=<optimized out>,

So, yes, something looks really wrong with this patch. Sounds
plausible to me that some other buildfarm animals could be stuck
without their owners knowing about it. It's proving to be a good idea
to force a timeout value in the configuration file of these animals..
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2025-02-18 00:29:50 Re: Get rid of WALBufMappingLock
Previous Message David G. Johnston 2025-02-18 00:05:03 Re: New "single" COPY format