Re: Timeout failure in 019_replslot_limit.pl

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Timeout failure in 019_replslot_limit.pl
Date: 2021-09-27 06:02:25
Message-ID: YVFecUdAYWUBIHxP@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 25, 2021 at 05:12:42PM +0530, Amit Kapila wrote:
> Now, in the failed run, it appears that due to some reason WAL sender
> has not released the slot. Is it possible to see if the WAL sender is
> still alive when a checkpoint is stuck at ConditionVariableSleep? And
> if it is active, what is its call stack?

I got again a failure today, so I have used this occasion to check that
when the checkpoint gets stuck the WAL sender process getting SIGCONT
is still around, waiting for a write to happen:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10
frame #1: 0x000000010fe50a43 postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7
frame #2: 0x000000010fe4ffd0 postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at latch.c:1396:8
frame #3: 0x000000010fc586c4 postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, len=122694) at be-secure.c:298:3
frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7
frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at pqcomm.c:1298:8
frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', s="SFATAL", len=112) at pqcomm.c:1479:6
frame #7: 0x000000010fc67318 postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9
frame #8: 0x00000001100a469f postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3
frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3
frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2
* frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at postgres.c:3191:4
frame #12: 0x000000010fe0bbe5 postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at walsender.c:2550)) at walsender.c:2285:3
frame #13: 0x000000010fe0754f postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3
frame #14: 0x000000010fe06149 postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" 0/700000 TIMELINE 1") at walsender.c:1652:6
frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", username="mpaquier") at postgres.c:4493:12

It logs its FATAL "terminating connection due to administrator
command" coming from ProcessInterrupts(), and then it sits idle on
ClientWrite.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-09-27 06:23:07 Re: Timeout failure in 019_replslot_limit.pl
Previous Message Amit Kapila 2021-09-27 05:55:34 Re: Skipping logical replication transactions on subscriber side