Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Date: 2024-06-06 02:49:45
Message-ID: CAHut+PtZk8Q3k_gymTqkiBueB=BLAXBuhRfvvbc3wstXg7bzUA@mail.gmail.com
Lists: pgsql-hackers

Hi, I have reproduced this multiple times now.

I can confirm the steps from Alexander's initial post. The test script
he provided [1] gets itself into a state where the function
ReadPageInternal() (called by XLogDecodeNextRecord() at the code
commented "Wait for the next page to become available") constantly
returns XLREAD_FAIL. Ultimately the test times out because WalSndLoop()
loops forever: it never calls WalSndDone() to exit the walsender
process.
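
To make the shape of the stall concrete, here is a toy model in Python.
It is not PostgreSQL code: toy_sender_loop(), ReadResult and
always_fail() are hypothetical names, and the assumption that the exit
path is only reachable after a successful read is a simplification made
for illustration, not something established by the logs.

```python
from enum import Enum
from typing import Callable, Optional


class ReadResult(Enum):
    SUCCESS = "success"
    FAIL = "fail"  # stand-in for XLREAD_FAIL in this toy model


def toy_sender_loop(read_page: Callable[[], ReadResult],
                    max_iterations: int) -> Optional[int]:
    """Return the iteration at which the loop exits, or None if it never does.

    Assumption: the exit step (the stand-in for WalSndDone()) is only
    reached after a page read succeeds.
    """
    for iteration in range(1, max_iterations + 1):
        if read_page() is ReadResult.SUCCESS:
            return iteration  # exit path reached
        # Read failed: go around again, as the stuck loop does.
    return None  # iteration budget exhausted, analogous to the test timing out


def always_fail() -> ReadResult:
    """A reader that fails every time, like the failure cycle described above."""
    return ReadResult.FAIL
```

With always_fail() as the reader, the loop uses up its whole iteration
budget and returns None, which is the toy equivalent of the test timeout.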

~~~

I've made a patch to inject lots of logging, and when the test script
fails a cycle of function failures can be seen. I don't know how to
fix it yet, so I'm attaching my log results, hoping the information
may be useful for anyone familiar with this area of the code.

~~~

Attachment #1 "v1-0001-DEBUG-LOGGING.patch" -- Patch to inject some
logging. Be careful if you apply this because the resulting log files
can be huge (e.g. 3 GB).

Attachment #2 "bad8_logs_last500lines.txt" -- This is the last 500
lines of a 3G logfile from a failing test run.

Attachment #3 "bad8_logs_last500lines-simple.txt" -- Same log file as
above, but it's a simplified extract in which I showed the CYCLES of
failure more clearly.

Attachment #4 "bad8_diagram.pdf" -- Same execution path information as
in the log files, but in diagram form (just to help me visualise the
logic more easily).
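
For anyone who applies the logging patch and wants a similar "last 500
lines" extract from their own multi-gigabyte log, a minimal sketch
follows. The helper name last_lines() is hypothetical and this is not
necessarily how the attached extract was produced; it only shows one
way to keep the tail of a large file without loading it all into
memory.

```python
from collections import deque
from typing import List


def last_lines(path: str, count: int = 500) -> List[str]:
    """Return the last `count` lines of the file at `path`.

    The file is streamed line by line; the bounded deque discards older
    lines as it goes, so memory use stays proportional to `count`.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as log_file:
        return list(deque(log_file, maxlen=count))
```

The whole file is still read once from start to end, so on a 3 GB log
this takes a while, but it does not need 3 GB of memory.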

~~~

Just so you know, the test script does not always trigger the problem.
Sometimes it happens after just 20 script iterations; sometimes it
takes a very long time and multiple runs (e.g. 400-500 script
iterations). Either way, when the problem eventually occurs, the CYCLES
of ReadPageInternal() failures always have the same pattern shown in
the attached logs.
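
Because the failure is intermittent, it helps to rerun the test in a
loop and stop at the first failing run. The sketch below is a generic
illustration, not the script from [1]: run_until_failure() is a
hypothetical helper, and the command to run is whatever you use locally
to invoke the test.

```python
import subprocess
from typing import Optional, Sequence


def run_until_failure(command: Sequence[str],
                      max_iterations: int) -> Optional[int]:
    """Run `command` repeatedly; return the first iteration that fails.

    A non-zero exit status counts as a failure. Returns None if every
    one of the `max_iterations` runs succeeds.
    """
    for iteration in range(1, max_iterations + 1):
        result = subprocess.run(list(command))
        if result.returncode != 0:
            return iteration
    return None
```

Given the numbers above, an iteration budget of several hundred is
probably needed before concluding that a run did not hit the problem.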

======
[1] OP - https://www.postgresql.org/message-id/f15d665f-4cd1-4894-037c-afdbe369287e%40gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment                         Content-Type              Size
bad8_logs_last500lines.txt         text/plain                70.0 KB
v1-0001-DEBUG-LOGGING.patch        application/octet-stream  19.2 KB
bad8_logs_last500lines-simple.txt  text/plain                11.2 KB
bad8_diagram.pdf                   application/pdf           146.6 KB
