From: | Peter Smith <smithpb2250(at)gmail(dot)com> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: 001_rep_changes.pl fails due to publisher stuck on shutdown |
Date: | 2024-06-06 02:49:45 |
Message-ID: | CAHut+PtZk8Q3k_gymTqkiBueB=BLAXBuhRfvvbc3wstXg7bzUA@mail.gmail.com |
Lists: | pgsql-hackers |
Hi, I have reproduced this multiple times now.
I confirmed the initial post/steps from Alexander, i.e. the test
script provided [1] gets itself into a state where the function
ReadPageInternal() (called by XLogDecodeNextRecord() and commented "Wait
for the next page to become available") constantly returns
XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
forever, since it never calls WalSndDone() to exit the walsender
process.
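For illustration only, the stall described above can be pictured with a toy model. This is not the walsender code: sender_loop, always_fail, and fail_then_succeed are hypothetical names, the numeric result codes are placeholders, and the model assumes (as the description above suggests) that the loop can only finish once a page read succeeds. The iteration cap stands in for the test timeout.

```python
# Toy model only; none of this is PostgreSQL code.
XLREAD_SUCCESS = 0   # placeholder value for illustration
XLREAD_FAIL = -1     # placeholder value for illustration


def sender_loop(read_page, max_iterations):
    """Call read_page() until it succeeds or the iteration cap is hit.

    Returns ("done", n) if the n-th read succeeded, which is where the
    real loop would be able to finish, or ("stuck", max_iterations) if
    every read failed, which models the test timing out.
    """
    for iteration in range(1, max_iterations + 1):
        if read_page() == XLREAD_SUCCESS:
            return ("done", iteration)
    return ("stuck", max_iterations)


def always_fail():
    """Models a page read that constantly returns XLREAD_FAIL."""
    return XLREAD_FAIL


def fail_then_succeed(n_failures):
    """Build a reader that fails n_failures times, then succeeds."""
    remaining = [n_failures]

    def read_page():
        if remaining[0] > 0:
            remaining[0] -= 1
            return XLREAD_FAIL
        return XLREAD_SUCCESS

    return read_page


if __name__ == "__main__":
    print(sender_loop(fail_then_succeed(2), 1000))
    print(sender_loop(always_fail, 1000))
```

In the healthy case the reader eventually succeeds and the loop ends; in the failing case nothing inside the loop changes the outcome of the next read, so only the external cap stops it.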
~~~
I've made a patch to inject lots of logging, and when the test script
fails, a repeating cycle of function failures can be seen. I don't know how to
fix it yet, so I'm attaching my log results, hoping the information
may be useful for anyone familiar with this area of the code.
~~~
Attachment #1 "v1-0001-DEBUG-LOGGING.patch" -- Patch to inject some
logging. Be careful if you apply this, because the resulting log files
can be huge (e.g. 3 GB).
Attachment #2 "bad8_logs_last500lines.txt" -- This is the last 500
lines of a 3 GB logfile from a failing test run.
Attachment #3 "bad8_logs_last500lines-simple.txt" -- Same log file as
above, but it's a simplified extract in which I showed the CYCLES of
failure more clearly.
Attachment #4 "bad8_diagram.pdf" -- Same execution path information as in
the log files, but in diagram form (just to help me visualise the
logic more easily).
~~~
Just so you know, the test script does not always cause the problem.
Sometimes it happens after just 20 script iterations; other times it
takes a very long time and multiple runs (e.g. 400-500 script
iterations). Either way, when the problem eventually occurs, the CYCLES
of the ReadPageInternal() failures always have the same pattern
shown in these attached logs.
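For anyone wanting to check a multi-gigabyte log of their own for such a cycle, here is a minimal sketch of one way to do it. It is not how the attached extracts were produced. The helper names tail_lines and trailing_cycle are hypothetical, and the sketch assumes the lines compare equal from one cycle to the next, so timestamps, PIDs, or LSNs would have to be stripped first.

```python
from collections import deque


def tail_lines(path, n=500):
    """Return the last n lines of a text file without holding it all in memory."""
    with open(path, "r", errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


def trailing_cycle(lines, min_repeats=3):
    """Find the shortest block of lines that repeats back to back at the end.

    Returns (cycle, repeats), or (None, 0) if no block repeats at least
    min_repeats times.
    """
    n = len(lines)
    for period in range(1, n // min_repeats + 1):
        cycle = lines[n - period:]
        repeats = 1
        # Walk backwards one block at a time while the block still matches.
        while (repeats + 1) * period <= n and \
                lines[n - (repeats + 1) * period:n - repeats * period] == cycle:
            repeats += 1
        if repeats >= min_repeats:
            return cycle, repeats
    return None, 0


if __name__ == "__main__":
    sample = ["start", "a", "b", "c", "a", "b", "c", "a", "b", "c"]
    print(trailing_cycle(sample))
```

Reading only the tail keeps memory use flat regardless of the log size, and reporting the shortest period avoids describing two or three repetitions of the cycle as one longer cycle.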
======
[1] OP - https://www.postgresql.org/message-id/f15d665f-4cd1-4894-037c-afdbe369287e%40gmail.com
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment | Content-Type | Size |
---|---|---|
bad8_logs_last500lines.txt | text/plain | 70.0 KB |
v1-0001-DEBUG-LOGGING.patch | application/octet-stream | 19.2 KB |
bad8_logs_last500lines-simple.txt | text/plain | 11.2 KB |
bad8_diagram.pdf | application/pdf | 146.6 KB |