Re[2]: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284

From: Дмитрий <dsolik(at)mail(dot)ru>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re[2]: FATAL: could not send data to WAL stream: lost synchronization with server: got message type "0", length 892351284
Date: 2025-01-26 11:29:16
Message-ID: 1737890956.919358915@f75.i.mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


"How was it shut down, on purpose or a hardware/software issue?"
- I reboot the receiver every 2 minutes on purpose. I determined this time empirically, because replication breaks down approximately every minute and a half. The reboot helps to advance the receiver.

"Also do you have corresponding logs from primary?"
- Attached to this message.

"Unless, is there cascading replication going on?"
- No, this is replication from the leader. The leader has its two replicas and they are all in one data center. And the problematic replica is needed to migrate to another data center.

"Was that a manual intervention?"
- Yes, reboot on schedule, every two minutes.

"Is that what is shown above or have you restarted since the above and
the server is running?"
- Sometimes replication works without problems for several hours. But when a breakdown occurs, rebooting every two minutes helps to catch up with this replica.

Another interesting point. In addition to this replication, there are two more, to the same data center. One replication had the same problem, but a one-time restart helped to solve the problem, the replication is still working normally. And the second replication does not have such problems, it has been working since its launch, more than a month ago.
-- суббота, 25 января 2025г., 20:32 +03:00 от Adrian Klaver adrian(dot)klaver(at)aklaver(dot)com :

>On 1/25/25 09:03, Дмитрий wrote:
> 1) What sort of replication?
> - Streaming replication
>
> 2) Where are the two servers located relative to each other?
> - The servers are located in different data centers.
>
> 3) Has there been any software upgrades/network changes recently?
> - I don't know any information about the  software upgrades/network
>
>It would be a good thing to ask of those folks that do know.
>
> From the log attached to your initial post:
>
>2025-01-25 17:28:01.930 MSK [1196013] LOG: starting PostgreSQL 15.10 on
>x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat
>11.5.0-2), 64-bit
>2025-01-25 17:28:01.930 MSK [1196013] LOG: listening on IPv4 address
>"0.0.0.0", port 5432
>2025-01-25 17:28:01.931 MSK [1196013] LOG: listening on Unix socket
>"/run/postgresql/.s.PGSQL.5432"
>2025-01-25 17:28:01.932 MSK [1196013] LOG: listening on Unix socket
>"/tmp/.s.PGSQL.5432"
>2025-01-25 17:28:01.962 MSK [1196017] LOG: database system was shut
>down in recovery at 2025-01-25 17:28:01 MSK
>2025-01-25 17:28:01.962 MSK [1196017] LOG: entering standby mode
>
>How was it shut down, on purpose or a hardware/software issue?
>
>Also do you have corresponding logs from primary?
>
>2025-01-25 17:28:12.192 MSK [1196017] LOG: consistent recovery state
>reached at 1063C/D002DC68
>2025-01-25 17:28:12.192 MSK [1196017] LOG: incorrect resource manager
>data checksum in record at 1063C/D002DC68
>2025-01-25 17:28:12.192 MSK [1196013] LOG: database system is ready to
>accept read-only connections
>2025-01-25 17:28:12.205 MSK [1196019] LOG: started streaming WAL from
>primary at 1063C/D0000000 on timeline 61
>
>The recovery ended and the streaming started.
>
>Not sure if 'incorrect resource manager data checksum' is significant or
>not.
>
>2025-01-25 17:29:08.452 MSK [1196015] LOG: recovery restart point at
>1063C/DBC7E1D8
>2025-01-25 17:29:08.452 MSK [1196015] DETAIL: Last completed
>transaction was at log time 2025-01-25 16:23:08.828548+03.
>2025-01-25 17:29:24.553 MSK [1196015] LOG: restartpoint starting: wal
>2025-01-25 17:29:24.553 MSK [1196015] DEBUG: performing replication
>slot checkpoint
>2025-01-25 17:29:27.651 MSK [1196019] FATAL: could not send data to WAL
>stream: lost synchronization with server: got message type "0", length
>892351284
>2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in
>log segment 0000003D0001063D000000F4, offset 212992
>2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in
>log segment 0000003D0001063D000000F4, offset 212992
>2025-01-25 17:29:27.653 MSK [1196017] LOG: invalid magic number 3600 in
>log segment 0000003D0001063D000000F4, offset 212992
>
>This is where things fall apart. What confuses me is:
>
>"could not send data to WAL stream: lost synchronization with server:
>got message type "0", length 892351284 "
>
>If this is from the standby why is it sending data to the stream?
>
>Unless, is there cascading replication going on?
>
>
>2025-01-25 17:30:01.887 MSK [1196013] LOG: received fast shutdown request
>2025-01-25 17:30:01.888 MSK [1196013] LOG: aborting any active transactions
>
>Was that a manual intervention?
>
>2025-01-25 17:30:02.157 MSK [1196015] LOG: shutting down
>2025-01-25 17:30:02.181 MSK [1196013] LOG: database system is shut down
>2025-01-25 17:30:02.182 MSK [1196014] DEBUG: logger shutting down
>
>
>So the server went from start up to shut down in ~2 minutes.
>
> From your original post:
>
>'Restarting PostgreSQL helps.'
>
>Is that what is shown above or have you restarted since the above and
>the server is running?
>
>
>
>
>
>
>--
>Adrian Klaver
>adrian(dot)klaver(at)aklaver(dot)com

Attachment Content-Type Size
source_postgresql-Sun_7.log text/x-log 1.5 MB
receiver_postgresql-Sun_7.log text/x-log 193.1 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2025-01-26 15:37:05 Re: Using Expanded Objects other than Arrays from plpgsql
Previous Message Andrey Borodin 2025-01-26 10:07:22 Re: Using Expanded Objects other than Arrays from plpgsql