Logical replication timeout

From: RECHTÉ Marc <marc(dot)rechte(at)meteo(dot)fr>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Logical replication timeout
Date: 2024-11-06 07:36:51
Message-ID: 638764862.181008636.1730878611279.JavaMail.zimbra@meteo.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown,
due to a timeout from the subscriber side (either wal_receiver_timeout or connexion drop by network equipment due to inactivity).

The problem is, that due to that failure, the wal_receiver process stops. When the wal_sender is ready to send some data, it finds the connexion broken and exits.
A new wal_sender process is created that restarts from the beginning (restart LSN). This is an endless loop.

Checking the network connexion between wal_sender and wal_receiver, we found that no traffic occurs for hours.

We first increased wal_receiver_timeout up to 12h and still got a disconnection on the receiver party:

2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR: terminating logical replication worker due to timeout
2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1356203) exited with exit code 1

Then put this parameter to 0, but got then disconnected by the network (note the slight difference in message):

2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR: could not receive data from WAL stream: could not receive data from server: Connection timed out
2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1697787) exited with exit code 1

The message is generated in libpqrcv_receive function (replication/libpqwalreceiver/libpqwalreceiver.c) which calls pqsecure_raw_read (interfaces/libpq/fe-secure.c)

The last message "Connection timed out" is the errno translation from the recv system function:

ETIMEDOUT Connection timed out (POSIX.1-2001)

When those timeout occurred, the sender was still busy deleting files from data/pg_replslot/bdcpb21_sene, accumulating more than 6 millions small ".spill" files.
It seems this very long pause is at cleanup stage were PG is blindly trying to delete those files.

strace on wal sender show tons of calls like:

unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)

This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
The call stack presumes this may probably occur in DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):

unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
> /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) [0x769e3d]
> /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== replication/logical/reorderbuff.c:1480 (mais cette fonction (static) n'est utiliée qu'au sein de ce module ...)
> /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== replication/logical/decode.c:175
> /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== replication/logical/decode.c:90, appelle la fonction rmgr.rm_decode(ctx, &buf) = 1 des 6 méthodes du resource manager
> /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
> /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
> /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
> /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
> /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
> /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
> /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
> /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
> /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]

We did not find any other option than deleting the subscription to stop that loop and start a new one (thus loosing transactions).

The publisher is PostgreSQL 15.6
The subscriber is PostgreSQL 14.5

Thanks

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2024-11-06 08:21:54 Re: Eager aggregation, take 3
Previous Message jian he 2024-11-06 07:10:23 Re: general purpose array_sort