Re: Logical replication timeout

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: RECHTÉ Marc <marc(dot)rechte(at)meteo(dot)fr>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication timeout
Date: 2024-11-06 13:49:48
Message-ID: CAExHW5s2_T9mULDQRKsdV72wpnA+NLT63cX51b51QQVEV4sG5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 6, 2024 at 1:07 PM RECHTÉ Marc <marc(dot)rechte(at)meteo(dot)fr> wrote:
>
> Hello,
>
> For some unknown reason (probably a very big transaction at the source), we experienced a logical decoding breakdown,
> due to a timeout from the subscriber side (either wal_receiver_timeout or connexion drop by network equipment due to inactivity).
>
> The problem is, that due to that failure, the wal_receiver process stops. When the wal_sender is ready to send some data, it finds the connexion broken and exits.
> A new wal_sender process is created that restarts from the beginning (restart LSN). This is an endless loop.
>
> Checking the network connexion between wal_sender and wal_receiver, we found that no traffic occurs for hours.
>
> We first increased wal_receiver_timeout up to 12h and still got a disconnection on the receiver party:
>
> 2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR: terminating logical replication worker due to timeout
> 2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1356203) exited with exit code 1
>
> Then put this parameter to 0, but got then disconnected by the network (note the slight difference in message):
>
> 2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR: could not receive data from WAL stream: could not receive data from server: Connection timed out
> 2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG: background worker "logical replication worker" (PID 1697787) exited with exit code 1
>
> The message is generated in libpqrcv_receive function (replication/libpqwalreceiver/libpqwalreceiver.c) which calls pqsecure_raw_read (interfaces/libpq/fe-secure.c)
>
> The last message "Connection timed out" is the errno translation from the recv system function:
>
> ETIMEDOUT Connection timed out (POSIX.1-2001)
>
> When those timeout occurred, the sender was still busy deleting files from data/pg_replslot/bdcpb21_sene, accumulating more than 6 millions small ".spill" files.
> It seems this very long pause is at cleanup stage were PG is blindly trying to delete those files.
>
> strace on wal sender show tons of calls like:
>
> unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
>
> This occurs in ReorderBufferRestoreCleanup (backend/replication/logical/reorderbuffer.c).
> The call stack presumes this may probably occur in DecodeCommit or DecodeAbort (backend/replication/logical/decode.c):
>
> unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 ENOENT (Aucun fichier ou dossier de ce type)
> > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
> > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) [0x769e3d]
> > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== replication/logical/reorderbuff.c:1480 (mais cette fonction (static) n'est utiliée qu'au sein de ce module ...)
> > /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== replication/logical/decode.c:175
> > /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== replication/logical/decode.c:90, appelle la fonction rmgr.rm_decode(ctx, &buf) = 1 des 6 méthodes du resource manager
> > /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e]
> > /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121]
> > /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a]
> > /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588]
> > /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97]
> > /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c]
> > /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05]
> > /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554]
> > /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8]

I think, we need a call to rb->update_progress_txn(rb, txn,
change->lsn) at regular intervals in ReorderBufferRestoreCleanup()
similar to ReorderBufferProcessTXN(). And may be at more places where
we have potentially long running loops.

--
Best Wishes,
Ashutosh Bapat

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2024-11-06 13:51:02 Re: per backend I/O statistics
Previous Message Ryohei Takahashi (Fujitsu) 2024-11-06 13:06:55 RE: COPY performance on Windows