Re: Logical replication timeout problem

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tang, Haiying/唐 海英 <tanghy(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication timeout problem
Date: 2021-12-29 11:32:41
Message-ID: CAA5-nLB2K0YTaYtd-qUwy+QTUA0Q0fZt9_R=mFmb9pOVObgBBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I put the instance with high level debug mode.
I try to do some log interpretation: After having finished writing the
modifications generated by the insert in the snap files,
then these files are read (restored). One minute after this work starts,
the worker process exit with an error code = 1.
I see that keepalive messages were sent before the work process work leave.

2021-12-28 10:50:01.894 CET [55792] LOCATION: WalSndKeepalive,
walsender.c:3365
...
2021-12-28 10:50:31.854 CET [55792] STATEMENT: START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: StartTransaction(1)
name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2021-12-28 10:50:31.907 CET [55792] LOCATION: ShowTransactionStateRec,
xact.c:5075
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:31.907 CET [55792] DEBUG: 00000: spill 2271 changes in
XID 14312 to disk
2021-12-28 10:50:31.907 CET [55792] LOCATION: ReorderBufferSerializeTXN,
reorderbuffer.c:2245
2021-12-28 10:50:31.907 CET [55792] STATEMENT: START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
*2021-12-28 10:50:32.110 CET [55792] DEBUG: 00000: restored 4096/22603999
changes from disk*
2021-12-28 10:50:32.110 CET [55792] LOCATION: ReorderBufferIterTXNNext,
reorderbuffer.c:1156
2021-12-28 10:50:32.110 CET [55792] STATEMENT: START_REPLICATION SLOT
"sub008_s012a00" LOGICAL 17/27240748 (proto_version '1', publication_names
'"pub008_s012a00"')
2021-12-28 10:50:32.138 CET [55792] DEBUG: 00000: restored 4096/22603999
changes from disk
...

*2021-12-28 10:50:35.341 CET [55794] DEBUG: 00000: sending replication
keepalive2021-12-28 10:50:35.341 CET [55794] LOCATION: WalSndKeepalive,
walsender.c:3365*
...
*2021-12-28 10:51:31.995 CET [55791] ERROR: XX000: terminating logical
replication worker due to timeout*

*2021-12-28 10:51:31.995 CET [55791] LOCATION: LogicalRepApplyLoop,
worker.c:1267*

Could this function in* Apply main loop* in worker.c help to find a
solution?

rc = WaitLatchOrSocket(MyLatch,
WL_SOCKET_READABLE | WL_LATCH_SET |
WL_TIMEOUT | WL_POSTMASTER_DEATH,
fd, wait_time,
WAIT_EVENT_LOGICAL_APPLY_MAIN);
Thanks for your help

Fabrice

On Thu, Dec 23, 2021 at 11:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Wed, Dec 22, 2021 at 8:50 PM Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
> wrote:
> >
> > Hello Amit,
> >
> > I was able to reproduce the timeout problem in the lab.
> > After loading more than 20 millions of rows in a table which is not
> replicated (insert command ends without error), errors related to logical
> replication processes appear in the postgres log.
> > Approximately every 5 minutes worker process is restarted. The snap
> files in the slot directory are still present. The replication system seems
> to be blocked. Why these snap files are not removed. What do they contain?
> >
>
> These contain changes of insert. I think these are not removed for
> your case as your long transaction is never finished. As mentioned
> earlier, for such cases, it is better to use 'streaming' feature
> released as part of PG-14 but anyway here we are trying to debug
> timeout problem.
>
> > I will recompile postgres with your patch to debug.
> >
>
> Okay, that might help.
>
> --
> With Regards,
> Amit Kapila.
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabrízio de Royes Mello 2021-12-29 11:50:23 Re: Converting WAL to SQL
Previous Message Peter Eisentraut 2021-12-29 11:08:17 Re: automatically generating node support functions