Re: Excessive number of replication slots for 12->14 logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Excessive number of replication slots for 12->14 logical replication
Date: 2022-09-10 05:36:44
Message-ID: CAA4eK1LQOjPjru+xn5bDGq4jaA24t=pkD7DODUv224_MmDu-3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, Sep 10, 2022 at 3:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > Pushed.
>
> Tom reported buildfarm failures[1] and I've investigated the cause and
> concluded this commit is relevant.
>

I was checking the failure logs at [1]. In below logs from 100_bugs_twoways.log:

2022-09-09 09:30:18.950 EDT [631b3fea.1c0:24]
pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION
SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/1CC9510
(proto_version '3', origin 'any', publication_names '"testpub"')
2022-09-09 09:30:18.952 EDT [631b3fea.1c0:25]
pg_16400_sync_16392_7141371862484106124 ERROR: could not find record
while sending logically-decoded data: missing contrecord at 0/1CCF9F0
2022-09-09 09:30:18.952 EDT [631b3fea.1c0:26]
pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION
SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/1CC9510
(proto_version '3', origin 'any', publication_names '"testpub"')
ERROR: could not find record while sending logically-decoded data:
missing contrecord at 0/1CCF9F0
2022-09-09 09:30:18.952 EDT [631b3fea.2b14:2] ERROR: error while
shutting down streaming COPY: ERROR: could not find record while
sending logically-decoded data: missing contrecord at 0/1CCF9F0
2022-09-09 09:30:18.953 EDT [631b3fea.1c0:27]
pg_16400_sync_16392_7141371862484106124 LOG: disconnection: session
time: 0:00:00.168 user=pgrunner database=d1 host=127.0.0.1 port=53881
2022-09-09 09:30:18.965 EDT [631b3fe6.137c:4] LOG: background worker
"logical replication worker" (PID 11028) exited with exit code 1
....
.....
2022-09-09 09:30:19.437 EDT [631b3feb.26e8:1] LOG: logical
replication table synchronization worker for subscription "testsub",
table "t" has started
....
....
2022-09-09 09:30:19.513 EDT [631b3feb.840:5]
pg_16400_sync_16392_7141371862484106124 LOG: received replication
command: START_REPLICATION SLOT
"pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0 (proto_version
'3', origin 'any', publication_names '"testpub"')
2022-09-09 09:30:19.513 EDT [631b3feb.840:6]
pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION
SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0
(proto_version '3', origin 'any', publication_names '"testpub"')
2022-09-09 09:30:19.515 EDT [631b3feb.840:7]
pg_16400_sync_16392_7141371862484106124 LOG: starting logical
decoding for slot "pg_16400_sync_16392_7141371862484106124"
2022-09-09 09:30:19.515 EDT [631b3feb.840:8]
pg_16400_sync_16392_7141371862484106124 DETAIL: Streaming
transactions committing after 0/1CCF9F0, reading WAL from 0/1CC94D8.

One thing is not clear to me how the first time error: "could not find
record while sending logically-decoded data ..." can happen due to
this commit? Also, based on the origin even if the client sends a
prior location (0/0 in this case) but the server will still start from
the location where the client has confirmed the commit (aka
confirmed_flush location). Isn't that the reason why the above LOG
says (reading WAL from 0/1CC94D8)? See the below code in
CreateDecodingContext():

if (start_lsn == InvalidXLogRecPtr)
{
/* continue from last position */
start_lsn = slot->data.confirmed_flush;

If this theory is correct then this commit won't be the cause of BF
failure but it is quite possible that I am missing something here.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2022-09-09+12%3A03%3A46

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2022-09-10 06:15:29 Re: Excessive number of replication slots for 12->14 logical replication
Previous Message houzj.fnst@fujitsu.com 2022-09-10 03:41:01 RE: Excessive number of replication slots for 12->14 logical replication