Re: Intermittent Issue with WAL Segment Removal in Logical Replication

From: Kaushik Iska <kaushik(at)peerdb(dot)io>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Sai Krishna Srirampur <sai(at)peerdb(dot)io>, Philip Dubé <philip(at)peerdb(dot)io>
Subject: Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Date: 2023-12-29 21:28:40
Message-ID: CAHYLuV=_+_+OqXm=yuObjqPYfQhiBbTRJk9qFzhAte4TfNtV3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I am unfortunately not really familiar with Google Cloud SQL internals as
well. But we have seen this happen on Amazon RDS as well.

Could it be possible that we are requesting a future WAL segment, say WAL
upto X is written and we are asking for X + 1? It could be that the error
message is misleading.

I do not have the information from pg_replication_slots as I have
terminated the test. I am fairly certain that I can reproduce this again. I
will gather both the restart_lsn and contents of pg_wal for the failed
segment. Is there any other information that would help debug this further?

On Fri, Dec 29, 2023 at 4:16 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> On 12/27/23 16:31, Kaushik Iska wrote:
> > Hi all,
> >
> > I'm including additional details, as I am able to reproduce this issue a
> > little more reliably.
> >
> > Postgres Version: POSTGRES_14_9.R20230830.01_07
> > Vendor: Google Cloud SQL
> > Logical Replication Protocol version 1
> >
>
> I don't know much about Google Cloud SQL internals. Is it relatively
> close to Postgres (as e.g. RDS) or are the internals very different /
> modified for cloud environments?
>
> > Here are the logs of attempt succeeding right after it fails:
> >
> > 2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
> > STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> > 6/5AE67D79 (proto_version '1', publication_names
> > 'peerflow_pub_wal_testing_2') <- FAILS
> > 2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres
> > ERROR: requested WAL segment 000000010000000600000059 has already been
> > removed
> > 2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
> > STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
> > 6/5AE67D79 (proto_version '1', publication_names
> > 'peerflow_pub_wal_testing_2') <- SUCCEEDS
> > 2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres
> > LOG: logical decoding found consistent point at 6/5A31F050
> >
> > Happy to include any additional details of my setup.
> >
>
> I personally don't see how could this fail and then succeed, unless
> Google does something smart with the WAL segments under the hood. Surely
> we try to open the same WAL segment (given the LSN is the same), so how
> could it not exist and then exist?
>
> As Ron already suggested, it might be useful to see information for the
> replication slot peerflow_slot_wal_testing_2 (especially the restart_lsn
> value). Also, maybe show the contents of pg_wal (especially for the
> segment referenced in the error message).
>
> Can you reproduce this outside Google cloud environment?
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tomas Vondra 2023-12-30 00:26:35 Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Previous Message Tomas Vondra 2023-12-29 21:16:44 Re: Intermittent Issue with WAL Segment Removal in Logical Replication