Re: Intermittent Issue with WAL Segment Removal in Logical Replication

From: Kaushik Iska <kaushik(at)peerdb(dot)io>
To: pgsql-general(at)postgresql(dot)org
Cc: Sai Krishna Srirampur <sai(at)peerdb(dot)io>, Philip Dubé <philip(at)peerdb(dot)io>
Subject: Re: Intermittent Issue with WAL Segment Removal in Logical Replication
Date: 2023-12-27 15:31:14
Message-ID: CAHYLuV=M2YTxecoc1MH=TbChei7pAyk2gNLHnCM_eGSnGhjeOQ@mail.gmail.com
Lists: pgsql-general

Hi all,

I'm including additional details, as I am able to reproduce this issue a
little more reliably.

Postgres Version: POSTGRES_14_9.R20230830.01_07
Vendor: Google Cloud SQL
Logical Replication Protocol version 1

Here are the logs of an attempt that succeeds right after one that fails:

2023-12-27 01:12:40.581 UTC [59790]: [6-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- FAILS
2023-12-27 01:12:41.087 UTC [59790]: [7-1] db=postgres,user=postgres ERROR:
requested WAL segment 000000010000000600000059 has already been removed
2023-12-27 01:12:44.581 UTC [59794]: [3-1] db=postgres,user=postgres
STATEMENT: START_REPLICATION SLOT peerflow_slot_wal_testing_2 LOGICAL
6/5AE67D79 (proto_version '1', publication_names
'peerflow_pub_wal_testing_2') <- SUCCEEDS
2023-12-27 01:12:44.582 UTC [59794]: [4-1] db=postgres,user=postgres LOG:
logical decoding found consistent point at 6/5A31F050
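
To make the segment in the error easier to compare with the LSNs in the
statements above, here is a minimal sketch that maps an LSN to its WAL
segment file name. It assumes the default 16 MB WAL segment size and
timeline 1 (I have not confirmed either for this Cloud SQL instance), and
the helper names parse_lsn and wal_segment_name are my own, not a
PostgreSQL API:

```python
# Sketch: map an LSN such as "6/5AE67D79" to its WAL segment file name.
# Assumption: default 16 MB segments; adjust if the server differs.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert the textual 'hi/lo' hex form of an LSN to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name that contains the given LSN."""
    # Number of segments that fit in one 4 GB "xlogid" (256 for 16 MB).
    segments_per_xlogid = 0x100000000 // segment_size
    segno = parse_lsn(lsn) // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


if __name__ == "__main__":
    for lsn in ("6/5AE67D79", "6/5A31F050"):
        print(lsn, wal_segment_name(lsn))
```

Under those assumptions, both the requested start LSN 6/5AE67D79 and the
consistent point 6/5A31F050 fall in segment 00000001000000060000005A, while
the error names 000000010000000600000059, which is one segment earlier.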

Happy to include any additional details of my setup.

Thanks,
Kaushik

On Tue, Dec 26, 2023 at 10:36 AM Kaushik Iska <kaushik(at)peerdb(dot)io> wrote:

> Dear PostgreSQL Community,
>
> I am seeking guidance regarding a recurring issue we've encountered with
> WAL segment removal during logical replication using the pgoutput plugin. We
> sporadically encounter an error indicating that a requested WAL segment has
> already been removed. This issue arises intermittently when executing
> START_REPLICATION. An example error message is as follows:
>
>
> requested WAL segment 000000010000146000000AE has already been removed
>
>
> Please note that this error is not specific to the segment mentioned
> above; it serves as an example of the type of error we are experiencing.
>
> Additional Context:
>
> - max_slot_wal_keep_size is -1, logical_decoding_work_mem is 4 GB.
> - The error seems to appear randomly and is not consistent.
> - After a couple of retries, the replication process eventually succeeds.
> - For one of the users it seems to be happening every 16 hours or so.
>
> Our approach involves starting with START_REPLICATION 0, replicating data
> in batches, and then restarting at the last LSN of the previous batch. We
> are trying to understand the root cause behind the intermittent removal of
> WAL segments during logical replication. Specifically, we are looking for
> insights into:
>
>
> - The potential reasons for the WAL segments being reported as removed.
> - Why this error occurs intermittently and why replication succeeds
>   after several retries.
>
> Any advice on troubleshooting and resolving this issue, or insights
> into whether it might be related to our specific replication setup or a
> characteristic of pgoutput, would be highly valuable.
>
> Related Posts
>
> - https://issues.redhat.com/browse/DBZ-590
> - Troubleshooting Postgres Sources | Airbyte Documentation
>   <https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#under-cdc-incremental-mode-there-are-still-full-refresh-syncs>
> - https://fivetran.com/docs/databases/postgresql/troubleshooting/last-tracked-lsn-error
>
> Thank you very much for your time and assistance.
>
> Thanks,
>
> Kaushik Iska
>
>
