Re: replication primary writting infinite number of WAL files

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Les <nagylzs(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: replication primary writting infinite number of WAL files
Date: 2023-11-24 16:50:19
Message-ID: 38f7d87c-ae22-48e8-a4c4-0acde1ad6eb9@aklaver.com
Lists: pgsql-general

On 11/24/23 03:39, Les wrote:
> Hello,
> Yesterday, the primary server suddenly started
> writing to the pg_wal directory at a crazy pace, 1.5GB/sec, but
> sometimes it went up to over 3GB/sec. The pg_wal started fattening up
> and didn't stop until it ran out of disk space. It happened so fast that
> we didn't have time to react. We then stopped all applications
> (postgresql clients) because we thought one of them was causing the
> problem.

> The only exception is a sequence
> value that was moved millions of steps within a single minute. Of

Did you determine this by looking at select * from some_seq?
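
For reference, a minimal sketch of that check. Here some_seq is only a placeholder for the real sequence name, and the pg_sequences view assumes PostgreSQL 10 or later:

```sql
-- Current state of a single sequence; some_seq is a placeholder name.
SELECT last_value, is_called FROM some_seq;

-- The same value for every sequence visible to the current role.
SELECT schemaname, sequencename, last_value
FROM pg_sequences
ORDER BY schemaname, sequencename;
```

Running either query twice, a minute apart, would show how fast the value is actually moving.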

> This new instance worked for about 12 hours. This morning, the
> error occurred again, in the same form. Based on our previous
> experience, we immediately deleted the standby and its replication slot,
> and the problem resolved itself (except that the standby had to be
> deleted again). Without rebooting or restarting anything else, the
> problem went away. I managed to save a small part of the pg_wal before
> deleting the slot. We looked into this and saw something like this:

Are the servers open to the world and, if so, have you explored whether
there has been an intrusion?

Do you have logs that cover the period from when it transitioned from
working normally to going haywire?

> We looked at the PostgreSQL release history, and we see some bug fixes
> in version 14.7 that might have something to do with this:
>
> https://www.postgresql.org/docs/release/14.7/
>
> > Ignore invalidated logical-replication slots while determining oldest
> catalog xmin (Sirisha Chamarthi) A replication slot could prevent
> cleanup of dead tuples in the system catalogs even after it becomes
> invalidated due to exceeding max_slot_wal_keep_size. Thus, failure of a
> replication consumer could lead to indefinitely-large catalog bloat.
>

You are using repmgr, which as I understand it uses streaming, not
logical, replication.
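
One way to confirm that is to look at the slot type on the primary, along with how much WAL each slot is holding back. This is a sketch that assumes PostgreSQL 13 or later, since wal_status does not exist in older versions:

```sql
-- Run on the primary. slot_type is 'physical' for streaming replication
-- slots and 'logical' for logical replication slots.
-- retained_wal is the WAL between the slot's restart_lsn and the current
-- write position, i.e. what the slot forces the primary to keep.
SELECT slot_name,
       slot_type,
       active,
       wal_status,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

A physical slot that is inactive, or whose retained_wal keeps growing, would account for pg_wal not being recycled, but not for the write rate itself.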

> Thank you,
>
>    Laszlo
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
