Re: replication primary writting infinite number of WAL files

From: Les <nagylzs(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: replication primary writting infinite number of WAL files
Date: 2023-11-24 15:59:55
Message-ID: CAKXe9UDj27Z=YiNMU8DwtzVARy_D-umarOBY2OEfKYFwzbjJwg@mail.gmail.com
Lists: pgsql-general

Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> wrote (Fri, 24 Nov 2023, 16:00):

> On Fri, 2023-11-24 at 12:39 +0100, Les wrote:
> > Under normal circumstances, the number of write operations is relatively low, with an
> > average of 4-5 MB/sec total write speed on the disk associated with the data directory.
> > Yesterday, the primary server suddenly started writing to the pg_wal directory at a
> > crazy pace, 1.5GB/sec, but sometimes it went up to over 3GB/sec.
> > [...]
> > Upon further analysis of the database, we found that we did not see any mass data
> > changes in any of the tables. The only exception is a sequence value that was moved
> > millions of steps within a single minute.
>
> That looks like some application went crazy and inserted millions of rows, but the
> inserts were rolled back. But it is hard to be certain with the clues given.
>

Writing of WAL files continued after we shut down all clients, and
restarted the primary PostgreSQL server.

The order was:

1. shut down all clients
2. stop the primary
3. start the primary
4. primary started to write like mad again
5. removed the replication slot
6. primary stopped madness and deleted all WAL files (except for a few)
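
To put a number on how much WAL a slot is holding back in a sequence like the one above, one can compare the slot's restart_lsn (from pg_replication_slots) with pg_current_wal_lsn(). A minimal Python sketch of that arithmetic, assuming both LSNs have already been fetched as text; the function names and the sample values are hypothetical, not taken from this server:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current write position."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    held = retained_wal_bytes("2/0", "1/0")
    print(f"{held / 2**30:.1f} GiB retained")
```

This only measures what the slot keeps the server from recycling; it says nothing about what is generating the WAL in the first place.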

How can the primary server generate more and more WAL files (writes) after
all clients have been shut down and the server was restarted? My only bet
was the autovacuum. But I ruled that out, because removing a replication
slot has no effect on the autovacuum (am I wrong?). Now you are saying that
this looks like a huge rollback. Does rolling back changes require even
more data to be written to the WAL after server restart? As far as I know,
if something was not written to the WAL, then it is not something that can
be rolled back. Does removing a replication slot lessen the amount of data
needed to be written for a rollback (or for anything else)? It is a fact
that the primary stopped writing at 1.5GB/sec the moment we removed the
slot.
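
For a sense of scale, the 1.5GB/sec figure can be converted into WAL segment files per second. A minimal sketch, assuming the default 16 MB wal_segment_size and binary units (neither is confirmed for this server); the function name is hypothetical:

```python
# Default WAL segment size in bytes (16 MB). This is an assumption:
# the server in question may have been initialized with another size.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def segments_per_second(bytes_per_second: float,
                        segment_bytes: int = WAL_SEGMENT_BYTES) -> float:
    """Number of WAL segment files filled per second at a given write rate."""
    return bytes_per_second / segment_bytes


if __name__ == "__main__":
    rate = 1.5 * 1024**3  # 1.5 GB/sec, read as binary gigabytes
    print(segments_per_second(rate))  # 96.0
```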

I'm not saying that you are wrong. Maybe there was a crazy application. I'm
just saying that a crazy application cannot be the whole picture. It cannot
explain this behaviour as a whole. Or maybe I have a deep misunderstanding
about how WAL files work. On the second occasion, the primary was running
for a few minutes when pg_wal started to increase. We noticed that early,
and shut down all clients, then restarted the primary server. After the
restart, the primary was writing out more WAL files for many more minutes,
until we dropped the slot again. That is, it was writing much more data after
the restart than before the restart; and it only stopped (exactly) when we
removed the slot.

Regards,

Laszlo
