Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver
Date: 2025-03-05 08:04:44
Message-ID: Z8gFnH4o3jBm5BRz@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Mar 05, 2025 at 12:35:26PM +0900, Michael Paquier wrote:
> Hi all,
>
> While doing some monitoring of a replication setup for a stable
> branch, I have been surprised by the fact that we have never tracked
> WAL statistics for the WAL receiver in pg_stat_wal because we have
> never bothered to update its code so that WAL stats are reported.

Nice catch!

> This
> is relevant for the write and sync counts and timings.

Also for sync? Syncs look fine, as issue_xlog_fsync() is already called in
XLogWalRcvFlush(), no?

> As of f4694e0f35b2, the situation is better thanks to the addition of
> a pgstat_report_wal() in the WAL receiver main loop, so we have some
> data. However, we are only able to gather the data for segment syncs
> and initializations, not the writes themselves as these are managed by
> an independent code path, XLogWalRcvWrite().
>
> A second thing missing in XLogWalRcvWrite() is a wait event around
> the pg_pwrite() call, which is useful as the WAL receiver is listed in
> pg_stat_activity. Note that it is possible to re-use the same wait
> event as XLogWrite() for the WAL receiver, WAL_WRITE, because the WAL
> receiver does not rely on the write and flush calls from xlog.c when
> doing its work, and both have the same meaning, i.e. they both write WAL.
> The fsync calls use issue_xlog_fsync() and the segment inits happen in
> XLogFileInit().
>
> Perhaps there's a point in backpatching a portion of what's in the
> attached patch (the wait event?), but I am not planning to bother much
> with the stable branches based on the lack of complaints.

We're not emitting some statistics, so I think that it's hard for users to
complain about something they don't/can't see.

> If you
> have an opinion about that, please feel free.

I'm tempted to say that the WAL receiver part of f4694e0f35b2 should be
backpatched, along with what you're doing here.

+ /*
+ * Measure I/O timing to write WAL data, for pg_stat_io.
+ */
+ start = pgstat_prepare_io_time(track_wal_io_timing);
+
+ pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+ pgstat_report_wait_end();

Same logic as in XLogWrite(), and I don't think there is a need for a
dedicated wait event, so LGTM.
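For readers outside the backend, the pattern in the hunk above (advertise a wait event for exactly the duration of the pg_pwrite() call, and optionally time it) can be illustrated with a minimal standalone sketch. This is not the patch: report_wait_start(), report_wait_end(), timed_wal_write() and the counters are hypothetical stand-ins for pgstat_report_wait_start(), pgstat_report_wait_end(), pgstat_prepare_io_time() and the cumulative statistics machinery, and the track_timing flag merely plays the role of track_wal_io_timing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>   /* open() and O_* flags, for callers opening a segment */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * HYPOTHETICAL stand-ins, not PostgreSQL APIs: just enough state to mirror
 * wait-event reporting and per-write WAL statistics.
 */
static const char *current_wait_event = NULL;  /* what a monitor would see */
static uint64_t wal_write_count = 0;
static uint64_t wal_write_bytes = 0;
static double wal_write_time_ms = 0.0;

static void
report_wait_start(const char *event)
{
    current_wait_event = event;
}

static void
report_wait_end(void)
{
    current_wait_event = NULL;
}

/*
 * Write one chunk of WAL at the given offset, advertising a wait event for
 * the duration of pwrite() and, when track_timing is set, measuring how long
 * the call took.
 */
static ssize_t
timed_wal_write(int fd, const void *buf, size_t nbytes, off_t offset,
                int track_timing)
{
    struct timespec start = {0};
    struct timespec end = {0};
    ssize_t written;

    assert(buf != NULL);

    /* Only pay for clock reads when timing is enabled. */
    if (track_timing)
        clock_gettime(CLOCK_MONOTONIC, &start);

    /* Bracket only the system call, as XLogWrite() does. */
    report_wait_start("WAL_WRITE");
    written = pwrite(fd, buf, nbytes, offset);
    report_wait_end();

    if (track_timing)
        clock_gettime(CLOCK_MONOTONIC, &end);

    /* Count only successful writes; the caller reports errors itself. */
    if (written > 0)
    {
        wal_write_count++;
        wal_write_bytes += (uint64_t) written;
        if (track_timing)
            wal_write_time_ms += (double) (end.tv_sec - start.tv_sec) * 1000.0 +
                                 (double) (end.tv_nsec - start.tv_nsec) / 1e6;
    }

    return written;
}
```

Keeping the wait event tight around the system call means pg_stat_activity only reports the write wait while the process is actually blocked in the kernel, and reading the clock only when timing is enabled mirrors why track_wal_io_timing is a separate setting.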

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
