Re: pg_walfile_name_offset can return inconsistent values

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_walfile_name_offset can return inconsistent values
Date: 2023-11-13 17:14:57
Message-ID: ZVJZkU9O98Jbydml@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 10, 2023 at 07:59:43PM -0800, Andres Freund wrote:
> Hi,
>
> On 2023-11-09 16:14:07 -0500, Bruce Momjian wrote:
> > On Thu, Nov 9, 2023 at 09:49:48PM +0100, Matthias van de Meent wrote:
> > > Either way, I think fix #1 is most correct (as was attached in
> > > offset2.diff, and quoted verbatim here), because that has no chance of
> > > having surprising underflowing behaviour when you use '0/0'::lsn as
> > > input.
> >
> > Attached is the full patch that changes pg_walfile_name_offset() and
> > pg_walfile_name(). There is no need for doc changes.
>
> I think this needs to add tests "documenting" the changed behaviour and
> perhaps also for a few other edge cases. You could e.g. test
> SELECT * FROM pg_walfile_name_offset('0/0')
> which today returns bogus values, and which is independent of the wal segment
> size.
>
> And with
> SELECT setting::int8 AS segment_size FROM pg_settings WHERE name = 'wal_segment_size' \gset
> you can test real things too, e.g.:
> SELECT segment_number, file_offset FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size), pg_split_walfile_name(file_name);
> SELECT segment_number, file_offset FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size + 1), pg_split_walfile_name(file_name);
> SELECT segment_number, file_offset = :segment_size - 1 FROM pg_walfile_name_offset('0/0'::pg_lsn + :segment_size - 1), pg_split_walfile_name(file_name);

Sure, I have added these tests.

FYI, pg_walfile_name_offset() has this C comment, which I have removed
in this patch;

* Note that a location exactly at a segment boundary is taken to be in
* the previous segment. This is usually the right thing, since the
* expected usage is to determine which xlog file(s) are ready to archive.

There is also a documentation mention of this behavior:

When the given write-ahead log location is exactly at a write-ahead log file boundary, both
these functions return the name of the preceding write-ahead log file.
This is usually the desired behavior for managing write-ahead log archiving
behavior, since the preceding file is the last one that currently
needs to be archived.

After seeing the doc mention, I started digging into the history of this
feature. It was originally called pg_current_xlogfile_offset() and
proposed in this email thread, which started on 2006-07-31:

https://www.postgresql.org/message-id/flat/1154384790.3226.21.camel%40localhost.localdomain

In the initial patch by Simon Riggs, there was no "previous segment
file" behavior, just a simple filename/offset calculation.

This was applied on 2006-08-06 with this commit:

commit 704ddaaa09
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Sun Aug 6 03:53:44 2006 +0000

Add support for forcing a switch to a new xlog file; cause such a switch
to happen automatically during pg_stop_backup(). Add some functions for
interrogating the current xlog insertion point and for easily extracting
WAL filenames from the hex WAL locations displayed by pg_stop_backup
and friends. Simon Riggs with some editorialization by Tom Lane.

and the email of the commit said:

https://www.postgresql.org/message-id/18457.1154836638%40sss.pgh.pa.us

I also made the new user-level functions a bit more orthogonal, so
that filenames could be extracted from the existing functions like
pg_stop_backup.

There is later talk about returning last write pointer vs. the current
insert pointer, and having it match what is returned by pg_stop_backup():

https://www.postgresql.org/message-id/1155124565.2368.95.camel%40localhost.localdomain

Methinks it should be the Write pointer all of the time, since I can't
think of a valid reason for wanting to know where the Insert pointer is
*before* we've written to the xlog file. Having it be the Insert pointer
could lead to some errors.

and I suspect that it was the desire to return the last write pointer
that caused the function to return the previous segment on a boundary
offset. This was intended to simplify log shipping implementations, I
think.

The function eventually was renamed in the xlog-to-wal renaming and moved
from xlog.c to xlogfuncs.c. This thread in 2022 mentioned the
inconsistency for 0/0, but didn't seem to talk about the inconsistency
of offset vs file name:

https://www.postgresql.org/message-id/flat/20220204225057.GA1535307%40nathanxps13#d964275c9540d8395e138efc0a75f7e8

and it concluded with:

Yes, its the deliberate choice of design, or a kind of
questionable-but-unoverturnable decision. I think there are many
external tools conscious of this behavior.

However, with the report about the inconsistency, the attached patch
fixes the behavior and removes the documentation about the odd behavior.
This will need to be mentioned as an incompatibility in the major
version release notes.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

Attachment Content-Type Size
offset1.diff text/x-diff 5.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-11-13 17:21:34 Re: Is this a problem in GenericXLogFinish()?
Previous Message Joe Conway 2023-11-13 16:57:56 Re: How to solve the problem of one backend process crashing and causing other processes to restart?