Re: [RFC] Lock-free XLog Reservation from WAL

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [RFC] Lock-free XLog Reservation from WAL
Date: 2025-01-10 18:33:57
Message-ID: 7b31f916-2b7d-49c7-b70a-b0342ba6b423@postgrespro.ru
Lists: pgsql-hackers

10.01.2025 19:53, Matthias van de Meent wrote:
> On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>>
>> BTW, your version could use a similar trick for guaranteed atomicity:
>> - change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
>> and store offset to prev record's start.
>
> -1, I don't think that is possible without degrading what our current
> WAL system protects against.
>
> For intra-record torn write protection we have the checksum, but that
> same protection doesn't cover the multiple WAL records on each page.
> That is what the xl_prev pointer is used for - detecting that this
> part of the page doesn't contain the correct data (e.g. the data of a
> previous version of this recycled segment).
> If we replaced xl_prev with just an offset into the segment, then this
> protection would be much less effective, as the previous version of
> the segment realistically used the same segment offsets at the same
> offsets into the file.

Well, to protect against torn writes it is enough to have a "self-lsn"
field rather than a "prev-lsn". So an 8-byte "self-lsn" plus a 4-byte
"offset-to-prev" would work.

But this way the header would grow by 4 bytes compared to the current
one, not shrink.
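
To make the sizes concrete, here is a rough sketch (not a patch). The
first struct mirrors the current XLogRecord layout with the typedefs
flattened to stdint types. The other two structs, and the names
OffsetOnlyXLogRecord, SelfLsnXLogRecord, xl_self_lsn, HEADER_SIZE and
header_delta, are hypothetical and exist only for this illustration. The
size is taken up to and including the CRC, the same way SizeOfXLogRecord
is computed, so trailing struct padding is not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header size up to and including the CRC, ignoring trailing padding. */
#define HEADER_SIZE(type) (offsetof(type, xl_crc) + sizeof(uint32_t))

/* Current layout (typedefs flattened to fixed-width integers). */
typedef struct CurrentXLogRecord
{
    uint32_t    xl_tot_len;     /* total length of the record */
    uint32_t    xl_xid;         /* transaction id */
    uint64_t    xl_prev;        /* LSN of the previous record */
    uint8_t     xl_info;
    uint8_t     xl_rmid;
    /* 2 bytes of padding here */
    uint32_t    xl_crc;
} CurrentXLogRecord;

/* Hypothetical: xl_prev replaced by a 32-bit offset to the previous record. */
typedef struct OffsetOnlyXLogRecord
{
    uint32_t    xl_tot_len;
    uint32_t    xl_xid;
    uint32_t    xl_prev_offset;
    uint8_t     xl_info;
    uint8_t     xl_rmid;
    /* 2 bytes of padding here */
    uint32_t    xl_crc;
} OffsetOnlyXLogRecord;

/* Hypothetical: 8-byte self-lsn plus the 32-bit offset to the previous record. */
typedef struct SelfLsnXLogRecord
{
    uint32_t    xl_tot_len;
    uint32_t    xl_xid;
    uint64_t    xl_self_lsn;
    uint32_t    xl_prev_offset;
    uint8_t     xl_info;
    uint8_t     xl_rmid;
    /* 2 bytes of padding here */
    uint32_t    xl_crc;
} SelfLsnXLogRecord;

/* Bytes added (positive) or saved (negative) relative to the current header. */
int
header_delta(size_t candidate_size)
{
    return (int) candidate_size - (int) HEADER_SIZE(CurrentXLogRecord);
}

/* Sanity check of the three layouts above. */
void
check_header_sizes(void)
{
    assert(HEADER_SIZE(CurrentXLogRecord) == 24);
    assert(header_delta(HEADER_SIZE(OffsetOnlyXLogRecord)) == -4);
    assert(header_delta(HEADER_SIZE(SelfLsnXLogRecord)) == 4);
}
```

So the offset-only header would save 4 bytes, while the "self-lsn" +
"offset-to-prev" header costs 4 bytes more than today's 24.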

Just a thought:
If XLogRecord alignment were stricter (for example, 32 bytes), then an LSN
could count 32-byte units instead of bytes. Then the low 32 bits of the LSN
would cover 128GB of WAL. For most installations the re-use distance for
WAL segments is unlikely to be longer than 128GB. But I believe there are
some with a larger one. So it is not reliable.
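
The arithmetic behind the 128GB figure, as a small sketch. The function
names (lsn32_window_bytes, lsn32_truncate, lsn32_collides) are made up
for this mail and are not existing PostgreSQL functions; the "truncated
LSN" is the hypothetical 32-bit field discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of WAL covered by a 32-bit counter of `alignment`-byte units. */
uint64_t
lsn32_window_bytes(uint64_t alignment)
{
    assert(alignment > 0 && alignment <= UINT32_MAX);
    return alignment << 32;
}

/* Hypothetical truncated LSN: position in `alignment`-byte units, low 32 bits. */
uint32_t
lsn32_truncate(uint64_t lsn, uint64_t alignment)
{
    assert(alignment > 0);
    return (uint32_t) (lsn / alignment);
}

/*
 * True when two different aligned record positions get the same truncated
 * value, i.e. a stale record in a recycled segment could look current.
 */
int
lsn32_collides(uint64_t a, uint64_t b, uint64_t alignment)
{
    return a != b &&
        lsn32_truncate(a, alignment) == lsn32_truncate(b, alignment);
}
```

With 32-byte alignment the window is 2^32 * 32 = 2^37 bytes = 128GB,
which is 8192 segments at the default 16MB segment size. Two positions
exactly one window apart collide, which is the unreliable case for
installations that keep more WAL than that.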

> To protect against torn writes while still only using record segment
> offsets, you'd have to zero and then fsync any segment before reusing it,
> which would severely reduce the benefits we get from recycling
> segments.
> Note that we can't expect the page header to help here, as write tears
> can happen at nearly any offset into the page - not just 8k intervals
> - and so the page header is not always representative of the origins
> of all bytes on the page - only the first 24 (if even that).

-----

regards,
Yura
