Re: Pre-allocating WAL files

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Pre-allocating WAL files
Date: 2025-01-22 15:50:59
Message-ID: Z5ET4xCZJEIx3bKK@nathan
Lists: pgsql-hackers

On Tue, Jan 21, 2025 at 11:23:06AM -0500, Andres Freund wrote:
> On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote:
>> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote:
>> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote:
>> >> 3. What is the purpose of the preallocated_segments directory? What I
>> >> have in mind is that we just preallocate the normal filename so that
>> >> XLogWrite could open it directly. This is the same as what wal_recycle
>> >> does, and we can reuse the same strategy to clean them up if they are
>> >> not needed anymore.
>> >
>> > The purpose is to limit the use of pre-allocated segments to only
>> > situations where WAL recycling is not sufficient. Basically, if writing a
>> > record would require a new segment to be created, we can quickly pull a
>> > pre-allocated one instead of creating it ourselves. Besides simplifying
>> > matters, this prevents a lot of unnecessary pre-allocation, since many
>> > workloads will almost never need anything beyond the recycled segments.
>
> I don't really understand that argument - we should be able to predict rather
> precisely whether we need to preallocate or not. We have the recent WAL "fill
> rate", we know the end of the WAL and we can easily track how far ahead of the
> current point we have allocated. Why preallocate when we have a large reserve
> of "future" segments? Why preallocate in a separate directory when we have no
> future segments?

If we can indeed reliably predict whether we need pre-allocation, then
sure, let's just create future segments directly in pg_wal. I'm not sure
we could reliably predict whether WAL will be recycled in time, so we might
pre-allocate a bit more than necessary, but that's not too terrible. My
"pooling" approach was intended to keep the pre-allocation to a minimum
(IME you really only need a couple at any given time) and to avoid the
guesswork involved in predicting.
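
To make the prediction idea concrete, here is a rough sketch of the kind of
heuristic being discussed: compare the recent WAL fill rate against the number
of future segments already available, and only create the shortfall. This is
not from any patch. The function name, the lookahead window, and the cap of
two segments are all assumptions made up for illustration.

```python
import math


def segments_to_preallocate(fill_rate_bytes_per_sec, segment_size_bytes,
                            future_segments, lookahead_sec, max_prealloc=2):
    """Hypothetical heuristic: how many extra segments to create right now.

    fill_rate_bytes_per_sec: recently observed WAL generation rate
    future_segments: segments already available ahead of the insert point
    lookahead_sec: assumed window we want to stay ahead of the writer
    max_prealloc: assumed cap, to keep pre-allocation small
    """
    # Segments we expect to consume within the lookahead window.
    needed = math.ceil(
        fill_rate_bytes_per_sec * lookahead_sec / segment_size_bytes)
    # Only the shortfall relative to what already exists needs creating.
    shortfall = max(0, needed - future_segments)
    return min(shortfall, max_prealloc)
```

With a large reserve of future segments this returns 0, which is the "why
preallocate at all" case. The uncertain input is future_segments, since it
depends on whether recycling happens in time.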

>> That being said, it would be nice to avoid the fsync() overhead to move a
>> pre-allocated WAL into place. My first instinct is that would be
>> substantially more complicated and may not actually improve matters all
>> that much, but I agree that it's worth exploring.
>
> FWIW, I've seen the fsyncs around recycling being a rather substantial
> bottleneck. To the point of the main benefit of larger segments being the
> reduction in number of fsyncs at the end of a checkpoint. I think we should
> be able to make the fsyncs a lot more efficient by batching them, first rename
> a bunch of files, then fsync them and the directory. The current pattern
> basically requires a separate filesystem journal flush for each WAL segment.

+1, these kinds of fsync() patterns should be fixed.
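
As a sketch of the batched ordering described above (rename a bunch of files,
then fsync them and the directory), something like the following shows the
shape of it. It is Python rather than backend C, it assumes a POSIX system
where a directory can be opened read-only and fsync'd, and the helper names
are hypothetical, not PostgreSQL functions.

```python
import os


def fsync_path(path):
    """Open a file or directory read-only and fsync it (POSIX assumption)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def install_segments_batched(renames, wal_dir):
    """Rename all (src, dst) pairs first, then do the fsyncs afterwards.

    The per-segment pattern would rename and flush one file at a time;
    here the directory is flushed once for the whole batch.
    """
    # Phase 1: perform every rename without flushing in between.
    for src, dst in renames:
        os.rename(src, dst)
    # Phase 2: fsync each file under its new name.
    for _, dst in renames:
        fsync_path(dst)
    # Phase 3: a single fsync of the directory to persist the new entries.
    fsync_path(wal_dir)
    return len(renames)
```

Whether this actually reduces journal flushes depends on the filesystem, so
it would need measuring.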

--
nathan
