Re: Pre-allocating WAL files

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Pre-allocating WAL files
Date: 2025-01-21 16:23:06
Message-ID: vfhyldq3rtnwd4uxbb7ro2opmvbmfswvkycavmzcqb55mmo4jk@g2d2n6xnwicb
Lists: pgsql-hackers

Hi,

On 2025-01-21 10:13:14 -0600, Nathan Bossart wrote:
> On Tue, Jan 21, 2025 at 09:52:51AM -0600, Nathan Bossart wrote:
> > On Tue, Jan 21, 2025 at 03:31:27AM +0000, Andy Fan wrote:
> >> 3. What is the purpose of the preallocated_segments directory? What I
> >> have in mind is that we just preallocate the normal filename so that
> >> XLogWrite can open it directly. This is the same as what wal_recycle does,
> >> and we can reuse the same strategy to clean them up if they are not needed
> >> anymore.
> >
> > The purpose is to limit the use of pre-allocated segments to only
> > situations where WAL recycling is not sufficient. Basically, if writing a
> > record would require a new segment to be created, we can quickly pull a
> > pre-allocated one instead of creating it ourselves. Besides simplifying
> > matters, this prevents a lot of unnecessary pre-allocation, since many
> > workloads will almost never need anything beyond the recycled segments.

I don't really understand that argument - we should be able to predict rather
precisely whether we need to preallocate or not. We have the recent WAL "fill
rate", we know the end of the WAL and we can easily track how far ahead of the
current point we have allocated. Why preallocate when we have a large reserve
of "future" segments? Why preallocate in a separate directory when we have no
future segments?
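
A minimal sketch of the kind of prediction described above, in Python for
brevity rather than as PostgreSQL code. The function name, its parameters, and
the "horizon" lookahead window are all hypothetical; the only inputs taken from
the text are the recent fill rate, the current end of WAL, and how far ahead
segments have already been allocated.

```python
import math

# Default WAL segment size (16 MiB); configurable at initdb time.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def segments_to_preallocate(fill_rate, write_pos, allocated_upto,
                            horizon_secs, seg_size=WAL_SEGMENT_SIZE):
    """Estimate how many new segments are needed (hypothetical helper).

    fill_rate      -- recent WAL generation rate, in bytes per second
    write_pos      -- current end of WAL, as a byte position
    allocated_upto -- byte position up to which segments already exist
    horizon_secs   -- how far ahead (in seconds) we want to be covered
    """
    # WAL we expect to generate within the lookahead window.
    expected_bytes = fill_rate * horizon_secs
    # "Future" space that already exists (recycled or preallocated).
    reserve = max(allocated_upto - write_pos, 0)
    shortfall = expected_bytes - reserve
    if shortfall <= 0:
        # Large reserve of future segments: nothing to preallocate.
        return 0
    return math.ceil(shortfall / seg_size)
```

With a large reserve the result is zero, and with no reserve the new segments
would be created directly under their final names, which is why a separate
directory looks unnecessary in either case.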

> That being said, it would be nice to avoid the fsync() overhead to move a
> pre-allocated WAL into place. My first instinct is that would be
> substantially more complicated and may not actually improve matters all
> that much, but I agree that it's worth exploring.

FWIW, I've seen the fsyncs around recycling be a rather substantial
bottleneck, to the point that the main benefit of larger segments is the
reduction in the number of fsyncs at the end of a checkpoint. I think we should
be able to make the fsyncs a lot more efficient by batching them: first rename
a bunch of files, then fsync them and the directory. The current pattern
basically requires a separate filesystem journal flush for each WAL segment.
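
The batching order described above could be sketched roughly as follows. This
is an illustration in Python, not PostgreSQL code: the function name and its
arguments are hypothetical, and it assumes a POSIX system where a directory
can be opened read-only and passed to fsync().

```python
import os


def install_segments_batched(renames, directory):
    """Rename many segment files, then flush them together (hypothetical).

    renames   -- list of (old_name, new_name) pairs inside `directory`
    directory -- path of the directory holding the segments
    """
    # Phase 1: perform all renames without flushing in between.
    for src, dst in renames:
        os.rename(os.path.join(directory, src),
                  os.path.join(directory, dst))

    # Phase 2: fsync each renamed file.
    for _, dst in renames:
        fd = os.open(os.path.join(directory, dst), os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Phase 3: fsync the directory once for the whole batch, instead of
    # once per segment.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether this actually reduces the number of journal flushes depends on the
filesystem; the sketch only shows the ordering of operations.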

Greetings,

Andres Freund
