Re: Purpose of wal_init_zero

From: Hannu Krosing <hannuk(at)google(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Theodore Tso <tytso(at)google(dot)com>
Cc: Ritu Bhandari <mailritubhandari(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Purpose of wal_init_zero
Date: 2025-01-20 11:17:00
Message-ID: CAMT0RQSinLS7RjRd55m_zRdALordm0fRR09q=iMbBtC3E=M1EQ@mail.gmail.com
Lists: pgsql-hackers

Thinking back, I can now see why disabling WAL writes with
wal_level=minimal in COPY resulted in 3x better write performance
instead of the expected 2x.

With wal_level=minimal only the heap page writes were needed, whereas
with WAL enabled the same data was written 3x (heap + WAL zero-fill +
WAL).
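
A rough sketch of that arithmetic, in Python. The function name and the
simplifying assumptions are mine, purely for illustration: WAL volume is
taken to be about equal to heap volume, and every WAL segment is assumed
to be newly created and zero-filled rather than recycled.

```python
def bytes_written(heap_bytes, wal_enabled, zero_fill):
    """Estimate total bytes sent to storage for a bulk load.

    Illustrative assumptions (not measured): WAL volume equals heap
    volume, and every WAL segment is freshly created, never recycled.
    """
    total = heap_bytes              # heap pages are always written
    if wal_enabled:
        total += heap_bytes         # the WAL records themselves
        if zero_fill:
            total += heap_bytes     # zero-filling each new WAL segment
    return total


heap = 1_000_000
minimal = bytes_written(heap, wal_enabled=False, zero_fill=False)
logged = bytes_written(heap, wal_enabled=True, zero_fill=True)
print(logged / minimal)  # ratio of logged to unlogged write volume
```

Under these assumptions the zero-fill pass accounts for the gap between
the expected 2x and the observed 3x; with recycled segments the ratio
would sit closer to 2x.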

--
Hannu

On Mon, Jan 20, 2025 at 12:06 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> On Fri, Jan 17, 2025 at 10:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> ...
> > > I see, PG once had fallocate [1] (which was reverted by [2] due to some
> > > performance regression concern). The original OSS discussion was in [3].
> > > The perf regression was reported in [4]. Looks like this was due to how
> > > ext4 handled extents and uninitialized data[5] and that seems to be fixed
> > > in [6]. I'll check with Theodore Ts'o to confirm on [6].
> > >
> > > Could we consider adding back fallocate?
> >
> > Fallocate doesn't really help unfortunately. On common filesystems (like
> > ext4/xfs) it just allocates filespace without zeroing out the underlying
> > blocks.
>
> @Theodore Tso - can you confirm that ext4 (and xfs?) does not use the
> low-level WRITE ZEROS commands for initializing the newly allocated
> blocks?
>
> And that the new blocks will be written twice - once for zero-filling
> and then with the actual data?
>
> For WAL we really don't need to zero out anything - we already do WAL
> file recycling without zero-filling the recycled segments, so
> obviously it is all right to have random garbage in the pages.
>
> > To make that correct, those filesystems keep a bitmap indicating which
> > blocks in the range are not yet written. Unfortunately updating those blocks
> > is a metadata operation and thus requires journaling.
> >
> > I've seen some mild speedups by first using fallocate and then zeroing out the
> > file, particularly with larger segment sizes.
>
> Did you just write a single zero page per file page to avoid
> duplicating the work?
>
> > I think mainly due to avoiding
> > delayed allocation in the filesystem, rather than actually reducing
> > fragmentation. But it really isn't a whole lot.
> >
> > I've in the past tried to get the linux filesystem developers to add an
> > fallocate mode that doesn't utilize the "unwritten extents" "optimization",
> > but didn't have luck with that.
>
> Are you saying that the first write to a newly allocated empty block
> currently will do two writes to the disk - first writing the zeros and
> then writing the actual data?
>
> Or just that the overhead from journalling the change to the
> not-yet-written bitmap cancels out the win from not writing the page
> twice?
>
> > The block layer in linux actually does have
> > support for zeroing out regions of blocks without having to actually write
> > the data, but it's only used in some narrow cases (don't remember the
> > details).
>
> For WAL files we should be ok by either using the declarative no-write
> zero fill in the block layer, or just using the pages as-is without
> any zero-filling at all (though this is likely not possible because of
> required Linux filesystem semantics).
>
> > Greetings,
> >
> > Andres Freund
> >
> >
