From: | Hannu Krosing <hannuk(at)google(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Theodore Tso <tytso(at)google(dot)com> |
Cc: | Ritu Bhandari <mailritubhandari(at)gmail(dot)com>, Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Purpose of wal_init_zero |
Date: | 2025-01-20 11:06:45 |
Message-ID: | CAMT0RQTi_SyuLOWrczDr0bd=qfga_A5rFAZKEP45yKvJa=VaDQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jan 17, 2025 at 10:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
...
> > I see, PG once had fallocate [1] (which was reverted by [2] due to some
> > performance regression concern). The original OSS discussion was in [3].
> > The perf regression was reported in [4]. Looks like this was due to how
> > ext4 handled extents and uninitialized data[5] and that seems to be fixed
> > in [6]. I'll check with Theodore Ts'o to confirm on [6].
> >
> > Could we consider adding back fallocate?
>
> Fallocate doesn't really help unfortunately. On common filesystems (like
> ext4/xfs) it just allocates filespace without zeroing out the underlying
> blocks.
@Theodore Tso - can you confirm that ext4 (and xfs?) do not use the
low-level WRITE ZEROES commands to initialize newly allocated blocks,
and that new blocks are therefore written twice: once for zero-filling
and then again with the actual data?
For WAL we really don't need to zero out anything - we already do WAL
file recycling without zero-filling the recycled segments, so
obviously it is all right to have random garbage in the pages.
> To make that correct, those filesystems keep a bitmap indicating which
> blocks in the range are not yet written. Unfortunately updating those blocks
> is a metadata operation and thus requires journaling.
>
> I've seen some mild speedups by first using fallocate and then zeroing out the
> file, particularly with larger segment sizes.
Did you just write a single zero page per file page to avoid
duplicating the work?
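
For reference, a minimal sketch of the "fallocate first, then zero out
the file" approach mentioned above. This is not the PostgreSQL code:
the helper name preallocate_and_zero and the ZERO_BLOCK_SIZE constant
are hypothetical, and the 8192-byte chunk is only an assumption
mirroring the default WAL block size. It assumes Linux/POSIX and a
segment size greater than zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed chunk size for zero-filling (hypothetical; mirrors an 8 kB block). */
#define ZERO_BLOCK_SIZE 8192

/*
 * Hypothetical helper: reserve `size` bytes for `path` up front, then
 * overwrite the whole range with zeros and flush it to disk.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
preallocate_and_zero(const char *path, off_t size)
{
    /* Objects with static storage duration are zero-initialized. */
    static const char zeros[ZERO_BLOCK_SIZE];
    off_t       off = 0;
    int         rc;
    int         fd;

    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    /*
     * Step 1: reserve the space so the filesystem can pick the blocks
     * immediately. posix_fallocate() returns an error number directly
     * instead of setting errno.
     */
    rc = posix_fallocate(fd, 0, size);
    if (rc != 0)
    {
        close(fd);
        errno = rc;
        return -1;
    }

    /* Step 2: write zeros over the reserved range, one chunk at a time. */
    while (off < size)
    {
        size_t      chunk = (size - off) < ZERO_BLOCK_SIZE
            ? (size_t) (size - off) : ZERO_BLOCK_SIZE;
        ssize_t     n = pwrite(fd, zeros, chunk, off);

        if (n < 0)
        {
            int         save_errno = errno;

            if (save_errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            close(fd);
            errno = save_errno;
            return -1;
        }
        off += n;               /* also handles short writes */
    }

    /* Step 3: make the zeroed contents durable before the file is used. */
    if (fsync(fd) != 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}
```

A call such as preallocate_and_zero("segment.tmp", 16 * 1024 * 1024)
would prepare one 16 MB file. Note that the zero-fill in step 2 still
writes every block once; the sketch only moves the block allocation
ahead of those writes.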
> I think mainly due to avoiding
> delayed allocation in the filesystem, rather than actually reducing
> fragmentation. But it really isn't a whole lot.
>
> I've in the past tried to get the linux filesystem developers to add a
> fallocate mode that doesn't utilize the "unwritten extents" "optimization",
> but didn't have luck with that.
Are you saying that the first write to a newly allocated empty block
currently results in two writes to the disk: first the zeros and then
the actual data?
Or just that the overhead of journaling the change to the
not-yet-written bitmap cancels out the win from not writing the page
twice?
> The block layer in linux actually does have
> support for zeroing out regions of blocks without having to actually write
> the data, but it's only used in some narrow cases (don't remember the
> details).
For WAL files we should be OK with either using the declarative no-write
zero fill in the block layer, or just using the pages as-is without
any zero-filling at all (though the latter is likely not possible
because of the semantics Linux filesystems are required to provide).
> Greetings,
>
> Andres Freund
>
>