From: | Pankaj Raghav <kernel(at)pankajraghav(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com |
Subject: | Re: Large block sizes support in Linux |
Date: | 2024-03-25 15:06:04 |
Message-ID: | 97e72ccb-193f-43e0-97ac-17359c1c874b@pankajraghav.com |
Lists: | pgsql-hackers |
On 23/03/2024 03:41, Bruce Momjian wrote:
> On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote:
>> Right, but things change over time - current storage devices support
>> much larger sectors (LBA format), usually 4K. And if you do I/O with
>> this size, it's usually atomic.
>>
>> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
>> format, that would not need full-page writes - we always do I/O in 4k
>> pages, and block layer does I/O (during writeback from page cache) with
>> minimum guaranteed size = logical block size. 4K are great for OLTP
>> systems in general, it'd be even better if we didn't need to worry about
>> torn pages (but the tricky part is to be confident it's safe to disable
>> them on a particular system).
>
> Yes, even if the file system is 8k, and the storage is 8k, we only know
> that torn pages are impossible if the file system never overwrites
> existing 8k pages, but writes new ones and then makes it active. I
> think ZFS does that to handle snapshots.
>
I think we can also avoid torn writes if both of the following hold:
- the filesystem's data path always writes in 8k-aligned multiples of 8k
- the device supports 8k atomic writes.
Then we might be able to push the responsibility to the device, without the overhead
of a CoW FS or FPW=on (full_page_writes). Of course, the performance here depends on the
vendor-specific implementation of atomic writes.
We are trying to enable the former by adding LBS (large block size) support to XFS in Linux.
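
To make the two conditions concrete, here is a rough sketch in Python. It is only an illustration, not anything from Postgres or the kernel: the function name, the `atomic_write_max` parameter and the 8k default are all made up for the example, and it assumes the device honours atomicity for each naturally aligned 8k unit within a larger write.

```python
# Hypothetical helper: none of these names exist in Postgres or Linux.
PG_PAGE_SIZE = 8192  # assumed Postgres page size (the default BLCKSZ)


def write_is_untearable(offset: int, length: int, atomic_write_max: int,
                        page_size: int = PG_PAGE_SIZE) -> bool:
    """Return True if a write meets both conditions for avoiding torn pages.

    Condition 1: the write starts on a page boundary and covers whole pages.
    Condition 2: the device can write at least one full page atomically.
    """
    if length <= 0 or page_size <= 0:
        return False
    # Condition 1: aligned start and a length that is a multiple of the page.
    aligned = offset % page_size == 0 and length % page_size == 0
    # Condition 2: the device's atomic write unit covers a whole page.
    device_ok = atomic_write_max >= page_size
    return aligned and device_ok
```

For example, an 8k write at offset 0 on a device with an 8k atomic unit passes, while the same write at offset 4096, or on a device with only a 4k atomic unit, does not.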
--
Pankaj