Re: Initdb-time block size specification

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>
Cc: David Christensen <david(dot)christensen(at)crunchydata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Initdb-time block size specification
Date: 2023-06-30 23:13:31
Message-ID: c356e5b7-ab37-845f-04cb-1f5649a9c673@enterprisedb.com
Lists: pgsql-hackers

On 7/1/23 00:59, Andres Freund wrote:
> On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
>> On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
>>> On 6/30/23 23:53, Bruce Momjian wrote:
>>>> For a 4kB write, to say it is not partially written would be to require
>>>> the operating system to guarantee that the 4kB write is not split into
>>>> smaller writes which might each be atomic because smaller atomic writes
>>>> would not help us.
>>>
>>> Right, that's the dance we do to protect against torn pages. But Andres
>>> suggested that if you have modern storage and configure it correctly,
>>> writing with 4kB pages would be atomic. So we wouldn't need to do this
>>> FPI stuff, eliminating a pretty significant source of write amplification.
>>
>> I agree the hardware is atomic for 4k writes, but do we know the OS
>> always issues 4k writes?
>
> When using a sector size of 4K you *can't* make smaller writes via normal
> paths. The addressing unit is in sectors. The details obviously differ between
> storage protocol, but you pretty much always just specify a start sector and a
> number of sectors to be operated on.
>
> Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
> 4k back, but that shouldn't be a danger here. There might also be debug
> interfaces to allow reading/writing in different increments, but that'd not be
> something happening during normal operation.

I think it's important to point out that there's both a physical and a
logical sector size. The "physical" size is what the drive uses
internally, while the "logical" size defines the unit the OS addresses.

Some drives have 4k physical sectors but only 512B logical sectors.
AFAIK most "old" SATA SSDs do it that way, for compatibility reasons.

Newer drives may have 4k physical sectors but typically support both 512B
and 4k logical sectors - my NVMe SSDs do this, for example.

My understanding is that for drives with 4k physical+logical sectors,
the OS would only issue "full" 4k writes.
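
To check which case a given drive falls into, here is a minimal sketch,
assuming Linux, where the block layer exposes both sizes in sysfs as
queue/logical_block_size and queue/physical_block_size. The helper names
(sector_sizes, full_4k_writes_expected) are made up for illustration, and
the second one only encodes my understanding stated above, not a
guarantee from the kernel or the drive.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector size in bytes for a block device.

    Assumes the Linux sysfs layout <sysfs_root>/<device>/queue/...
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical


def full_4k_writes_expected(logical, physical):
    """True only for the 4k physical + 4k logical case discussed above.

    This is an assumption about OS behavior, not a guarantee of atomicity.
    """
    return logical == 4096 and physical == 4096


if __name__ == "__main__":
    root = Path("/sys/block")
    if root.is_dir():
        for dev in sorted(p.name for p in root.iterdir()):
            try:
                logical, physical = sector_sizes(dev)
            except (OSError, ValueError):
                # Skip devices that do not expose these attributes.
                continue
            ok = full_4k_writes_expected(logical, physical)
            print(f"{dev}: logical={logical} physical={physical} "
                  f"4k-only writes expected: {ok}")
```

A drive reporting logical=512 and physical=4096 is the "compatibility"
case from above, where the OS can still address 512B units.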

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
