From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, sanyam jain <sanyamjain22(at)live(dot)in>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Setting BLCKSZ 4kB |
Date: | 2018-01-27 11:40:03 |
Message-ID: | 39f9fcb4-33e9-52bd-0c44-aa1b5d2fcd21@2ndquadrant.com |
Lists: | pgsql-hackers |
On 01/27/2018 05:01 AM, Bruce Momjian wrote:
> On Fri, Jan 26, 2018 at 11:53:33PM +0100, Tomas Vondra wrote:
>>
>> ...
>>
>> FWIW even if it's not safe in general, it would be useful to
>> understand what the requirements are to make it work. I mean,
>> conditions that need to be met on various levels (sector size of
>> the storage device, page size of the file system, filesystem
>> alignment, ...).
>
> I think you are fine as soon as the data arrives at the durable
> storage, and assuming the data can't be partially written to durable
> storage. I was thinking more of a case where you have a file system,
> a RAID card without a BBU, and then magnetic disks. In that case,
> even if the file system were to write in 4k chunks, the RAID
> controller would also need to do the same, and with the same
> alignment. Of course, that's probably a silly example since there is
> probably no way to atomically write 4k to a magnetic disk.
>
> Actually, what happens if a 4k write is being written to an SSD and
> the server crashes? Is the entire write discarded?
>
AFAIK it's not possible to end up with a partial write, and in particular
not one that contains a mix of old and new data - that's because SSDs
can't overwrite a block without erasing it first.

So the write should either succeed or fail as a whole, depending on when
exactly the server crashes - it might be right before confirming the
flush back to the client, for example. That assumes the drive has 4kB
sectors (internal pages), and it applies to drives with a volatile write
cache that support write barriers and cache flushes. On drives with a
non-volatile write cache (so with battery/capacitor) the write should
always succeed and never get discarded.
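
To make the sector-size and alignment conditions a bit more concrete, here
is a rough sketch in Python. It only does the arithmetic: given a write
offset, a write length and a sector size, it counts how many device sectors
the write touches. The names (BLCKSZ as a Python constant, sectors_touched,
fits_single_sector) are made up for illustration, and the sector size is
simply passed in, it is not queried from any real device. Touching a single
sector is a precondition for the reasoning above, not a guarantee that the
drive actually writes it atomically.

```python
BLCKSZ = 4096  # assumed block size from the subject of this thread


def sectors_touched(offset: int, length: int, sector_size: int) -> int:
    """Return how many device sectors a write of `length` bytes at `offset` touches."""
    if offset < 0 or length <= 0 or sector_size <= 0:
        raise ValueError("offset must be >= 0, length and sector_size must be > 0")
    first = offset // sector_size                  # sector holding the first byte
    last = (offset + length - 1) // sector_size    # sector holding the last byte
    return last - first + 1


def fits_single_sector(offset: int, length: int = BLCKSZ,
                       sector_size: int = 4096) -> bool:
    """True if the write lands entirely within one sector of the given size."""
    return sectors_touched(offset, length, sector_size) == 1
```

With these definitions, an aligned 4kB write on a 4kB-sector drive touches
one sector, the same write shifted by 512 bytes (a misaligned partition, for
example) touches two, and on a 512-byte-sector drive it touches eight.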
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services