From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | Peter Smith <smithpb2250(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: Relation bulk write facility |
Date: | 2024-07-02 11:42:50 |
Message-ID: | cac7d1b6-8358-40be-af0b-21bc9b27d34c@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 02/07/2024 02:24, Noah Misch wrote:
> On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
>> On 01/07/2024 23:52, Noah Misch wrote:
>>> Commit 8af2565 wrote:
>>>> --- /dev/null
>>>> +++ b/src/backend/storage/smgr/bulk_write.c
>>>
>>>> +/*
>>>> + * Finish bulk write operation.
>>>> + *
>>>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
>>>> + * the relation if needed.
>>>> + */
>>>> +void
>>>> +smgr_bulk_finish(BulkWriteState *bulkstate)
>>>> +{
>>>> + /* WAL-log and flush any remaining pages */
>>>> + smgr_bulk_flush(bulkstate);
>>>> +
>>>> + /*
>>>> + * When we wrote out the pages, we passed skipFsync=true to avoid the
>>>> + * overhead of registering all the writes with the checkpointer. Register
>>>> + * the whole relation now.
>>>> + *
>>>> + * There is one hole in that idea: If a checkpoint occurred while we were
>>>> + * writing the pages, it already missed fsyncing the pages we had written
>>>> + * before the checkpoint started. A crash later on would replay the WAL
>>>> + * starting from the checkpoint, therefore it wouldn't replay our earlier
>>>> + * WAL records. So if a checkpoint started after the bulk write, fsync
>>>> + * the files now.
>>>> + */
>>>> + if (!SmgrIsTemp(bulkstate->smgr))
>>>> + {
>>>
>>> Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
>>> decision is irrelevant to the !wal case. Either we don't need fsync at all
>>> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
>>
>> The point of GetRedoRecPtr() is to detect if a checkpoint has started
>> concurrently. It works for that purpose whether or not the bulk load is
>> WAL-logged. It is not compared with the LSNs of WAL records written by the
>> bulk load.
>
> I think the significance of start_RedoRecPtr is it preceding all records
> needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and
> we crash after commit, we're indifferent to whether the rel gets synced at a
> checkpoint before that crash or rebuilt from WAL after that crash. If
> start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
> deleted, so only smgrimmedsync() suffices. Overall, while it is not compared
> with LSNs in WAL records, it's significant only to the extent that such a WAL
> record exists. What am I missing?
You're right. You pointed out below that we don't need to register or
immediately fsync the relation if it was not WAL-logged, I missed that.
In the alternative universe that we did need to fsync() even in !use_wal
case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect
whether the last checkpoint "missed" fsyncing the files that we wrote.
But the point is moot now.
>> Unlogged tables do need to be fsync'd. The scenario is:
>>
>> 1. Bulk load an unlogged table.
>> 2. Shut down Postgres cleanly
>> 3. Pull power plug from server, and restart.
>>
>> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
>> thread [1]. I had already forgotten about that; that bug actually still
>> exists in back branches, and we should fix it..
>>
>> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi
>
> Ah, that's right. I agree this code suffices for unlogged. As a further
> optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
> always call smgrregistersync(). (For any rel, smgrimmedsync() improves on
> smgrregistersync() only if we fail to reach the shutdown checkpoint. Without
> a shutdown checkpoint, unlogged rels get reset anyway.)
>
>>> I don't see any functional problem, but this likely arranges for an
>>> unnecessary sync when a checkpoint starts between mdcreate() and
>>> here. (The mdcreate() sync may also be unnecessary, but that's
>>> longstanding.)
>> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
>> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
>> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
>> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
>> fsyncing it, and in that case we do still need the fsync.
>
> We do not need the fsync in the "WAL-log the relation instead" case; see
> https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com
Ah, true, I missed that log_newpage_range() loads the pages to the
buffer cache and dirties them. That kinds of sucks actually, I wish it
didn't need to dirty the buffers.
> So maybe like this:
>
> if (use_wal) /* includes init forks */
> current logic;
> else if (unlogged)
> smgrregistersync;
> /* else temp || (permanent && wal_level=minimal): nothing to do */
Makes sense, except that we cannot distinguish between unlogged
relations and permanent relations with !use_wal here.
It would be nice to have relpersistence flag in SMgrRelation. I remember
wanting to have that before, although I don't remember what the context
was exactly.
>> Fortunately, fsync() on a file that's already flushed to disk is pretty
>> cheap.
>
> Yep. I'm more concerned about future readers wondering why the function is
> using LSNs to decide what to do about data that doesn't appear in WAL. A
> comment could be another way to fix that, though.
Agreed, this is all very subtle, and deserves a good comment. What do
you think of the attached?
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
0001-Relax-fsyncing-at-end-of-bulk-load-that-was-not-WAL-.patch | text/x-patch | 3.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dean Rasheed | 2024-07-02 11:44:18 | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. |
Previous Message | Ranier Vilela | 2024-07-02 11:07:42 | Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c) |