Re: Relation bulk write facility

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Relation bulk write facility
Date: 2024-07-02 11:42:50
Message-ID: cac7d1b6-8358-40be-af0b-21bc9b27d34c@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/07/2024 02:24, Noah Misch wrote:
> On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
>> On 01/07/2024 23:52, Noah Misch wrote:
>>> Commit 8af2565 wrote:
>>>> --- /dev/null
>>>> +++ b/src/backend/storage/smgr/bulk_write.c
>>>
>>>> +/*
>>>> + * Finish bulk write operation.
>>>> + *
>>>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
>>>> + * the relation if needed.
>>>> + */
>>>> +void
>>>> +smgr_bulk_finish(BulkWriteState *bulkstate)
>>>> +{
>>>> + /* WAL-log and flush any remaining pages */
>>>> + smgr_bulk_flush(bulkstate);
>>>> +
>>>> + /*
>>>> + * When we wrote out the pages, we passed skipFsync=true to avoid the
>>>> + * overhead of registering all the writes with the checkpointer. Register
>>>> + * the whole relation now.
>>>> + *
>>>> + * There is one hole in that idea: If a checkpoint occurred while we were
>>>> + * writing the pages, it already missed fsyncing the pages we had written
>>>> + * before the checkpoint started. A crash later on would replay the WAL
>>>> + * starting from the checkpoint, therefore it wouldn't replay our earlier
>>>> + * WAL records. So if a checkpoint started after the bulk write, fsync
>>>> + * the files now.
>>>> + */
>>>> + if (!SmgrIsTemp(bulkstate->smgr))
>>>> + {
>>>
>>> Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
>>> decision is irrelevant to the !wal case. Either we don't need fsync at all
>>> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
>>
>> The point of GetRedoRecPtr() is to detect if a checkpoint has started
>> concurrently. It works for that purpose whether or not the bulk load is
>> WAL-logged. It is not compared with the LSNs of WAL records written by the
>> bulk load.
>
> I think the significance of start_RedoRecPtr is it preceding all records
> needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and
> we crash after commit, we're indifferent to whether the rel gets synced at a
> checkpoint before that crash or rebuilt from WAL after that crash. If
> start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
> deleted, so only smgrimmedsync() suffices. Overall, while it is not compared
> with LSNs in WAL records, it's significant only to the extent that such a WAL
> record exists. What am I missing?

You're right. You pointed out below that we don't need to register or
immediately fsync the relation if it was not WAL-logged, I missed that.

In the alternative universe that we did need to fsync() even in !use_wal
case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect
whether the last checkpoint "missed" fsyncing the files that we wrote.
But the point is moot now.

>> Unlogged tables do need to be fsync'd. The scenario is:
>>
>> 1. Bulk load an unlogged table.
>> 2. Shut down Postgres cleanly
>> 3. Pull power plug from server, and restart.
>>
>> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
>> thread [1]. I had already forgotten about that; that bug actually still
>> exists in back branches, and we should fix it..
>>
>> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi
>
> Ah, that's right. I agree this code suffices for unlogged. As a further
> optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
> always call smgrregistersync(). (For any rel, smgrimmedsync() improves on
> smgrregistersync() only if we fail to reach the shutdown checkpoint. Without
> a shutdown checkpoint, unlogged rels get reset anyway.)
>
>>> I don't see any functional problem, but this likely arranges for an
>>> unnecessary sync when a checkpoint starts between mdcreate() and
>>> here. (The mdcreate() sync may also be unnecessary, but that's
>>> longstanding.)
>> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
>> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
>> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
>> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
>> fsyncing it, and in that case we do still need the fsync.
>
> We do not need the fsync in the "WAL-log the relation instead" case; see
> https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com

Ah, true, I missed that log_newpage_range() loads the pages to the
buffer cache and dirties them. That kinds of sucks actually, I wish it
didn't need to dirty the buffers.

> So maybe like this:
>
> if (use_wal) /* includes init forks */
> current logic;
> else if (unlogged)
> smgrregistersync;
> /* else temp || (permanent && wal_level=minimal): nothing to do */

Makes sense, except that we cannot distinguish between unlogged
relations and permanent relations with !use_wal here.

It would be nice to have relpersistence flag in SMgrRelation. I remember
wanting to have that before, although I don't remember what the context
was exactly.

>> Fortunately, fsync() on a file that's already flushed to disk is pretty
>> cheap.
>
> Yep. I'm more concerned about future readers wondering why the function is
> using LSNs to decide what to do about data that doesn't appear in WAL. A
> comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do
you think of the attached?

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
0001-Relax-fsyncing-at-end-of-bulk-load-that-was-not-WAL-.patch text/x-patch 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2024-07-02 11:44:18 Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Previous Message Ranier Vilela 2024-07-02 11:07:42 Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)