Re: Relation bulk write facility

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Relation bulk write facility
Date: 2024-07-01 23:24:48
Message-ID: 20240701232448.c9.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
> On 01/07/2024 23:52, Noah Misch wrote:
> > Commit 8af2565 wrote:
> > > --- /dev/null
> > > +++ b/src/backend/storage/smgr/bulk_write.c
> >
> > > +/*
> > > + * Finish bulk write operation.
> > > + *
> > > + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> > > + * the relation if needed.
> > > + */
> > > +void
> > > +smgr_bulk_finish(BulkWriteState *bulkstate)
> > > +{
> > > + /* WAL-log and flush any remaining pages */
> > > + smgr_bulk_flush(bulkstate);
> > > +
> > > + /*
> > > + * When we wrote out the pages, we passed skipFsync=true to avoid the
> > > + * overhead of registering all the writes with the checkpointer. Register
> > > + * the whole relation now.
> > > + *
> > > + * There is one hole in that idea: If a checkpoint occurred while we were
> > > + * writing the pages, it already missed fsyncing the pages we had written
> > > + * before the checkpoint started. A crash later on would replay the WAL
> > > + * starting from the checkpoint, therefore it wouldn't replay our earlier
> > > + * WAL records. So if a checkpoint started after the bulk write, fsync
> > > + * the files now.
> > > + */
> > > + if (!SmgrIsTemp(bulkstate->smgr))
> > > + {
> >
> > Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
> > decision is irrelevant to the !wal case. Either we don't need fsync at all
> > (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
>
> The point of GetRedoRecPtr() is to detect if a checkpoint has started
> concurrently. It works for that purpose whether or not the bulk load is
> WAL-logged. It is not compared with the LSNs of WAL records written by the
> bulk load.

I think the significance of start_RedoRecPtr is it preceding all records
needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash. If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices. Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists. What am I missing?

> Unlogged tables do need to be fsync'd. The scenario is:
>
> 1. Bulk load an unlogged table.
> 2. Shut down Postgres cleanly
> 3. Pull power plug from server, and restart.
>
> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
> thread [1]. I had already forgotten about that; that bug actually still
> exists in back branches, and we should fix it..
>
> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

Ah, that's right. I agree this code suffices for unlogged. As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync(). (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint. Without
a shutdown checkpoint, unlogged rels get reset anyway.)

> > I don't see any functional problem, but this likely arranges for an
> > unnecessary sync when a checkpoint starts between mdcreate() and
> > here. (The mdcreate() sync may also be unnecessary, but that's
> > longstanding.)
> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
> fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com

So maybe like this:

if (use_wal) /* includes init forks */
current logic;
else if (unlogged)
smgrregistersync;
/* else temp || (permanent && wal_level=minimal): nothing to do */

> Fortunately, fsync() on a file that's already flushed to disk is pretty
> cheap.

Yep. I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL. A
comment could be another way to fix that, though.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2024-07-02 00:07:57 Re: Parallel CREATE INDEX for GIN indexes
Previous Message Noah Misch 2024-07-01 23:03:52 Re: Built-in CTYPE provider