Re: Relation bulk write facility

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Relation bulk write facility
Date: 2024-07-01 20:52:50
Message-ID: 20240701205250.62.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Commit 8af2565 wrote:
> --- /dev/null
> +++ b/src/backend/storage/smgr/bulk_write.c

> +/*
> + * Finish bulk write operation.
> + *
> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> + * the relation if needed.
> + */
> +void
> +smgr_bulk_finish(BulkWriteState *bulkstate)
> +{
> + /* WAL-log and flush any remaining pages */
> + smgr_bulk_flush(bulkstate);
> +
> + /*
> + * When we wrote out the pages, we passed skipFsync=true to avoid the
> + * overhead of registering all the writes with the checkpointer. Register
> + * the whole relation now.
> + *
> + * There is one hole in that idea: If a checkpoint occurred while we were
> + * writing the pages, it already missed fsyncing the pages we had written
> + * before the checkpoint started. A crash later on would replay the WAL
> + * starting from the checkpoint, therefore it wouldn't replay our earlier
> + * WAL records. So if a checkpoint started after the bulk write, fsync
> + * the files now.
> + */
> + if (!SmgrIsTemp(bulkstate->smgr))
> + {

Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based
decision is irrelevant to the !wal case. Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal). I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here. (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

> + /*
> + * Prevent a checkpoint from starting between the GetRedoRecPtr() and
> + * smgrregistersync() calls.
> + */
> + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +
> + if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
> + {
> + /*
> + * A checkpoint occurred and it didn't know about our writes, so
> + * fsync() the relation ourselves.
> + */
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
> + elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
> + }
> + else
> + {
> + smgrregistersync(bulkstate->smgr, bulkstate->forknum);
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + }
> + }
> +}

This is an elegant optimization.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-07-01 21:53:05 Re: Relation bulk write facility
Previous Message Tomas Vondra 2024-07-01 20:20:41 Re: WIP: parallel GiST index builds