Re: Avoiding smgrimmedsync() during nbtree index builds

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2022-01-23 21:55:41
Message-ID: 20220123215541.xxv54akpdcjy5klb@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
>
> Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See https://postgr.es/m/20220116202559.GW14051%40telsasoft.com and
the subsequent mail about it also triggering on once on linux.

> Thus, the backend must ensure that
> either the Redo pointer has not moved or that the data is fsync'd before
> freeing the page.

"freeing"?

> This is not a problem with pages written in shared buffers because the
> checkpointer will block until all buffers that were dirtied before it
> began finish before it moves the Redo pointer past their associated WAL
> entries.

> This commit makes two main changes:
>
> 1) It wraps smgrextend() and smgrwrite() in functions from a new API
> for writing data outside of shared buffers, directmgr.
>
> 2) It saves the XLOG Redo pointer location before doing the write or
> extend. It also adds an fsync request for the page to the
> checkpointer's pending-ops table. Then, after doing the write or
> extend, if the Redo pointer has moved (meaning a checkpoint has
> started since it saved it last), then the backend fsync's the page
> itself. Otherwise, it lets the checkpointer take care of fsync'ing
> the page the next time it processes the pending-ops table.

Why combine those two into one commit?

> @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
> /* Now extend the file */
> while (vm_nblocks_now < vm_nblocks)
> {
> - PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> -
> - smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> + // TODO: aren't these pages empty? why checksum them
> + unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

/* If we don't need a checksum, just return */
if (PageIsNew(page) || !DataChecksumsEnabled())
return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().

> @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
>
> wstate.heap = btspool->heap;
> wstate.index = btspool->index;
> + wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> + wstate.ub_wstate.redo = InvalidXLogRecPtr;
> wstate.inskey = _bt_mkscankey(wstate.index, NULL);
> /* _bt_mkscankey() won't set allequalimage without metapage */
> wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> if (!wstate->btws_zeropage)
> wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> /* don't set checksum for all-zero page */
> - smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> - wstate->btws_pages_written++,
> - (char *) wstate->btws_zeropage,
> - true);
> + unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
> }

There's a bunch of long lines in here...

> - /*
> - * When we WAL-logged index pages, we must nonetheless fsync index files.
> - * Since we're building outside shared buffers, a CHECKPOINT occurring
> - * during the build has no way to flush the previously written data to
> - * disk (indeed it won't know the index even exists). A crash later on
> - * would replay WAL from the checkpoint, therefore it wouldn't replay our
> - * earlier WAL entries. If we do not fsync those pages here, they might
> - * still not be on disk when the crash occurs.
> - */
> if (wstate->btws_use_wal)
> - smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> + unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
> }

The API of unbuffered_finish() only sometimes getting called, but
unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
it'd make sense to pass in whether the relation needs to be synced or not instead?

> spgbuildempty(Relation index)
> {
> Page page;
> + UnBufferedWriteState wstate;
> +
> + wstate.smgr_rel = RelationGetSmgr(index);
> + unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause cause of the bug CI is seeing:

* Note: since a relcache flush can cause the file handle to be closed again,
* it's unwise to hold onto the pointer returned by this function for any
* long period. Recommended practice is to just re-execute RelationGetSmgr
* each time you need to access the SMgrRelation. It's quite cheap in
* comparison to whatever an smgr function is going to do.
*/
static inline SMgrRelation
RelationGetSmgr(Relation rel)

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-01-23 22:11:16 Re: pg_basebackup fsyncs some files despite --no-sync (was: Adding CI to our tree)
Previous Message Peter Smith 2022-01-23 21:36:24 Re: row filtering for logical replication