Recovery vs. RelationTruncate(); skipFsync vs. unlogged rels

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Recovery vs. RelationTruncate(); skipFsync vs. unlogged rels
Date: 2019-11-30 06:05:14
Message-ID: 20191130060514.GA108490@gust.leadboat.com
Lists: pgsql-hackers

While reviewing the patches in
http://postgr.es/m/20191126.213752.2132434859202124793.horikyota.ntt@gmail.com
I noticed three related problems. The defects motivating that other thread
are specific to wal_level=minimal. The related problems are not specific to
any wal_level, hence my starting a separate thread.

1. We don't sync unlogged tables after skipFsync=true operations.

smgrwrite() explains the contract for skipFsync=true. Concretely, it causes
md.c to skip enqueueing the relation to be synced during the next checkpoint.
Three places in the tree call smgrextend(..., skipFsync=true) and/or
smgrwrite(..., skipFsync=true):

rewriteheap.c (CLUSTER, VACUUM FULL)
_bt_blwritepage() (CREATE INDEX)
RelationCopyStorage() (ALTER TABLE SET TABLESPACE)

They use logic like this to decide when to sync:

    /*
     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
     * to ensure that the toast table gets fsync'd too.  (For a temp or
     * unlogged rel we don't care since the data will be gone after a crash
     * anyway.)
     *
     * It's obvious that we must do this when not WAL-logging the copy. It's
     * less obvious that we have to do it even if we did WAL-log the copied
     * pages. The reason is that since we're copying outside shared buffers, a
     * CHECKPOINT occurring during the copy has no way to flush the previously
     * written data to disk (indeed it won't know the new rel even exists). A
     * crash later on would replay WAL from the checkpoint, therefore it
     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
     * here, they might still not be on disk when the crash occurs.
     */
    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
        smgrimmedsync(dst, forkNum);

However, the reasoning about unlogged rels is incomplete. Normally, once we
write out a dirty buffer for an unlogged rel, we sync the rel during the next
checkpoint, just as we do for permanent rels. (Syncing only at the shutdown
checkpoint would suffice, but that would require a data structure to track
unlogged relation files dirtied since StartupXLOG().) Due to the code above,
though, we miss the sync entirely when one of these DDL operations is the
last operation changing the unlogged rel before a shutdown checkpoint. I've
attached a test-only patch demonstrating the problem. This can leave unlogged
tables with invalid contents if an OS crash happens while PostgreSQL is
stopped.

I think the fix is simple: test "relpersistence != RELPERSISTENCE_TEMP"
instead.
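
As a sketch of how the quoted test in RelationCopyStorage() might then look
(the copying_initfork clause becomes redundant, since only unlogged rels have
init forks):

    /*
     * Sync unless the rel is temp; only temp data is legitimately gone
     * after a crash.  This now covers unlogged rels as well.
     */
    if (relpersistence != RELPERSISTENCE_TEMP)
        smgrimmedsync(dst, forkNum);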

2. RelationTruncate() needs defense against concurrent checkpoints.

In RelationTruncate(), nothing stops a checkpoint from starting and completing
after XLogInsert(... XLOG_SMGR_TRUNCATE) and before smgrtruncate(). That
checkpoint could move the redo pointer past the XLOG_SMGR_TRUNCATE record,
making the outcome equivalent to having never written the WAL record. If the
OS crashes before the first post-smgrtruncate() checkpoint, the filesystem may
forget some or all ftruncate() calls. The symptom would be unexpected tuples
in the relation. Example test procedure:

BEGIN;
CREATE TABLE t (c) AS SELECT 1;
-- set breakpoint at XLogFlush()
TRUNCATE t; -- hit breakpoint
-- while stopped at breakpoint, issue CHECKPOINT, then release debugger
COMMIT;
-- hard crash forgets some ftruncate()
pg_ctl -w start # REDO does not reissue ftruncate()

The fix is to set delayChkpt before writing WAL and clear it after
smgrtruncate(), like we do in EndPrepare().
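
A rough sketch of that bracketing, assuming the current RelationTruncate()
code (forks/nforks/blocks stand for the locals it already builds; FSM/VM
handling elided):

    /*
     * Hold off checkpoints so the redo pointer cannot advance past the
     * XLOG_SMGR_TRUNCATE record before the ftruncate() calls happen, in
     * the style of EndPrepare().
     */
    MyPgXact->delayChkpt = true;

    lsn = XLogInsert(RM_SMGR_ID,
                     XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
    XLogFlush(lsn);

    /* Do the real work, then allow checkpoints again. */
    smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

    MyPgXact->delayChkpt = false;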

3. smgrtruncate()/mdtruncate() is not qualified to be called in recovery.

smgr_redo() calls smgrtruncate() to replay XLOG_SMGR_TRUNCATE records.
However, mdtruncate() relies on the following md.c invariant:

 *  On disk, a relation must consist of consecutively numbered segment
 *  files in the pattern
 *      -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *      -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *      -- Optionally, any number of inactive segments of size 0 blocks.

That invariant does not hold before reachedConsistency. Suppose the unclean
shutdown happened after the original (non-recovery) mdtruncate() and before
anything synced the truncated segments. The OS might remember the ftruncate()
of segment zero and lose the ftruncate() of segment one, upsetting the
invariant. The symptom would again be unexpected tuples in the relation.

My preferred fix is to make mdtruncate() truncate each inactive segment,
stopping when opening a segment fails with ENOENT, like mdunlinkfork() does.
An alternative would be to add a last_segment_truncated field to
xl_smgr_truncate, then truncate up to that segment.
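
A hypothetical sketch of that extra pass in mdtruncate() (path and
first_inactive_seg are illustrative names only; real code would reuse md.c's
segment-opening machinery and error handling):

    /*
     * After shortening the active segments, keep truncating
     * higher-numbered segment files to zero length until we hit ENOENT,
     * much as mdunlinkfork() keeps unlinking.
     */
    for (segno = first_inactive_seg;; segno++)
    {
        char       *segpath = psprintf("%s.%u", path, segno);

        if (truncate(segpath, 0) < 0)
        {
            if (errno != ENOENT)
                ereport(WARNING,
                        (errcode_for_file_access(),
                         errmsg("could not truncate file \"%s\": %m",
                                segpath)));
            pfree(segpath);
            break;              /* past the last segment file */
        }
        pfree(segpath);
    }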

Does anyone see a way to improve on those proposed fixes?

Thanks,
nm

Attachment: ddl-sync-unlogged-test-v0.patch (text/x-diff, 3.4 KB)
