data corruption hazard in reorderbuffer.c

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: data corruption hazard in reorderbuffer.c
Date: 2021-07-15 20:03:05
Message-ID: 4CBA7034-3E44-48D1-9F73-20080B76E3C2@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hackers,

I believe there is a hazard in reorderbuffer.c if a call to write() buffers data rather than flushing it to disk, only to fail when flushing the data during close(). The relevant code is in ReorderBufferSerializeTXN(), which iterates over changes for a transaction, opening transient files as needed for the changes to be written, writing the transaction changes to the transient files using ReorderBufferSerializeChange(), and closing the files when finished using CloseTransientFile(), the return code from which is ignored. If ReorderBufferSerializeChange() were fsync()ing the writes, then ignoring the failures on close() would likely be ok, or at least in line with what we do elsewhere. But as far as I can see, no call to sync the file is performed.

I expect a failure here could result in a partially written change in the transient file, perhaps preceded or followed by additional complete or partial changes. Perhaps even worse, a failure could result in a change not being written at all (rather than partially), potentially with preceding and following changes written intact, with no indication that one change is missing.

Upon testing, both of these expectations appear to be true. Skipping some writes while not others easily creates a variety of failures, and for brevity I won't post a patch to demonstrate that here. The following code change causes whole rather than partial changes to be written or skipped. After applying this change and running the tests in contrib/test_decoding/, "test toast" shows that an UPDATE command has been silently skipped.

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 7378beb684..a6c60b81c9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -108,6 +108,7 @@
#include "utils/rel.h"
#include "utils/relfilenodemap.h"

+static bool lose_writes_until_close = false;

/* entry for a hash table we use to map from xid to our transaction state */
typedef struct ReorderBufferTXNByIdEnt
@@ -3523,6 +3524,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
Size spilled = 0;
Size size = txn->size;

+ lose_writes_until_close = false;
+
elog(DEBUG2, "spill %u changes in XID %u to disk",
(uint32) txn->nentries_mem, txn->xid);

@@ -3552,7 +3555,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
char path[MAXPGPATH];

if (fd != -1)
+ {
CloseTransientFile(fd);
+ lose_writes_until_close = !lose_writes_until_close;
+ }

XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);

@@ -3600,6 +3606,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

if (fd != -1)
CloseTransientFile(fd);
+
+ lose_writes_until_close = false;
}

/*
@@ -3790,7 +3798,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,

errno = 0;
pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
- if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
+
+ if (lose_writes_until_close)
+ ; /* silently do nothing with buffer contents */
+ else if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
{
int save_errno = errno;

The attached patch catches the failures on close(), but to really fix this properly, we should call pg_fsync() before close().

Any thoughts on this? It seems to add a fair amount of filesystem burden to add all the extra fsync activity, though I admit to having not benchmarked that yet. Perhaps doing something like Thomas's work for commit dee663f7843 where closing files is delayed so that fewer syncs are needed? I'm not sure how much that would help here, and would like feedback before pursuing anything of that sort.

The relevant code exists back as far as the 9_4_STABLE branch, where commit b89e151054a from 2014 first appears.

Attachment Content-Type Size
v1-0001-No-longer-ignoring-failures-on-file-close.patch application/octet-stream 2.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2021-07-15 20:24:17 Re: pg_upgrade does not upgrade pg_stat_statements properly
Previous Message Daniel Gustafsson 2021-07-15 19:46:36 Re: Avoid repeated PQfnumber() in pg_dump