Re: Minimal logical decoding on standbys

From: Andres Freund <andres(at)anarazel(dot)de>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Minimal logical decoding on standbys
Date: 2023-03-31 04:33:00
Message-ID: 20230331043300.gux3s5wzrapqi4oe@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-03-30 18:23:41 +0200, Drouvot, Bertrand wrote:
> On 3/30/23 9:04 AM, Andres Freund wrote:
> > I think this commit is ready to go. Unless somebody thinks differently, I
> > think I might push it tomorrow.
>
> Great! Once done, I'll submit a new patch so that GlobalVisTestFor() can make
> use of the heap relation in vacuumRedirectAndPlaceholder() (which will be possible
> once 0001 is committed).

Unfortunately I did find an issue doing a pre-commit review of the patch.

The patch adds VISIBILITYMAP_IS_CATALOG_REL to xl_heap_visible.flags - but it
does not remove the bit before calling visibilitymap_set().

This ends up corrupting the visibility map, because we'll set a bit for
another page.

It's unfortunate that visibilitymap_set() doesn't assert that just the correct
bits are passed in. It does assert that at least one valid bit is set, but
that's not enough, as this case shows.

I noticed this when looking into the changes to visibilitymapdefs.h in more
detail. I don't like how it ends up in the patch:

> --- a/src/include/access/visibilitymapdefs.h
> +++ b/src/include/access/visibilitymapdefs.h
> @@ -17,9 +17,11 @@
> #define BITS_PER_HEAPBLOCK 2
>
> /* Flags for bit map */
> -#define VISIBILITYMAP_ALL_VISIBLE 0x01
> -#define VISIBILITYMAP_ALL_FROZEN 0x02
> -#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap
> - * flags bits */
> +#define VISIBILITYMAP_ALL_VISIBLE 0x01
> +#define VISIBILITYMAP_ALL_FROZEN 0x02
> +#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid visibilitymap
> + * flags bits */
> +#define VISIBILITYMAP_IS_CATALOG_REL 0x04 /* to handle recovery conflict during logical
> + * decoding on standby */

On a casual read, one very well might think that VISIBILITYMAP_IS_CATALOG_REL
is a valid bit that could be set in the VM.

I am thinking of instead creating a separate namespace for the "xlog only"
bits:

/*
* To detect recovery conflicts during logical decoding on a standby, we need
* to know if a table is a user catalog table. For that we add an additional
* bit into xl_heap_visible.flags, in addition to the above.
*
* NB: VISIBILITYMAP_XLOG_* may not be passed to visibilitymap_set().
*/
#define VISIBILITYMAP_XLOG_CATALOG_REL 0x04
#define VISIBILITYMAP_XLOG_VALID_BITS (VISIBILITYMAP_VALID_BITS | VISIBILITYMAP_XLOG_CATALOG_REL)

That allows heap_xlog_visible() to do:

Assert((xlrec->flags & VISIBILITYMAP_XLOG_VALID_BITS) == xlrec->flags);
vmbits = (xlrec->flags & VISIBILITYMAP_VALID_BITS);

and pass vmbits instead of xlrec->flags to visibilitymap_set().

I'm also thinking of splitting the patch into two. One patch to pass down the
heap relation into the new places, and another for the rest. As evidenced
above, looking at the actual behaviour changes is important...

Given how the patch changes the struct for XLOG_GIST_DELETE:

> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> index 2ce9366277..93fb9d438a 100644
> --- a/src/include/access/gistxlog.h
> +++ b/src/include/access/gistxlog.h
> @@ -51,11 +51,14 @@ typedef struct gistxlogDelete
> {
> TransactionId snapshotConflictHorizon;
> uint16 ntodelete; /* number of deleted offsets */
> + bool isCatalogRel; /* to handle recovery conflict during logical
> + * decoding on standby */
>
> - /* TODELETE OFFSET NUMBER ARRAY FOLLOWS */
> + /* TODELETE OFFSET NUMBERS */
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> } gistxlogDelete;
>
> -#define SizeOfGistxlogDelete (offsetof(gistxlogDelete, ntodelete) + sizeof(uint16))
> +#define SizeOfGistxlogDelete offsetof(gistxlogDelete, offsets)

and XLOG_HASH_VACUUM_ONE_PAGE:

> diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
> index 9894ab9afe..6c5535fe73 100644
> --- a/src/include/access/hash_xlog.h
> +++ b/src/include/access/hash_xlog.h
> @@ -252,12 +252,14 @@ typedef struct xl_hash_vacuum_one_page
> {
> TransactionId snapshotConflictHorizon;
> uint16 ntuples;
> + bool isCatalogRel; /* to handle recovery conflict during logical
> + * decoding on standby */
>
> - /* TARGET OFFSET NUMBERS FOLLOW AT THE END */
> + /* TARGET OFFSET NUMBERS */
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> } xl_hash_vacuum_one_page;
>
> -#define SizeOfHashVacuumOnePage \
> - (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(uint16))
> +#define SizeOfHashVacuumOnePage offsetof(xl_hash_vacuum_one_page, offsets)

I don't think the changes are quite sufficient:

for gist:

> @@ -672,11 +668,12 @@ gistXLogUpdate(Buffer buffer,
> */
> XLogRecPtr
> gistXLogDelete(Buffer buffer, OffsetNumber *todelete, int ntodelete,
> - TransactionId snapshotConflictHorizon)
> + TransactionId snapshotConflictHorizon, Relation heaprel)
> {
> gistxlogDelete xlrec;
> XLogRecPtr recptr;
>
> + xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
> xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
> xlrec.ntodelete = ntodelete;

Note that gistXLogDelete() continues to register data with two different
XLogRegisterData() calls. This will append data without any padding:

XLogRegisterData((char *) &xlrec, SizeOfGistxlogDelete);

/*
* We need the target-offsets array whether or not we store the whole
* buffer, to allow us to find the snapshotConflictHorizon on a standby
* server.
*/
XLogRegisterData((char *) todelete, ntodelete * sizeof(OffsetNumber));

But replay now uses the new offset member:

> @@ -177,6 +177,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
> gistxlogDelete *xldata = (gistxlogDelete *) XLogRecGetData(record);
> Buffer buffer;
> Page page;
> + OffsetNumber *toDelete = xldata->offsets;
>
> /*
> * If we have any conflict processing to do, it must happen before we

That doesn't look right. If there's any padding before offsets, we'll afaict
read completely bogus data?

As it turns out, there is padding:

struct gistxlogDelete {
TransactionId snapshotConflictHorizon; /* 0 4 */
uint16 ntodelete; /* 4 2 */
_Bool isCatalogRel; /* 6 1 */

/* XXX 1 byte hole, try to pack */

OffsetNumber offsets[]; /* 8 0 */

/* size: 8, cachelines: 1, members: 4 */
/* sum members: 7, holes: 1, sum holes: 1 */
/* last cacheline: 8 bytes */
};

I am frankly baffled how this works at all; this should just about immediately
crash?

Oh, I see. We apparently don't reach the gist deletion code in the tests:
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174

And indeed, if I add an abort() into , it's not reached.

And it's not because the tests use a temp table; the caller is also unreachable:
https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643

Whut?

And the same issue exists for hash as well.

Logging:
XLogRegisterData((char *) &xlrec, SizeOfHashVacuumOnePage);

/*
* We need the target-offsets array whether or not we store the
* whole buffer, to allow us to find the snapshotConflictHorizon
* on a standby server.
*/
XLogRegisterData((char *) deletable,
ndeletable * sizeof(OffsetNumber));

Redo:

> xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
> + toDelete = xldata->offsets;

And there also are no tests:
https://coverage.postgresql.org/src/backend/access/hash/hashinsert.c.gcov.html#372

I'm not going to commit a nontrivial change to these WAL records without some
minimal tests.

Greetings,

Andres Freund
