Re: Add 64-bit XIDs into PostgreSQL 15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
Cc: Ilya Anfimov <ilan(at)tzirechnoy(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add 64-bit XIDs into PostgreSQL 15
Date: 2022-01-28 22:43:07
Message-ID: 20220128224307.f2h3aebujskzjwcl@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction IDs (XID)
> +============================
> +
> +Only a limited number (N = 2^32) of XIDs is available, so vacuum freeze is
> +required to prevent wraparound every N/2 transactions. This causes performance
> +degradation due to the need to exclusively lock tables while they are being
> +vacuumed. In each wraparound cycle, SLRU buffers are also truncated.

What exclusive lock?
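For context, the N/2 limit quoted above comes from PostgreSQL's circular 32-bit XID comparison: "a precedes b" is defined modulo 2^32, so at most ~2^31 XIDs can be "in the past" at once. A minimal sketch of that comparison (modeled on TransactionIdPrecedes(), simplified to ignore permanent XIDs):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Circular comparison of normal 32-bit XIDs: compute the difference
 * modulo 2^32 and look at its sign.  Any XID more than 2^31 behind
 * "wraps around" and would suddenly look like it is in the future,
 * which is why freezing is needed before N/2 transactions pass.
 */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	int32_t		diff = (int32_t) (a - b);

	return diff < 0;
}
```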

> +"Double XMAX" page format
> +-------------------------
> +
> +On the first read of a heap page after pg_upgrade from a 32-bit XID PostgreSQL
> +version, a pd_special area with a size of 16 bytes should be added to the page.
> +However, a page may not have enough free space for this; in that case it is
> +converted to a temporary format called "double XMAX".
>
> +All tuples after pg_upgrade would necessarily have xmin = FrozenTransactionId.

Why would a tuple after pg_upgrade necessarily have xmin =
FrozenTransactionId? A pg_upgrade doesn't scan the tables, so the pg_upgrade
itself doesn't do anything to xmins.

I guess you mean that the xmin cannot be needed anymore, because no older
transaction can be running?

> +In-memory tuple format
> +----------------------
> +
> +The in-memory tuple representation consists of two parts:
> +- HeapTupleHeader from the disk page (contains the entire heap tuple contents,
> +not only the header)
> +- HeapTuple with additional in-memory fields
> +
> +For each tuple in memory, HeapTuple stores t_xid_base/t_multi_base - copies of
> +the page's pd_xid_base/pd_multi_base. Together with the tuple's 32-bit t_xmin
> +and t_xmax from HeapTupleHeader, they are used to calculate the actual 64-bit
> +XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base. (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base. (4)

What identifies a HeapTuple as having this additional data?
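For reference, the arithmetic in formulas (3) and (4) above amounts to a base-plus-offset reconstruction. A sketch of that computation (the type and function names here are illustrative, not the patch's actual identifiers):

```c
#include <stdint.h>

typedef uint32_t TransactionId;		/* on-disk 32-bit xid offset */
typedef uint64_t TransactionId64;	/* hypothetical widened xid type */

/*
 * Per formulas (3)/(4): the tuple header keeps a 32-bit value relative
 * to a page-level base (pd_xid_base / pd_multi_base, copied into the
 * in-memory HeapTuple as t_xid_base / t_multi_base), and the full
 * 64-bit XID is recovered by adding the base back on access.
 */
static TransactionId64
xid_from_base(TransactionId64 xid_base, TransactionId t_xid)
{
	return xid_base + (TransactionId64) t_xid;
}
```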

> +The downside of this is that we cannot use a tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore could be updated by any other
> +backend.

Ugh, that's not great.
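The hazard being described is a torn read: if t_xmin points into a shared buffer, two successive reads can observe two different values. The conventional defense is to copy the field into a local variable once and derive everything from that single snapshot. A hypothetical sketch (not code from the patch):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative only: t_xmin may point into a shared buffer that another
 * backend can update concurrently, so read it exactly once into a local
 * and compute the 64-bit value from that copy, never re-reading the
 * shared field mid-computation.
 */
static uint64_t
tuple_xmin64(const volatile TransactionId *t_xmin, uint64_t xid_base)
{
	TransactionId xmin = *t_xmin;	/* single read; use only this copy */

	return xid_base + (uint64_t) xmin;
}
```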

> +Upgrade from 32-bit XID versions
> +--------------------------------
> +
> +pg_upgrade doesn't change the page format itself; conversion is done lazily
> +afterwards.
> +
> +1. On the first read of a heap page, tuples on the page are repacked to free
> +16 bytes at the end of the page, possibly reclaiming space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this
practically making pg_upgrade useless? Imagine a huge cluster where most of
the pages are all-frozen, upgraded using link mode.

What happens if the first access happens on a replica?

What is the approach for dealing with multixact files? They have xids
embedded? And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.

> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> +	PageHeader	hdr = (PageHeader) page;
> +	GenericXLogState *state = NULL;
> +	Page		tmp_page = page;
> +	uint16		checksum;
> +
> +	if (!rel)
> +		return;
> +
> +	/* Verify checksum */
> +	if (hdr->pd_checksum)
> +	{
> +		checksum = pg_checksum_page((char *) page, blkno);
> +		if (checksum != hdr->pd_checksum)
> +			ereport(ERROR,
> +					(errcode(ERRCODE_INDEX_CORRUPTED),
> +					 errmsg("page verification failed, calculated checksum %u but expected %u",
> +							checksum, hdr->pd_checksum)));
> +	}
> +
> +	/* Start xlog record */
> +	if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> +	{
> +		state = GenericXLogStart(rel);
> +		tmp_page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);
> +	}
> +
> +	PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +							  PG_PAGE_LAYOUT_VERSION);
> +
> +	if (was_32bit_xid(hdr))
> +	{
> +		switch (rel->rd_rel->relkind)
> +		{
> +			case 'r':
> +			case 'p':
> +			case 't':
> +			case 'm':
> +				convert_heap(rel, tmp_page, buf, blkno);
> +				break;
> +			case 'i':
> +				/* no need to convert index */
> +			case 'S':
> +				/* no real need to convert sequences */
> +				break;
> +			default:
> +				elog(ERROR,
> +					 "Conversion for relkind '%c' is not implemented",
> +					 rel->rd_rel->relkind);
> +		}
> +	}
> +
> +	/*
> +	 * Mark buffer dirty unless this is a read-only transaction (e.g. query
> +	 * is running on hot standby instance)
> +	 */
> +	if (!XactReadOnly)
> +	{
> +		/* Finish xlog record */
> +		if (XLogIsNeeded() && RelationNeedsWAL(rel))
> +		{
> +			Assert(state != NULL);
> +			GenericXLogFinish(state);
> +		}
> +
> +		MarkBufferDirty(buf);
> +	}
> +
> +	hdr = (PageHeader) page;
> +	hdr->pd_checksum = pg_checksum_page((char *) page, blkno);
> +}

Wait. So you just modify the page without WAL logging or marking it dirty on a
standby? I fail to see how that can be correct.

Imagine the cluster is promoted, the page is dirtied, and we write it
out. You'll have written out a completely changed page, without any WAL
logging. There's plenty other scenarios.

Greetings,

Andres Freund
