Re: In-placre persistance change of a relation

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michael(at)paquier(dot)xyz, nathandbossart(at)gmail(dot)com, postgres(at)jeltef(dot)nl, smithpb2250(at)gmail(dot)com, vignesh21(at)gmail(dot)com, jakub(dot)wartak(at)enterprisedb(dot)com, stark(dot)cfm(at)gmail(dot)com, barwick(at)gmail(dot)com, jchampion(at)timescale(dot)com, pryzby(at)telsasoft(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, rjuju123(at)gmail(dot)com, jakub(dot)wartak(at)tomtom(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, bharath(dot)rupireddyforpostgres(at)gmail(dot)com
Subject: Re: In-placre persistance change of a relation
Date: 2024-10-31 21:24:36
Message-ID: 1f201ea8-b1e3-4606-9525-c5817e651cda@iki.fi
Lists: pgsql-hackers

On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> After some delays, here’s the new version. In this update, UNDO logs
> are WAL-logged and processed in memory under most conditions. During
> checkpoints, they’re flushed to files, which are then read when a
> specific XID’s UNDO log is accessed for the first time during
> recovery.
>
> The biggest changes are in patches 0001 through 0004 (equivalent to
> the previous 0001-0002). After that, there aren’t any major
> changes. Since this update involves removing some existing features,
> I’ve split these parts into multiple smaller identity transformations
> to make them clearer.
>
> As for changes beyond that, the main one is lifting the previous
> restriction on PREPARE for transactions after a persistence
> change. This was made possible because, with the shift to in-memory
> processing of UNDO logs, commit-time crash recovery detection is now
> simpler. Additional changes include completely removing the
> abort-handling portion from the pendingDeletes mechanism (0008-0010).

In this patch version, the undo log is kept in dynamic shared memory. It
can grow indefinitely. On a checkpoint, it's flushed to disk.

If I'm reading it correctly, the undo records are kept in the DSA area
even after they're flushed to disk. That's not necessary; the system
never needs to read the undo log unless there's a crash, so there's no
need to keep it in memory after it's been flushed to disk. That's true
today; we could start relying on the undo log to clean up on abort even
when there's no crash, but I think it's a good design to not do that and
rely on backend-private state for non-crash transaction abort.

I'd suggest doing this the other way 'round. Let's treat the on-disk
representation as the primary representation, not the in-memory one.
Let's use a small fixed-size shared memory area just as a write buffer
to hold the dirty undo log entries that haven't been written to disk
yet. Most transactions are short, so most undo log entries never need to
be flushed to disk, but I think it'll be simpler to think of it that
way. On checkpoint, flush all the buffered dirty entries from memory to
disk and clear the buffer. Also do that if the buffer fills up.
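
To make the write-buffer idea concrete, here is a minimal sketch in
plain C. All names (UndoWriteBuffer, undo_wbuf_append and so on) and the
8 kB size are hypothetical and not taken from the patch. It also uses an
ordinary struct and stdio where the real thing would use shared memory,
locking and per-XID files, so treat it as an illustration of the flush
rules only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names and sizes; none of this is taken from the patch. */
#define UNDO_WBUF_SIZE 8192         /* small, fixed-size write buffer */

typedef struct UndoWriteBuffer
{
    size_t  used;                   /* bytes of dirty undo data buffered */
    char    data[UNDO_WBUF_SIZE];
} UndoWriteBuffer;

/* Write all buffered bytes to the undo file and clear the buffer. */
int
undo_wbuf_flush(UndoWriteBuffer *buf, FILE *fp)
{
    if (buf->used > 0)
    {
        if (fwrite(buf->data, 1, buf->used, fp) != buf->used)
            return -1;
        if (fflush(fp) != 0)
            return -1;
        buf->used = 0;
    }
    return 0;
}

/* Append one undo record; flush first if it would not fit. */
int
undo_wbuf_append(UndoWriteBuffer *buf, FILE *fp, const void *rec, size_t len)
{
    assert(len <= UNDO_WBUF_SIZE);
    if (buf->used + len > UNDO_WBUF_SIZE && undo_wbuf_flush(buf, fp) != 0)
        return -1;
    memcpy(buf->data + buf->used, rec, len);
    buf->used += len;
    return 0;
}

/* Checkpoint: everything still buffered goes to disk. */
int
undo_wbuf_checkpoint(UndoWriteBuffer *buf, FILE *fp)
{
    return undo_wbuf_flush(buf, fp);
}
```

The sketch leaves out the step that makes the buffer cheap in the common
case: dropping the entries of a transaction that finishes before they
are ever flushed.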

A high-level overview comment describing the on-disk format would be nice. If I
understand correctly, there's a magic constant at the beginning of each
undo file, followed by UndoLogRecords. There are no other file headers
and no page structure within the file.

That format seems reasonable. For cross-checking, maybe add the XID to
the file header too. There is a separate CRC value on each record, which
is nice, but not strictly necessary since the writes to the UNDO log are
WAL-logged. The WAL needs CRCs on each record to detect the end of log,
but the UNDO log doesn't need that. Anyway, it's fine.
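
For illustration, the layout as I understand it, plus the XID in the
header, could look roughly like the sketch below. The struct names, the
field order, the magic value, and the choice to have the CRC cover only
the payload are all my assumptions, not the patch's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define UNDO_FILE_MAGIC 0x554E444Fu /* hypothetical value, "UNDO" in ASCII */

typedef struct UndoFileHeader
{
    uint32_t magic;                 /* identifies the file as an undo log */
    uint32_t xid;                   /* suggested addition, for cross-checking */
} UndoFileHeader;

typedef struct UndoRecordHeader
{
    uint32_t crc;                   /* CRC-32C of the payload (assumption) */
    uint32_t len;                   /* payload length in bytes */
} UndoRecordHeader;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
undo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Start a new undo file: magic constant plus owning XID. */
int
undo_file_begin(FILE *fp, uint32_t xid)
{
    UndoFileHeader hdr = { UNDO_FILE_MAGIC, xid };

    return fwrite(&hdr, sizeof(hdr), 1, fp) == 1 ? 0 : -1;
}

/* Append one record right after the previous one; no page structure. */
int
undo_file_append(FILE *fp, const void *payload, uint32_t len)
{
    UndoRecordHeader rh = { undo_crc32c(payload, len), len };

    if (fwrite(&rh, sizeof(rh), 1, fp) != 1)
        return -1;
    return (len == 0 || fwrite(payload, 1, len, fp) == len) ? 0 : -1;
}

/* Verify the magic constant and the XID at the start of the file. */
int
undo_file_check(FILE *fp, uint32_t expected_xid)
{
    UndoFileHeader hdr;

    if (fread(&hdr, sizeof(hdr), 1, fp) != 1)
        return -1;
    return (hdr.magic == UNDO_FILE_MAGIC && hdr.xid == expected_xid) ? 0 : -1;
}

/* Read the next record; returns its length, or -1 on EOF or corruption. */
long
undo_file_next(FILE *fp, void *payload, uint32_t maxlen)
{
    UndoRecordHeader rh;

    assert(payload != NULL);
    if (fread(&rh, sizeof(rh), 1, fp) != 1 || rh.len > maxlen)
        return -1;
    if (rh.len > 0 && fread(payload, 1, rh.len, fp) != rh.len)
        return -1;
    return undo_crc32c(payload, rh.len) == rh.crc ? (long) rh.len : -1;
}
```

With the XID in the header, a reader that opens the wrong file fails in
undo_file_check() instead of replaying another transaction's records.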

I somehow dislike the file-per-subxid design. I'm sure it works; it just
doesn't feel right to me. I'm somewhat worried
about ending up with lots of files, if you e.g. use temporary tables
with subtransactions heavily. Could we have just one file per top-level
XID? I guess that can become a problem too, if you have a lot of aborted
subtransactions. The UNDO records for the aborted subtransactions would
bloat the undo file. But maybe that's nevertheless better?

--
Heikki Linnakangas
Neon (https://neon.tech)
