| From: | Peter Geoghegan <pg(at)bowt(dot)ie> | 
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> | 
| Cc: | Maxim Orlov <orlovmg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Add 64-bit XIDs into PostgreSQL 15 | 
| Date: | 2022-01-06 20:44:52 | 
| Message-ID: | CAH2-Wzk68iW_z0rb8VxEchQavHLPLPXv_Vkx954B=BmqSrL_mQ@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Tue, Jan 4, 2022 at 9:40 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> Could you tell me what happens if new tuple with XID larger than xid_base + 0xFFFFFFFF is inserted into the page? Such new tuple is not allowed to be inserted into that page?
I fear that this patch will have many bugs along these lines. Example:
Why is it okay that convert_page() may have to defragment a heap page,
without holding a cleanup lock? That will subtly break code that holds
a pin on the buffer, when a tuple slot contains a C pointer to a
HeapTuple in shared memory (though only if we get unlucky).
Currently we are very permissive about what XID/backend can (or
cannot) consume a specific piece of free space from a specific heap
page in code like RelationGetBufferForTuple(). It would be very hard
to enforce a policy like "your XID cannot insert onto this particular
heap page" with the current FSM design. (I actually think that we
should have a FSM that supports these requirements, but that's a big
project -- a long term goal of mine [1].)
> Or xid_base and xids of all existing tuples in the page are increased? Also what happens if one of those xids (of existing tuples) cannot be changed because the tuple still can be seen by very-long-running transaction?
That doesn't work in the general case, I think. How could it, unless
we truly had 64-bit XIDs in heap tuple headers? You can't necessarily
freeze to fix the problem, because we can only freeze XIDs that 1.)
committed, and 2.) are visible to every possible MVCC snapshot. (I
think you were alluding to this problem yourself.)
I believe that a good solution to the problem that this patch tries to
solve needs to be more ambitious. I think that we need to return to
first principles, rather than extending what we have already.
Currently, we store XIDs in tuple headers so that we can determine the
tuple's visibility status, based on whether the XID committed (or
aborted), and where our snapshot sees the XID as "in the past" (in the
case of an inserted tuple's xmin). You could say that XIDs from tuple
headers exist so we can see differences *between* tuples.  But these
differences are typically not useful/interesting for very long. 32-bit
XIDs are sometimes not wide enough, but usually they're "too wide":
Why should we need to consider an old XID (e.g. do clog lookups) at
all, barring extreme cases?
Why do we need to keep any kind of metadata about transactions around
for a long time? Postgres has not supported time travel in 25 years!
If we eagerly cleaned-up aborted transactions with a special kind of
VACUUM (which would remove aborted XIDs), we could also maintain a
structure that indicates if all of the XIDs on a heap page are known
"all committed" implicitly (no dirtying the page, no hint bits, etc)
-- something a little like the visibility map, that is mostly set
implicitly (not during VACUUM). That doesn't fix the wraparound
problem itself, of course. But it enables a design that imposes the
same problem on the specific old snapshot instead -- something like a
"snapshot too old" error is much better than a system-wide wraparound
failure. That approach is definitely very hard, and also requires a
smart FSM along the lines described in [1], but it seems like the best
way forward.
As I pointed out already, freezing is bad because it imposes the
requirement that everybody considers an affected XID committed and
visible, which is brittle (e.g., old snapshots can cause wraparound
failure). More generally, we rely too much on explicitly maintaining
"absolute" metadata inline, when we should implicitly maintain
"relative" metadata (that can be discarded quickly and without concern
for old snapshots). We need to be more disciplined about what XIDs can
modify what heap pages in the first place (in code like hio.c and the
FSM) to make all this work.
[1] https://www.postgresql.org/message-id/CAH2-Wz%3DzEV4y_wxh-A_EvKxeAoCMdquYMHABEh_kZO1rk3a-gw%40mail.gmail.com
--
Peter Geoghegan
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andres Freund | 2022-01-06 20:49:43 | Re: Make relfile tombstone files conditional on WAL level | 
| Previous Message | Bossart, Nathan | 2022-01-06 20:41:30 | Re: Add index scan progress to pg_stat_progress_vacuum |