Re: Multixid hindsight design

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: hlinnaka <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Multixid hindsight design
Date: 2015-05-12 15:14:23
Message-ID: CA+TgmobH9Pi3BR+2qyGm55QR+D_JeJDoNSE6KFyhqHFiDG_bhA@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 11, 2015 at 5:20 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> The main problem with the infamous multixid changes was that they made
> pg_multixact a permanent, critical piece of data. Without it, you cannot
> decipher whether some rows have been deleted or not. The 9.3 changes
> uncovered pre-existing issues with vacuuming and wraparound, but the fact
> that multixids are now critical turned those otherwise relatively
> harmless bugs into data loss.

Agreed.

> We have pg_clog, which is a similar critical data structure. That's a pain
> too - you need VACUUM and you can't easily move tables from one cluster to
> another for example - but we've learned to live with it. But we certainly
> don't need any more such data structures.

Yes.

> So the lesson here is that having a permanent pg_multixact is not nice, and
> we should get rid of it. Here's how to do that:
>
> Looking at the tuple header, the CID and CTID fields are only needed when
> either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is
> required even after xmax has committed, but since it's a HOT update, the new
> tuple is always on the same page so you only need the offsetnumber part.
> That leaves us with 8 bytes that are always available for storing
> "ephemeral" information. By ephemeral, I mean that it is only needed when
> xmin or xmax is in-progress. After that, e.g. after a shutdown, it's never
> looked at.
>
> Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by
> a 64-bit pointer, which means that it never wraps around. That 64-bit
> pointer is stored in the tuple header, in those 8 ephemeral bytes currently
> used for CID and CTID. Whenever a tuple is deleted/updated and locked at the
> same time, a TED entry is created for it, in the new SLRU, and the pointer
> to the entry is put on the tuple. In the TED entry, we can use as many bytes
> as we need to store the ephemeral data. It would include the CID (or
> possibly both CMIN and CMAX separately, now that we have the space), CTID,
> and the locking XIDs. The list of locking XIDs could be stored there
> directly, replacing multixids completely, or we could store a multixid
> there, and use the current pg_multixact system to decode them. Or we could
> store the multixact offset in the TED, replacing the multixact offset SLRU,
> but keep the multixact member SLRU as is.
>
> The XMAX stored on the tuple header would always be a real transaction ID,
> not a multixid. Hence locked-only tuples don't need to be frozen afterwards.
>
> The beauty of this would be that the TED entries can be zapped at restart,
> just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
> WAL-logged, and we are free to change its on-disk layout even in a minor
> release.
>
> Further optimizations are possible. If the TED entry fits in 8 bytes, it can
> be stored directly in the tuple header. Like today, if a tuple is locked but
> not deleted/updated, you only need to store the locker XID, and you can
> store the locking XID directly on the tuple. Or if it's deleted and locked,
> CTID is not needed, only CID and locker XID, so you can store those directly
> on the tuple. Plus some spare bits to indicate what is stored. And if the
> XMIN is older than global-xmin, you could also steal the XMIN field for
> storing TED data, making it possible to store 12 bytes directly in the tuple
> header. Plus some spare bits again to indicate that you've done that.
>
> Now, given where we are, how do we get there? Upgrade is a pain, because
> even if we no longer generate any new multixids, we'll have to be able to
> decode them after pg_upgrade. Perhaps condense pg_multixact into a simpler
> pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but
> it would nevertheless be a fair amount of code just to deal with pg_upgraded
> databases.
>
> I think this is worth doing, even after we've fixed all the acute multixid
> bugs, because this would be more robust in the long run. It would also
> remove the need to do anti-wraparound multixid vacuums, and the newly-added
> tuning knobs related to that.
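
To make the quoted proposal concrete, here is a minimal C sketch of how
the 8 ephemeral header bytes might hold either inline data or a 64-bit
TED pointer. Nothing in it comes from the PostgreSQL source or from any
patch in this thread: the type and function names (EphemeralField,
TedStore, mark_deleted_and_locked, and so on) are hypothetical, the
"spare bits" are modelled as an enum, and the SLRU is replaced by a
small in-memory ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

/* Hypothetical tag standing in for the "spare bits" that say what is stored. */
typedef enum EphemeralKind {
    EPH_CID_CTID,          /* today's layout: CID plus CTID */
    EPH_INLINE_CID_LOCKER, /* deleted and locked: CID + one locker XID inline */
    EPH_TED_POINTER        /* 64-bit pointer to a TED entry */
} EphemeralKind;

/* Model of the 8 ephemeral bytes in the tuple header. */
typedef struct EphemeralField {
    EphemeralKind kind;
    union {
        struct { CommandId cid; TransactionId locker; } inl; /* 4 + 4 bytes */
        uint64_t ted_ptr;                                    /* 8 bytes */
    } u;
} EphemeralField;

#define TED_MAX_LOCKERS 8  /* toy limit; a real entry could be variable-length */
#define TED_CAPACITY    64 /* toy ring size standing in for SLRU pages */

typedef struct TedEntry {
    CommandId     cid;
    int           nlockers;
    TransactionId lockers[TED_MAX_LOCKERS];
} TedEntry;

typedef struct TedStore {
    uint64_t next;   /* next pointer to hand out; 64 bits, so no wraparound */
    uint64_t oldest; /* pointers below this were zapped at restart */
    TedEntry entries[TED_CAPACITY];
} TedStore;

/* Append an entry and return its 64-bit address (toy: truncates long lists). */
uint64_t ted_append(TedStore *ted, CommandId cid,
                    const TransactionId *lockers, int nlockers)
{
    TedEntry *e = &ted->entries[ted->next % TED_CAPACITY];

    e->cid = cid;
    e->nlockers = nlockers < TED_MAX_LOCKERS ? nlockers : TED_MAX_LOCKERS;
    for (int i = 0; i < e->nlockers; i++)
        e->lockers[i] = lockers[i];
    return ted->next++;
}

/* Return the entry, or NULL if it was zapped, overwritten, or never created. */
const TedEntry *ted_lookup(const TedStore *ted, uint64_t ptr)
{
    if (ptr < ted->oldest || ptr >= ted->next ||
        ted->next - ptr > TED_CAPACITY)
        return NULL;
    return &ted->entries[ptr % TED_CAPACITY];
}

/* Nothing is WAL-logged, so a restart simply forgets every entry. */
void ted_zap_at_restart(TedStore *ted)
{
    ted->oldest = ted->next;
}

/* Delete-and-lock: store inline when it fits in 8 bytes, else use the TED. */
EphemeralField mark_deleted_and_locked(TedStore *ted, CommandId cid,
                                       const TransactionId *lockers,
                                       int nlockers)
{
    EphemeralField f;

    if (nlockers == 1) {
        f.kind = EPH_INLINE_CID_LOCKER;
        f.u.inl.cid = cid;
        f.u.inl.locker = lockers[0];
    } else {
        f.kind = EPH_TED_POINTER;
        f.u.ted_ptr = ted_append(ted, cid, lockers, nlockers);
    }
    return f;
}
```

With one locker, the sketch keeps everything in the header and never
touches the store; with several lockers it allocates a TED entry, and
after ted_zap_at_restart() the old pointer no longer resolves. That is
harmless under the proposal only because the data is needed solely
while xmin or xmax is still in progress.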

One danger is that in rearranging all of this stuff we may introduce
lots of new bugs. I do agree that requiring multixacts to survive a
server crash was not a good idea. We liked freezing xmin so much, we
decided to freeze xmax, too? Ugh. As painful as that's been,
though, we're 18 months into it at this point. If we do another
reorganization, are we going to end up back at month 0, where pretty
much everybody had corruption all the time rather than only some
people on some workloads? Maybe not, but it's certainly something to
worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
