Re: Multi-xacts and our process problem

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-xacts and our process problem
Date: 2015-05-12 17:36:05
Message-ID: CAM3SWZS1HmWad3D3LLcBuVehc2B59Lb_9eNFZM0CGOVBj8aspQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 12, 2015 at 6:00 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think that's rather facile, and I really don't see how you would
> know that from looking at those release notes. I thought multixacts
> had risk, but obviously nobody came close to predicting how bad things
> were going to be. If they had, I'm pretty sure we would have pulled
> the patch. The fact that the 9.4 btree changes weren't equally
> destabilizing doesn't mean that they weren't risky. There was a risk
> that the Cuban missile crisis would start a nuclear war; in the end,
> it didn't, but that doesn't mean there was no risk.

I think you go on to make my argument for me, here. The fklocks patch
was particularly big and complicated, and slipped 9.2, and everyone
was more or less obligated to use it with their existing application.
It was not difficult to imagine that it was *the* highest risk item.
That wasn't a particularly useful observation at that point - I don't
think it made anyone very introspective about MultiXacts. My point, of
course, is that it was a concern about relative risk, as opposed to
absolute risk, and there's not that much you can do with that -
something has to be #1.

> Part of what went wrong with multixacts is neither Alvaro nor anyone
> who reviewed the patch gave adequate thought to the vacuum
> requirements. There was a whole series of things that needed to be
> done there which just weren't done. I think if it had been realized
> how much work remained to do there, and how necessary it was for every
> single bit of machinery that we have for freezing xmin to also exist
> for freezing xmax, we would not have gone forward. Conceptual
> failures, where there is a whole class of work that you just don't
> even realize needs to be done, are much more damaging than mechanical
> errors, where you realize that something needs to be done but you
> don't do it correctly.

I agree, but no one really knew this at the time. Despite this,
everyone still would have identified fklocks as the highest risk item,
and indeed, some actually did. It's relatively easy to say that
something is the highest risk item in an anonymous poll. That's what
makes it easy to not take it seriously.

> Another crucial difference between the multixact patch and many other
> patches is that it wasn't a feature you could turn off. For example,
> if BRIN has bugs, you can almost certainly avoid hitting them by not
> using BRIN. And many people won't, so even if the feature turns out
> to be horrifically buggy, 90%+ of our users will not even notice.
> ALTER TABLE .. SET LOGGED/UNLOGGED may easily have bugs that eat your
> data, but if you don't use it, then you won't be affected. Of the
> major user-visible features committed to 9.5 that could hose our users
> more broadly, I'd put RLS and UPSERT pretty high on the list. We
> might be lucky enough that any breakage there is confined to users of
> those features, but the code is not as contained as it is for
> something like BRIN, so there is a risk of breaking other stuff.

I think that the chances of UPSERT seriously affecting those that
don't use it are extremely low. For those that use the feature, we
haven't repeated the mistakes of Multixacts: the on-disk
representation of tuples that are committed is always identical to the
historic representation of ordinary tuples, because speculative
insertions are explicitly "confirmed". VACUUM does not need to care.

> Departing from what's user-visible, Heikki's WAL format changes could
> break recovery badly for everyone and we could just be screwed. That
> risk is particularly acute because we really can't change the WAL
> format once the release is shipped. If it's broken, we're probably in
> big trouble. Multixacts, too, fell into this category of things that
> cannot be turned off: they touched the heap storage format, and anyone
> who used foreign keys (which is nearly everyone) really had no choice
> but to use them.

It seems like you're just saying that because it's a complicated patch
that touches the WAL format. It's not a specific concern, and it's not
a concern about a systematic defect or "conceptual failure", as you
put it. That makes it of limited value - you can't hold up progress
because of a very vague concern like that.

> All of these things combined in an explosive fashion. If the patch
> had been simple enough to be broadly understandable, or if it had been
> something that could plausibly have come with an "off" switch, or if
> anyone had realized that there were whole areas that had not been
> thought through carefully, the consequences would have been much less
> serious.

Agreed.

--
Peter Geoghegan
