Re: [CORE] Restore-reliability mode

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pgsql-core <pgsql-core(at)postgresql(dot)org>
Subject: Re: [CORE] Restore-reliability mode
Date: 2015-06-06 03:07:08
Message-ID: 20150606030708.GA31950@momjian.us
Lists: pgsql-hackers

On Fri, Jun 5, 2015 at 04:54:56PM +0100, Simon Riggs wrote:
> On 5 June 2015 at 16:05, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
>
> > Please address some of the specific issues I mentioned.
>
>
> I can discuss them but not because I am involved directly. I take
> responsibility as a committer and have an interest from that perspective.
>
> In my role at 2ndQuadrant, I approved all of the time Alvaro and Andres spent
> on submitting, reviewing and fixing bugs - at this point that has cost
> something close to fifty thousand dollars just on this feature and subsequent
> actions. (I believe the feature was originally funded, but we never saw a penny
> of that, though others did.)

Yes, the burden has fallen heavily on Alvaro. I personally am concerned
that many people were focusing on 9.5 rather than helping him. I think
that was a mistake on our part and we need to take reliability problems
more seriously.

What has also concerned me is that there are so many 9.3/9.4 bugs in
this area that few of us can even understand what was fixed when, and we
are then having problems figuring out what bugs were present when
analyzing bug reports. pg_upgrade has made this worse by allowing
multi-xact bugs to propagate across major versions, and pg_upgrade had
some multi-xact bugs of its own in early 9.3 releases. :-(

> > The problem
> > with the multi-xact case is that we just kept fixing bugs as people
> > found them, and did not do a holistic review of the code.
>
>
> I observed much discussion and review. The bugs we've had have all been fairly
> straightforwardly fixed. There haven't been any design-level oversights or
> head-palm moments. It's complex software that had complex behaviour that caused
> problems. The problem has been that anything on-disk causes more problems when
> errors occur. We should review carefully anything that alters the way on-disk
> structures work, like the WAL changes, UPSERT's new mechanism, etc.

Agreed. However, I think a thorough review early on could have caught
many of these bugs before they were reported by users. As proof, even
in the past few weeks, review is finding bugs before they are found by
users.

> From my side, it is only recently I got some clear answers to my questions
> about how it worked. I think it is very important that major features have
> extensive README type documentation with them so the underlying principles used
> in the development are clear. I would define the measure of a good feature as
> whether another committer can read the code comments and get a good feel. A bad
> feature is one where committers walk away from it, saying I don't really get it
> and I can't read an explanation of why it does that. Tom's most significant
> contribution is his long descriptive comments on what the problem is that needs
> to be solved, the options and the method chosen. Clarity of thought is what
> solves bugs.

Yes, I think we should have done that early on for multi-xact, and I am
hopeful we will learn to do that more often when complex features are
implemented, or when we identify areas that are more complex than we
thought.

> Overall, I don't see the need to stop the normal release process and do a
> holistic review. But I do think we should check each feature to see whether it
> is fully documented or whether we are simply trusting one of us to be around to
> fix it.

Agreed. We just need to be honest that we are doing what we need for
reliability and not allow schedule and feature pressure to cause us to
skimp in this area.

> > I am just saying we need to ask the
> > reliability question _first_.
>
>
> Agreed
>
>
> > Let me restate something that has appeared in many replies to my ideas
> > --- I am not asking for infinite or unbounded review, but I am asking
> > that we make sure reliability gets the proper focus in relation to our
> > time pressures.  Our balance was so off a month ago that I feel only a
> > full stop on time pressure would allow us to refocus because people are
> > not good at focusing on multiple things. It is sometimes necessary to
> > stop everything to get people's attention, and to help them remember
> > that without reliability, a database is useless.
>
>
> Here, I think we are talking about different types of reliability. PostgreSQL
> software is well ahead of most industry measures of quality; these recent bugs
> have done nothing to damage that, other than a few people woke up and said
> "Wow! Postgres had a bug??!?!?". The presence of bugs is common and if we have
> grown unused to them, we should be wary of that, though not tolerant.

In going over the 9.5 commits, I was struck by a high volume of cleanups
and fixes, which is good.

> PostgreSQL is now reliable in the sense that we have many features that ensure
> availability even in the face of software problems and bug induced corruption.
> Those have helped us get out of the current situations, giving users a
> workaround while bugs are fixed. So the impact of database software bugs is not
> what it once was.

Uh, yes, we have avoided the worst of the impact from these bugs. In my
understanding, each bug has X% chance of being serious, and you might go
for a long time before a serious bug is created, but the more bugs we
have, the more likely that one will be serious. The _volume_ of multi-xact
bugs should have triggered a review much sooner.

People think I want to stop feature development to review. What I am
saying is that we need to stop development so we can be honest about
whether we need review, and where. It is hard to be honest when time
and feature pressure are on you. It shouldn't take long to make that
decision as a group.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
