From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Changeset Extraction v7.0 (was logical changeset generation)
Date: 2014-01-23 16:50:57
Message-ID: CA+TgmoZ1DTGKJ6FthQ7vSAiniih2LZ_aL0FM8kCzQNc8d2Gfmg@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I don't think shared buffers fsyncs are the apt comparison. It's more
> something like UpdateControlFile(). Which PANICs.
>
> I really don't get why you fight PANICs in general that much. There are
> some nasty PANICs in postgres which can happen in legitimate situations,
> which should be made to fail more gracefully, but this surely isn't one
> of them. We're doing rename(), unlink() and rmdir(). That's it.
> We should concentrate on the ones that legitimately can happen, not the
> ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> mount -o remount,ro /. We don't increase reliability one bit by adding
> codepaths that will never get tested.
Sorry, I don't buy it. Lots of people I know have stories that go
like this: "$HORRIBLE happened, and PostgreSQL kept on running, and it
didn't even lose my data!", where $HORRIBLE may be variously that the
disk filled up, that disk writes started failing with I/O errors, that
somebody changed the permissions on the data directory inadvertently,
that the entire data directory got removed, and so on. I've been
through some of those scenarios myself, and the care and effort that's
been put into failure modes has saved my bacon more than a few times,
too. We *do* increase reliability by worrying about what will happen
even in code paths that very rarely get exercised. It's certainly
true that our bug count there is higher than for the parts of
our code that get exercised more regularly, but it's also lower than
it would be if we didn't make the effort, and the dividend that we get
from that effort is that we have a well-deserved reputation for
reliability.
I think it's completely unacceptable for the failure of routine
filesystem operations to result in a PANIC. I grant you that we have
some existing cases where that can happen (like UpdateControlFile),
but that doesn't mean we should add more. Right this very minute
there is massive bellyaching on a nearby thread caused by the fact
that a full disk condition while writing WAL can PANIC the server,
while on this thread at the very same time you're arguing that adding
more ways for a full disk to cause PANICs won't inconvenience anyone.
The other thread is right, and your argument here is wrong. We have
been able to - and have taken the time to - fix comparable problems in
other cases, and we should do the same thing here.
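The distinction being argued for can be sketched in a few lines of C. This is a hypothetical illustration of the principle, not PostgreSQL source; the function name `rename_or_report` is invented, and it stands in for the difference between ereport(ERROR), which fails one operation, and PANIC, which takes down the whole server:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative sketch only: treat a failed rename() as a recoverable,
 * per-operation error rather than a process-wide abort -- the moral
 * equivalent of ereport(ERROR) instead of PANIC.
 */
static int
rename_or_report(const char *oldpath, const char *newpath)
{
    if (rename(oldpath, newpath) != 0)
    {
        /*
         * A PANIC-style code path would abort() the whole server here.
         * Instead, report the errno and return failure to the caller,
         * so a single failed filesystem operation fails only the
         * operation that hit it.
         */
        fprintf(stderr, "could not rename \"%s\" to \"%s\": %s\n",
                oldpath, newpath, strerror(errno));
        return -1;
    }
    return 0;
}
```

The point is that the caller gets a chance to clean up and retry, or to fail just one backend's transaction, instead of every connected session being killed by a single full disk or permissions mishap.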
As for why I fight PANICs so much in general, there are two reasons.
First, I believe that to be project policy. I welcome correction if I
have misinterpreted our stance in that area. Second, I have
encountered a few situations where customers had production servers
that repeatedly PANICked due to some bug or other. If I've ever
encountered angrier customers, I can't remember when. A PANIC is no
big deal when it happens on your development box, but when it happens
on a machine with 100 users connected to it, it's a big deal,
especially if a single underlying cause makes it happen over and over
again.
I think we should be devoting our time to figuring out how to improve
this, not whether to improve it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company