Re: Offline enabling/disabling of data checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Sergei Kornilov <sk(at)zsrv(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-19 16:13:17
Message-ID: 20190319161317.utyo6pnkxznxu4bu@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-03-19 17:08:17 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> > On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > > +/*
> > > > > + * Locations of persistent and temporary control files. The control
> > > > > + * file gets renamed into a temporary location when enabling checksums
> > > > > + * to prevent a parallel startup of Postgres.
> > > > > + */
> > > > > +#define CONTROL_FILE_PATH "global/pg_control"
> > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH ".pg_checksums_in_progress"
> > > >
> > > > I think this should be outright rejected. Again, you're making the
> > > > control file into something it isn't. And there's no buyin for this as
> > > > far as I can tell outside of Fabien and you. For crying out loud, if the
> > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > >
> > > The cluster is supposed to be offline during this. This is just an
> > > additional precaution so that nobody starts it during the operation -
> > > similar to how pg_upgrade disables the old data directory.
> >
> > I don't see how that matters. Afterwards the cluster needs low level
> > surgery to be recovered. That's a) undocumented b) likely to be done
> > wrongly. This is completely unacceptable *AND UNNECESSARY*.
>
> Can you explain why low level surgery is needed and how that would look
> like?
>
> If pg_checksums successfully enables checksums, it will move back the
> control file and update the checksum version - the cluster is ready to
> be started again unless I am missing something?
>
> If pg_checksums is interrupted by the admin, it will move back the
> control file and the cluster is ready to be started again as well.
>
> If pg_checksums aborts with a failure, the admin will have to move back
> the control file before starting up the instance again, but I don't
> think that counts?

That absolutely counts. Even a short period would imo be unacceptable,
but this will take *hours* in many clusters. It's completely possible
that the machine crashes while the enabling is in progress.

And after restarting postgres or even pg_checksums, you'll just get a
message that there's no control file. How on earth is a normal user
supposed to recover from that? Now, you could have a check for the
control file under the temporary name, and emit a hint about renaming,
but that has its own angers (like people renaming it just to start
postgres).

And you're basically adding it because Fabien doesn't like
postmaster.pid and wants to invent another lockout mechanism in this
thread.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2019-03-19 16:30:16 Re: Offline enabling/disabling of data checksums
Previous Message Michael Banck 2019-03-19 16:08:17 Re: Offline enabling/disabling of data checksums