Re: basebackup checksum verification

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, David Steele <david(at)pgmasters(dot)net>
Subject: Re: basebackup checksum verification
Date: 2019-03-27 01:01:27
Message-ID: 20190327010126.GG6197@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Andres Freund (andres(at)anarazel(dot)de) wrote:
> On 2019-03-26 20:18:31 -0400, Stephen Frost wrote:
> > > >>I thought Robert's response was generally good, pointing out that
> > > >>we're talking about this being an issue if the corruption happens in a
> > > >>certain set of bytes. That said, I'm happy to see improvements in
> > > >>this area but I'm flat out upset about the notion that we must be
> > > >>perfect here- our checksums themselves aren't perfect for catching
> > > >>corruption either.
> > > >
> > > >The point is that we're not detecting errors that we can detect when
> > > >read outside of basebackup. I really entirely completely fail how that
> > > >can be defended.
> > > >
> > > >I think we're making promises with this the basebackup feature we're not
> > > >even remotely keeping. I don't understand how you can defend that, given
> > > >the current state, you can have a basebackup that you took with
> > > >checksums enabled, and then when actually use that basebackup you get
> > > >checksum failures. Like it's one thing not to detect all storage
> > > >issues, but if we do detect them after using the basebackup, that's
> > > >really not ok.
> > >
> > > Yeah, if basebackup completes without reporting any invalid checksums, but
> > > running pg_verify_checksums on the same backups detects those, that probably
> > > should raise some eyebrows.
> >
> > That isn't actually what would happen at this point, just so we're
> > clear. What Andres is talking about is a solution which would only
> > actually work for pg_basebackup, and not for pg_verify_checksums
> > (without some serious changes which make it connect to the running
> > server and run various functions to perform the locking that he's
> > proposing pg_basebackup do...).
>
> Well, I still think it's just plain wrong to do online checksum
> verification outside of the server, and we should just reject adding
> that as a feature.

I get that, and I disagree with it.

> Besides the fact that I think having at precisely equal or more error
> detection capabilities than the backend, I think all the LSN based
> approaches also have the issue that they'll prevent us from using them
> on non WAL logged data. There's ongoing work to move SLRUs into the
> backend allowing them to be checksummed (Shawn Debnath is IIRC planning
> to propose a patch for v13), and we also really should offer to also
> checksum unlogged tables (and temp tables?) - just because they'd be
> gone after a crash, imo doesn't make it OK to not detect corrupted on
> disk data outside of a crash. For those things we won't necessarily
> have LSNs that we can conveniently can associate with those buffers -
> making LSN based logic harder.

I'm kinda guessing that the SLRUs are still going to be WAL'd. If
that's changing, I'd be very curious to hear the details.

As for unlogged table and temp tables, it's an interesting idea to
checksum them but far less valuable by the very nature of what those are
used for. Futher, it's utterly useless to checksum as part of backup,
like what pg_basebackup is actually doing, and is actually valuable to
skip over them rather than back them up, since otherwise we'd back them
up, and then restore them ... and then remove them immediately during
recovery from the backup state. Having a way to verify checksum on
unlogged tables or temp tables using some *other* tool or background
process could be valuable, provided it doesn't cause any issues for
ongoing operations.

> > I outlined a couple of other approaches to improving that situation,
> > which would be able to be used with pg_verify_checksums without having
> > to connect to the backend, but I'll note that those were completely
> > ignored, leading me to believe that there's really not much more to
> > discuss here since other ideas are just not open to being considered.
>
> Well, given that we can do an accurate determination without too much
> code in the basebackup case, I don't see what your proposals gain over
> that? That's why I didn't comment on them. I'm focusing on the
> basebackup case, over the online checksum case, because it's released
> code.

I'm fine with improving the basebackup case, but I don't agree at all
with the idea that all checksum validation must be exclusively done in a
backend process.

I'm also not convinced that these changes to pg_basebackup will be free
of issues that may impact users in a negative way, making me concerned
that we're going to end up doing more harm than good with such a change
being back-patched. Simply comparing the skipped LSNs to the
end-of-backup LSN seems much less invasive when it comes to this core
code, and certainly increases the chances quite a bit that we'll detect
an issue with corruption in the LSN.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-03-27 01:04:58 Re: basebackup checksum verification
Previous Message Inoue, Hiroshi 2019-03-27 01:01:08 Re: setLastTid() and currtid()