Re: Online verification of checksums

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Online verification of checksums
Date:	2018-09-17 22:01:33
Message-ID:	20180917220133.GC4184@tamriel.snowman.net
Lists:	pgsql-hackers

Greetings,

* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> > <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
> > Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> >
> > How? Either it’s possible for the latter half of a page to be updated
> > before the first half (where the LSN lives), or it isn’t. If it’s
> > possible then that LSN could be ancient and it wouldn’t matter.
>
> I'm not sure I understand what you're saying here.
>
> It is not about the latter page to be updated before the first half. I
> don't think that's quite possible, because write() into page cache does
> in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

> The problem is that the write is not atomic, and AFAIK it happens in
> sectors (which are either 512B or 4K these days). And it may arbitrarily
> interleave with reads.

Yes, of course the write isn't atomic, that's clear.

> So you may do write(8k), but it actually happens in 512B chunks and a
> concurrent read may observe some mix of those.

Right. I'm not sure that we really need to worry about sub-4K writes,
though I suppose they're technically possible. It doesn't much matter
in this case anyway, since the LSN sits at the very start of the page.
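
To make "the LSN is early on in the page" concrete, here is a minimal
sketch of pulling the LSN out of a raw page image. It relies on pd_lsn
being the first 8 bytes of the page header, stored as a high 32-bit word
followed by a low 32-bit word. The little-endian byte order and the
helper names (page_lsn, format_lsn) are assumptions for illustration,
not anything taken from the patch under discussion.

```python
import struct

BLCKSZ = 8192  # default PostgreSQL block size


def page_lsn(page: bytes) -> int:
    """Return the LSN held in the first 8 bytes of a page image."""
    if len(page) < 8:
        raise ValueError("need at least the first 8 bytes of the page")
    # pd_lsn is two 32-bit halves: high word first, then low word.
    # Assumption: the data files were written on a little-endian machine.
    high, low = struct.unpack_from("<II", page, 0)
    return (high << 32) | low


def format_lsn(lsn: int) -> str:
    """Render an LSN in the usual high/low hexadecimal form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


if __name__ == "__main__":
    fake_page = struct.pack("<II", 0x1, 0x2F000028) + bytes(BLCKSZ - 8)
    print(format_lsn(page_lsn(fake_page)))
```

Since only the first 8 bytes matter here, any torn write that has
reached the page at all has necessarily gone through the sector that
holds the LSN first, provided writes land in order.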

> But the trick is that if the read sees the effect of the write somewhere
> in the middle of the page, the next read is guaranteed to see all the
> preceding new data.

If that's guaranteed then we can just check the LSN and be done.

> Without the checkpoint we risk seeing the same write() both in read and
> re-read, just in a different stage - so the LSN would not change, making
> the check futile.

This is the part that isn't making much sense to me. Suppose we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size. If we check the LSN first and find it's
"old", then read the rest of the page, calculate the checksum and find
it's bad, then on re-reading the page we *must* either see that the LSN
has changed or conclude that the checksum really is invalid.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off, with an 8K write
landing on the page before our read(8K) gets back to finishing. But if
what you're saying is true, then we can't ever have a case where that
happens and a re-read still sees the "old" LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half. That's a problem that I haven't got a solution to today. I'd
love to have a guarantee that it's not possible. We've certainly never
seen it, but it's been a concern, and I thought Michael was suggesting
he'd seen that. It sounds like there wasn't a check on the LSN in the
first read, though, in which case it could have just been a 'regular'
torn page case.

> But by waiting for the checkpoint we know that the original write is no
> longer in progress, so if we saw a partial write we're guaranteed to see
> a new LSN on re-read.
>
> This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the backup start LSN if we're talking about
doing a backup).

> > The question is if it’s possible to catch a torn page where the second
> > half is updated *before* the first half of the page in a read (and then
> > in subsequent reads having that state be maintained). I have some
> > skepticism that it’s really possible to happen in the first place but
> > having an interrupted system call be stalled across two more system
> > calls just seems terribly unlikely, and this is all based on the
> > assumption that the kernel might write the second half of a write before
> > the first to the kernel cache in the first place.
>
> Yes, if that was possible, the explanation about the checkpoint fsync
> guarantee would be bogus, obviously.
>
> I've spent quite a bit of time looking into how write() is handled, and
> I believe seeing only the second half is not possible. You may observe a
> page torn in various ways (not necessarily in half), e.g.
>
> [old,new,old]
>
> but then the re-read you should be guaranteed to see new data up until
> the last "new" chunk:
>
> [new,new,old]
>
> At least that's my understanding. I failed to deduce what POSIX says
> about this, or how it behaves on various OS/filesystems.
>
> The one thing I've done was writing a simple stress test that writes a
> single 8kB page in a loop, reads it concurrently and checks the
> behavior. And it seems consistent with my understanding.

Good.
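
For readers following along, the behaviour described above can be
illustrated with a small deterministic model. This is not the stress
test mentioned in the quoted text (that code isn't shown here); it is a
sketch that simply assumes the writer and the readers each copy a page
chunk by chunk, in order, and interleaves them according to a schedule
string. The function names and the schedule notation are made up for
the example.

```python
def run_schedule(n_chunks: int, schedule: str) -> list[list[str]]:
    """Interleave one in-order writer with in-order readers.

    'w' lets the writer copy its next chunk into the cache;
    'r' lets the current read copy its next chunk out of the cache.
    Each completed read is recorded and a fresh read starts at chunk 0.
    """
    cache = ["old"] * n_chunks
    written = 0                      # chunks the writer has copied so far
    reads: list[list[str]] = []
    current: list[str] = []
    for step in schedule:
        if step == "w":
            if written < n_chunks:
                cache[written] = "new"
                written += 1
        elif step == "r":
            current.append(cache[len(current)])
            if len(current) == n_chunks:
                reads.append(current)
                current = []
        else:
            raise ValueError(f"unknown step {step!r}")
    return reads


def last_new(read: list[str]) -> int:
    """Index of the last 'new' chunk in a read, or -1 if there is none."""
    return max((i for i, c in enumerate(read) if c == "new"), default=-1)


if __name__ == "__main__":
    first, second = run_schedule(3, "rwwrrrrr")
    print(first)   # the read that raced with the write
    print(second)  # the re-read, taken while the write is still unfinished
```

With the schedule "rwwrrrrr" the first read starts before the write,
gets overtaken, and then overtakes the writer again, while the re-read
sees new data everywhere up to the last "new" chunk of the first read.
Note that the model bakes in the in-order assumption, so it only
illustrates the claim; it cannot confirm what a given kernel or
filesystem actually does.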

> > Use that to compare to what? The LSN in the first half of the page
> > could be from well before the checkpoint or even the backup started.
>
> Not sure I follow. If the LSN in the page header is old, and the
> checksum check failed, then on re-read we either find a new LSN (in
> which case we skip the page) or consider this to be a checksum failure.

Right, I'm in agreement with doing that and it's what is done in
pg_basebackup and pgBackRest.
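
Roughly, the decision logic being agreed on could be sketched like
this. It is an illustration only, not the code in either tool: the
callables read_page, checksum_ok and page_lsn are hypothetical
stand-ins supplied by the caller, start_lsn stands for the checkpoint
or backup start LSN, and treating "new LSN" as "greater than start_lsn"
is an assumption about how the comparison would be done.

```python
from enum import Enum
from typing import Any, Callable


class Verdict(Enum):
    OK = "ok"            # checksum verified
    SKIPPED = "skipped"  # page changed since start_lsn; may be mid-write
    FAILED = "failed"    # old LSN and bad checksum on both reads


def verify_block(
    read_page: Callable[[], Any],        # hypothetical: returns one page image
    checksum_ok: Callable[[Any], bool],  # hypothetical: True if checksum matches
    page_lsn: Callable[[Any], int],      # hypothetical: LSN from the page header
    start_lsn: int,                      # checkpoint or backup start LSN
) -> Verdict:
    page = read_page()
    if checksum_ok(page):
        return Verdict.OK
    if page_lsn(page) > start_lsn:
        # Written after we started: possibly torn right now, so skip it.
        return Verdict.SKIPPED

    # Old LSN with a bad checksum: re-read once before deciding.
    page = read_page()
    if checksum_ok(page):
        return Verdict.OK
    if page_lsn(page) > start_lsn:
        return Verdict.SKIPPED
    return Verdict.FAILED
```

A caller would typically count SKIPPED blocks separately from FAILED
ones, since a skipped block says nothing either way about corruption.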

Thanks!

Stephen

In response to

Re: Online verification of checksums at 2018-09-17 21:33:52 from Tomas Vondra

Responses

Re: Online verification of checksums at 2018-09-18 00:34:35 from Tomas Vondra
