From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-20 16:08:27 |
Message-ID: | 20201120160827.GY16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Greetings,
* David Steele (david(at)pgmasters(dot)net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain os salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from for example David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed. I am adding Stephen as well in CC. From the code of
> >backrest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum. So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered as correct if the random data
> >passes the pd_upper and pg_lsn checks :/
>
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.
Yeah, it's been on our backlog to improve this.
> Our current plan for pgBackRest:
>
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.
Yup, that's my recollection also as to our plans for how to improve
things here.
> These do completely rule out any type of corruption, but they certainly
> narrows the possibility by a lot.
*don't :)
> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.
Yeah, that'd certainly be nice to do too.
> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.
Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.
> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".
+1.
Thanks,
Stephen
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2020-11-20 16:09:35 | Re: VACUUM (DISABLE_PAGE_SKIPPING on) |
Previous Message | Stephen Frost | 2020-11-20 16:03:49 | Re: Disable WAL logging to speed up data loading |