From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: a very minor bug and a couple of comment changes for basebackup.c
Date: 2023-03-06 16:07:07
Message-ID: CA+TgmoZ8NbvCaeTO43akNJw-9OWCV94yE7o+4+KbE2kRd2tO=w@mail.gmail.com
Lists: pgsql-hackers

Thanks for the review. I have committed the patches.
On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Seems right. I think that you should backpatch that, as
> VERIFY_CHECKSUMS is the default.
Done.
> There is more to it: the page LSN is checked before its checksum.
> Hence, if the page's LSN is corrupted in a way that makes it higher
> than sink->bbs_state->startptr, the checksum verification is just
> skipped, and the page is broken but not reported as such. (Spoiler: this has
> been mentioned in the past, and maybe we'd better remove this stuff in
> its entirety.)
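For anyone following along, here is a minimal sketch of the ordering
Michael is describing, condensed from the per-block loop in sendFile()
in basebackup.c (retry logic and error reporting elided):

    if (!PageIsNew(page) &&
        PageGetLSN(page) < sink->bbs_state->startptr)
    {
        /* The checksum is verified only when the LSN looks old enough. */
        checksum = pg_checksum_page((char *) page,
                                    blkno + segmentno * RELSEG_SIZE);
        if (((PageHeader) page)->pd_checksum != checksum)
            checksum_failures++;    /* reported as a WARNING later */
    }
    /*
     * Otherwise the page is assumed to have been modified after the
     * backup started and is not verified at all -- so garbage in the
     * LSN field that happens to compare higher than startptr silently
     * disables the check for that page.
     */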
Yep. It's pretty embarrassing that we have a checksum verification
feature with both known ways of producing false positives and known
ways of producing false negatives, and that we have no plan to ever
fix that; we're just going to keep shipping what we've got. I think
it's pretty clear that the feature shouldn't have been committed like
this; valid criticisms of the design were offered and simply not
addressed, not even by updating the comments or documentation with
disclaimers. I find the code in sendFile() pretty ugly, too. For all
of that, I'm a bit uncertain whether ripping it out is the right thing
to do. It might be (I'm not sure) that it tends to work decently well
in practice. Like, yes, it could produce false checksum warnings, but
does that actually happen to people? It's probably not too likely that
the page underneath an 8kB read() or write() gets updated when only
part of the I/O has been done, so the concurrent read or write is
fairly likely to be on-CPU, I would guess, and therefore this
algorithm might kind of work OK in practice despite being garbage on a
theoretical level. Similarly, the
problems with how the LSN is vetted make it likely that a page
replaced with random garbage will go undetected, but a page where a
few bytes get flipped in a random position within the page is likely
to get caught, and maybe the latter is actually a bigger risk than the
former. I don't really know.

I'd be interested to hear from anyone
with a lot of practical experience using the feature. A few anecdotes
of the form "this routine fails to tell us about problems" or "this
often complains about problems that are not real" or "this has really
helped us out on a number of occasions and we have had no problems
with it" would be really helpful.
On the other hand, we could just say that the code is nonsense and
therefore, regardless of practical experience, it ought to be removed.
I'm somewhat open to that idea, too.
--
Robert Haas
EDB: http://www.enterprisedb.com