From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, rootcause000(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows |
Date: | 2024-12-11 13:32:41 |
Message-ID: | CA+hUKGJSOEZWx9zRFFfOfDj9zbVmMbWKRq+ZzMCXftHcO1UYQQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
(Replies to separate messages from Michael and Robert below.)
On Tue, Oct 29, 2024 at 8:49 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> I was looking at what you have here, and the split with
> smgrtruncatefrom() to do the allocations in _mdfd_openseg() for
> _mdfd_segpath() and _fdvec_resize() before entering in the critical
> section for the physical truncation is elegant.
Thanks for the sanity check. It took a few goes around to land there.
> Perhaps this should have an assert based on CritSectionCount==0 to
> force the rule.
Added.
> Don't you think that we'd better have a regression test on HEAD at
> least? It should not be complicated. I can create one if you want,
> perhaps for later if we want to catch the next minor release train.
I'm probably lacking imagination here; I can see that it's useful to
use the injection tooling for repros without the fix, but the fix is
mostly a critical section. We know what critical sections do, they
hold off interrupts and explode quite reliably on error, so there is
not much left to test once it's fixed, no?
> Making folks aware of the problem sounds kind of sensible seen from
> here. In short, changing the signature of smgr_truncate() in a minor
> release to fix what's a severe data corruption issue takes priority
> IMO.
Yeah. I'd avoid it if we had another way, but in the absence of epiphanies...
On Fri, Nov 15, 2024 at 4:51 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I took a look at v4-0001 today and I think it looks fine. While I'm
> not opposed to adding a test case, I think it's more important to fix
> the bug at this point than to wait longer for a test case to show up.
Thanks for taking a look! I started out to re-test and hopefully
commit this today (having already committed 75818b3a from upthread
last week), but while reviewing the reviews, this turned out to be a
thing:
> I do wonder whether the new smgrtruncatefrom() should be used
> everywhere instead of just from one of the call sites, but even if
> that's desirable long-term, doing this much is still a lot better than
> doing nothing. This data corrupting bug has been known and understood
> for more than 4 years at this point.
There are only three callers of smgrtruncate() in our tree:
1. This one.
2. Its redo-time counterpart.
3. pg_truncate_visibility_map(), in contrib/pg_visibility.
The recovery environment promotes errors of both types to FATAL (the
"we have no handler" case in errstart()), and the postmaster handles
startup failure a bit like a PANIC. Or do you think PANIC and more
similar code would be better?
As for pg_truncate_visibility_map(), it is a "a cut-down version of
RelationTruncate". It logs in a different order (which probably
doesn't really matter as it doesn't flush anyway) and it didn't get
the memo about checkpoints. Maybe it's not such a big deal to be
sloppy about whether or not the _vm file is truncated or not after
crash recovery, given the semantics, I'm not sure. It surely has a
version of the zombie tuples bug, though: If a _vm bit recently
changed from "visible" to "not visible" and hasn't been written to
disk yet, and it is booted out of the buffer pool by a call to
pg_truncate_visibility_map() that is then cancelled in a call to
WaitIO() for some other buffer so that ftruncate() is not reached,
then an index-only scan might read in the old disk copy and show you a
bunch of index tuples that are supposed to be invisible.
I am curious about the non-flushing though. RelationTruncate() also
doesn't flush if there is no fsm or vm. Perhaps it always flushes in
practice since VACUUM itself creates those, I'm not entirely sure yet.
If the log-first rule is abandoned and the flush is allowed to happen
after the critical section, isn't there an opportunity for a primary
to insert into but not flush the WAL, truncate the file, power-cycle,
run recovery, and then carry on without ever telling a standby to
truncate? Maybe that's OK for _vm (maybe the standby's untruncated
copy of _vm would still be correct, and be maintained correctly by heap
operations and everything would be fine, I'm not sure), by why would we
want to allow this desynchronisation/confusion, just to make an
infrequently used debugging tool go a bit faster? As for
RelationTruncate(), if the non-flushing case is reachable, it would
seem more obviously worse: it could leave the standby with a bunch of
invisible tuples that the primary doesn't have, and eventually they'd
either reference truncated CLOG or become visible again via wraparound
rebirth, no? Am I missing something? Why wouldn't we just make both
places flush unconditionally?
PFA the first sketch of a patch to make pg_truncate_visibility_map()
into... a cut-down version of RelationTruncate(). Still looking into
the flushing, without which the comments in the patch are entirely
bogus.
Attachment | Content-Type | Size |
---|---|---|
v5-0001-Fix-corruption-on-failed-interrupted-truncation.patch | application/x-patch | 12.4 KB |
v5-0002-Fix-pg_truncate_visibility_map-protocol.patch | application/x-patch | 3.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Марат Гасанян | 2024-12-11 14:47:23 | Problem with constraint unique. |
Previous Message | Artur Zakirov | 2024-12-11 11:38:59 | Re: pg_dump crash on identity sequence with not loaded attributes |