From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Daniel Wood <hexexpert(at)comcast(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: BTP_DELETED leaf still in tree |
Date: | 2019-10-12 05:22:34 |
Message-ID: | CAH2-Wz=Regj-bgCrKQqnSekQoVOh3Bw2MLA-_SzDgBVx7j5bww@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Oct 11, 2019 at 12:44 AM Daniel Wood <hexexpert(at)comcast(dot)net> wrote:
> > Actually, I take it back -- the looping part is not normal. The
> > btpo_next->btpo_next page has no business linking back to the
> > original/first deleted page you mentioned. That's just odd.
>
> btpo_next->btpo_next does NOT link directly back to the 1st deleted page. It simply links to some in-use page which is 50 or so leaf pages back in the tree.
That sounds more normal.
> > Can you provide me with a dump of the page images? The easiest way of
> > getting a page dump is described here:
>
> Customer data. Looks like meaningless customer data (5 digit key values). But too much paperwork. :-)
Well, it was worth a try. ;-)
> The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node.
> It is that the step which sets BTP_DELETED should have also linked its leaf and right siblings together. But this hasn't been done.
Before the page becomes BTP_DELETED, it must first be BTP_HALF_DEAD.
And that is also the point where it should be impossible for scans to
reach the page, more or less (there is still that narrow window where
the downlink can followed just before its deleted, making the scan
land on the BTP_HALF_DEAD page -- I mentioned this in my first mail).
> Could the page have already have been dirty, but because of "target != leafblkno", we didn't stamp a new LSN on it. Could this allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE being flushed? Of course, I don't understand the "target != leafblkno".
The "target != leafblkno" thing concerns whether or not this is a
multi-level deletion (actually, that's not quite right, since even a
multi-level deletion has "target == leafblkno" at the point where it
finally gets to mark a half dead page fully deleted).
Yes, it's odd that this deleted page exists, even though its siblings
still link to it -- the distinction between a fully deleted page and a
half dead page is really just the fact that a fully deleted page is
supposed to not be linked to from anywhere, including still-live
siblings. But you don't have to get that far to see evidence of
corruption -- having a downlink pointing to a half-dead page is
evidence enough of corruption.
(Actually, it's more complicated than that -- see the comments in
amcheck's bt_downlink_check() function from Postgres 11 or 12.
Multi-level deletion is a case where a half-dead page has a downlink,
but the subtree undergoing deletion is still isolated in about the
same way as it is in the simple single level case, since the
"topparent" downlink is zapped at the same point that the leaf page is
marked half-dead. The important thing is that even half-dead pages are
not reachable by descending the tree, except for the tiny window where
the topparent downlink is observed the instant before it is zapped.)
If page deletion didn't exist, it would be so much easier to
understand the B-Tree code.
My guess is that there wasn't sufficient WAL to replay the page
deletion, but some of the buffers were written out. You might have
"gotten away with it" if the internal page also happened to be written
out along with everything else, but it just didn't work out that way.
Remember, there are two weird things about this, that overlap with two
distinct types of atomic operations: the fact that the downlink still
exists at all, and the fact that the sidelinks still exist at all.
This smells like a problem with slightly inconsistent page images, as
opposed to a problem with how one particular atomic operation did
something. It's not actually surprising that this would be the first
place that you'd notice a generic issue, since many other things are
"more forgiving" in various ways.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2019-10-12 05:59:16 | Re: [HACKERS] Block level parallel vacuum |
Previous Message | Masahiko Sawada | 2019-10-12 05:18:43 | Re: maintenance_work_mem used by Vacuum |