From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve search for missing parent downlinks in amcheck |
Date: | 2020-02-24 22:54:10 |
Message-ID: | CAPpHfdsUvbFiXjsLpQL1uW7DO+QfzGOWDqNaCEdbAYPAoGgsAQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi!
Thank you for your review. The revised version is attached.
On Wed, Feb 19, 2020 at 1:16 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > > Don't need the "!P_ISLEAF()" here.
> >
> > Why don't I need. bt_downlink_connectivity_check() checks one level
> > down to the target level. But there is no one level down to leaf...
>
> Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
> it's better your way, though -- apparently it's clearer.
Oh, I see. I prefer to leave it my way. Explicit check is nearly
free and makes it clearer for me.
> > > > Alternatively
> > > > + * we might find child with high key while traversing from
> > > > + * previous downlink to current one. Then matching key resides
> > > > + * the same offset number as current downlink.
> > > > + */
> > >
> > > Not sure what "traversing from previous downlink to current one" means at all.
> >
> > I've rephrased this comment, please check.
>
> > Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It
> > becomes even simpler now, thanks!
>
> There was actually an even better reason to invent
> bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
> to do something like the extra steps that you see in routines like
> invariant_l_offset(). _bt_compare() will return 0 when the insertion
> scankey has a prefix of scankey/column values that are equal, even
> though there may be additional columns in the index tuple that are not
> compared. So, you could have a truncated multi-column high key that is
> "equal" to pivot tuple in parent that is actually to the right in the
> key space. This blind spot would often occur with low cardinality
> indexes, where we often have something like this in pivot tuples on
> internal pages:
>
> 'foo, -inf'
> 'foo, (1,24)'
> 'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
> page that's filled with duplicates of the value 'foo'
> 'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
> leaf page that's filled with duplicates of the value 'food'
> ...
>
> The situation is really common in low cardinality indexes because
> nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
> is avoided whenever possible. You reliably get a '-inf' value for the
> TID in the first pivot tuple for the duplicate, followed by a real
> heap TID for later pivot tuples for pages with the same duplicate
> value.
>
Thank you for the explanation!
> > > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > > should be renamed to something broader. In my mind, downlink is
> > > basically a block number. We have been sloppy about using the term
> > > downlink when we really mean "pivot tuple with a downlink" -- I am
> > > guilty of this myself. But it seems more important, now that you have
> > > the new high key check.
> >
> > Hmm... Names are hard for me. I didn't do any renaming for now. What
> > about this?
> > bt_downlink_check() => bt_child_check()
> > bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()
>
> I suggest:
>
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_highkey_check()
>
> While bt_downlink_connectivity_check() moves right on the target's
> child level, this isn't the common case. Moving right like that
> doesn't need to be suggested by the name of the function.
>
> Most of the time, we just check the high key -- right?
Good. Renamed as you proposed.
> * You should say "target page lsn" here instead:
>
> pg(at)tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
> ERROR: mismatch between parent key and child high key in index "i_holding2"
> DETAIL: Target block=1570 child block=1690 parent page lsn=0/0.
> Time: 12.509 ms
Yep, fixed.
> * Maybe say "Move to the right on the child level" in a comment above
> the bt_downlink_connectivity_check() "while (true)" loop here:
Comment is added.
> > +
> > + while (true)
> > + {
> > + /*
> > + * Did we traverse the whole tree level and this is check for pages to
> > + * the right of rightmost downlink?
> > + */
>
> * If you are going to save a low key for the target page in memory,
> then you only need to do so for "state->readonly"/parent verification.
Sure, now it's just for state->readonly.
> * You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
> key". This is a term that nbtsort.c now uses, in case you didn't know.
Changed, thank you for pointing.
> * The reason for saving a low key for each target page is very
> unclear. Can we fix this?
Actually, lowkey is used for removing "cousin page verification blind
spot" when there are incomplete splits. It might happen that we read
child with hikey matching its parent high key only when
bt_child_highkey_check() is called for "minus infinity" key of parent
right sibling. Saving low key helps in this case.
> The low key hardly ever gets used -- I can comment out the code that
> uses a low key within bt_downlink_connectivity_check(), and indexes
> that I tested appear fine. So I imagine this is for incomplete split
> cases -- right?
>
> I tested incomplete split cases using this simple hack:
>
> diff --git a/src/backend/access/nbtree/nbtinsert.c
> b/src/backend/access/nbtree/nbtinsert.c
> index 4e5849ab8e..5811338584 100644
> --- a/src/backend/access/nbtree/nbtinsert.c
> +++ b/src/backend/access/nbtree/nbtinsert.c
> @@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel,
> */
> _bt_relbuf(rel, rbuf);
>
> - if (pbuf == InvalidBuffer)
> + if (random() <= (MAX_RANDOM_VALUE / 100))
> ereport(ERROR,
> (errcode(ERRCODE_INDEX_CORRUPTED),
> errmsg_internal("failed to re-find parent key in
> index \"%s\" for split pages %u/%u",
>
> Now, 1% of all page splits will fail after the first phase finishes,
> but before the second phase begins/finishes. The incomplete split flag
> will be found set by other, future inserts, though, at which point the
> split will be finished (unless we're unlucky again).
>
> An INSERT that inserts many rows to bound to fail with this hack
> applied, but the INSERT shouldn't leave the index in a state that
> looks like corruption to amcheck. I can see false positive reports of
> corruption like this, though.
>
> Consider the following sample session. The session inserts data from
> one table into another table with an equivalent schema. As you can
> see, I have to get my INSERT to fail three times before I see the
> problem:
>
> pg(at)tpce:5432 [13026]=# truncate holding2;
> TRUNCATE TABLE
> pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
> ERROR: failed to re-find parent key in index "pk_holding2" for split
> pages 83/158
> pg(at)tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true);
> DEBUG: verifying level 1 (true root level)
> DEBUG: verifying 200 items on internal block 3
> DEBUG: verifying level 0 (leaf level)
> DEBUG: verifying 262 items on leaf block 1
> DEBUG: verifying 258 items on leaf block 2
> *** SNIP ***
> DEBUG: verifying 123 items on leaf block 201
> DEBUG: verifying that tuples from index "i_holding2" are present in "holding2"
> DEBUG: finished verifying presence of 0 tuples from table "holding2"
> with bitset 4.83% set
> ┌───────────────────────┐
> │ bt_index_parent_check │ <-- No false positive yet
> ├───────────────────────┤
> │ │
> └───────────────────────┘
> (1 row)
>
> pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
> DEBUG: finishing incomplete split of 83/158
> ERROR: failed to re-find parent key in index "i_holding2" for split
> pages 435/436
> pg(at)tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
> DEBUG: verifying level 2 (true root level)
> DEBUG: verifying 2 items on internal block 303
> *** SNIP ***
> DEBUG: verifying 74 items on leaf block 436
> DEBUG: verifying that tuples from index "i_holding2" are present in "holding2"
> DEBUG: finished verifying presence of 0 tuples from table "holding2"
> with bitset 9.97% set
> ┌───────────────────────┐
> │ bt_index_parent_check │ <-- No false positive yet
> ├───────────────────────┤
> │ │
> └───────────────────────┘
> (1 row)
>
> pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
> ERROR: failed to re-find parent key in index "i_holding2" for split
> pages 331/577
> pg(at)tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
> DEBUG: verifying level 2 (true root level)
> DEBUG: verifying 3 items on internal block 303
> DEBUG: verifying level 1
> DEBUG: verifying 151 items on internal block 3
> DEBUG: verifying 189 items on internal block 521
> DEBUG: verifying 233 items on internal block 302
> WARNING: 577 rightsplit
> ERROR: leaf index block lacks downlink in index "i_holding2". <--
> False positive "corruption"
> DETAIL: Block=127 page lsn=0/0.
>
> Notice I added a custom WARNING here, and we get the false positive
> error just after the point that the WARNING is seen. This WARNING is
> triggered by the temporary instrumentation change that I added to
> amcheck, and can be seen in some cases that don't have an accompanying
> false positive report of corruption:
>
> diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
> index d6cb10e263..dcfee9e4fa 100644
> --- a/contrib/amcheck/verify_nbtree.c
> +++ b/contrib/amcheck/verify_nbtree.c
> @@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state,
> errmsg("circular link chain found in block %u of
> index \"%s\"",
> blkno, RelationGetRelationName(state->rel))));
>
> + if (rightsplit)
> + elog(WARNING, "%u rightsplit", blkno);
> if (!first && !P_IGNORE(opaque))
> {
> /* blkno probably has missing parent downlink */
> *** END DIFF ***
>
> Anyway, I didn't take the time to debug this today, but I suspect that
> the problem is that the lowkey thing is kind of brittle. If you can't
> figure it out, let me know and I'll help with it.
It appears that these false positives were cause by very basic error made here:
if (!first && !P_IGNORE(opaque))
{
/* blkno probably has missing parent downlink */
bt_downlink_missing_check(state, rightsplit, blkno, page);
}
actually it should be
if (blkno != downlink && !P_IGNORE(opaque))
{
/* blkno probably has missing parent downlink */
bt_downlink_missing_check(state, rightsplit, blkno, page);
}
So "blkno == downlink" means blkno has downlink, not being checked
first in the loop. This is remains of old version of patch which I
forget to clean. Now the check you've described works for me.
If you still think lowkey check is a problem, please help me figure it out.
> * Can we reverse the order here, so that the common case (i.e. the
> offset_is_negative_infinity() case) comes first?:
>
> > + if (offset_is_negative_infinity(topaque, pivotkey_offset))
> > + {
> > + /*
> > + * We're going to try match child high key to "negative
> > + * infinity key". That means we should match to high key of
> > + * left sibling of target page.
> > + */
> *** SNIP ***
>
> > + }
> > + else
> > + {
> *** SNIP ***
>
> > + }
>
> * Can the offset_is_negative_infinity() branch explain what's going on
> at a much higher level than this?
Sure, comment is revised!
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
amcheck-btree-improve-missing-parent-downlinks-check-7.patch | application/octet-stream | 31.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Merlin Moncure | 2020-02-24 22:59:37 | Re: Error on failed COMMIT |
Previous Message | Vik Fearing | 2020-02-24 22:37:45 | Re: Error on failed COMMIT |