Re: new heapcheck contrib module

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-07-16 19:38:19
Message-ID: CA+TgmobQUdfM2Q7WAR606RR2YehCZ5GHRKBc-VWfRj-1ZUKqDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> The v10 patch without these ideas is here:

Along the lines of what Alvaro was saying before, I think this
definitely needs to be split up into a series of patches. The commit
message for v10 describes it doing three pretty separate things, and I
think that argues for splitting it into a series of three patches. I'd
argue for this ordering:

0001 Refactoring existing amcheck btree checking functions to optionally
return corruption information rather than ereport'ing it. This is
used by the new pg_amcheck command line tool for reporting back to
the caller.

0002 Adding new function verify_heapam for checking a heap relation and
associated toast relation, if any, to contrib/amcheck.

0003 Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.

It's too hard to review things like this when it's all mixed together.

+++ b/contrib/amcheck/t/skipping.pl

The name of this file is inconsistent with the tree's usual
convention, which is all stuff like 001_whatever.pl, except for
src/test/modules/brin, which randomly decided to use two digits
instead of three. There's no precedent for a test file with no leading
numeric digits. Also, what does "skipping" even have to do with what
the test is checking? Maybe it's intended to refer to the new error
handling "skipping" the actual error in favor of just reporting it
without stopping, but that's not really what the word "skipping"
normally means. Finally, it seems a bit over-engineered: do we really
need 183 test cases to check that detecting a problem doesn't lead to
an abort? Like, if that's the purpose of the test, I'd expect it to
check one corrupt relation and one non-corrupt relation, each with and
without the no-error behavior. And that's about it. Or maybe it's
talking about skipping pages during the checks, because those pages
are all-visible or all-frozen? It's not very clear to me what's going
on here.

+ TransactionId nextKnownValidXid;
+ TransactionId oldestValidXid;

Please add explanatory comments indicating what these are intended to
mean. For most of the the structure members, the brief comments
already present seem sufficient; but here, more explanation looks
necessary and less is provided. The "Values for returning tuples"
could possibly also use some more detail.

+#define HEAPCHECK_RELATION_COLS 8

I think this should really be at the top of the file someplace.
Sometimes people have adopted this style when the #define is only used
within the function that contains it, but that's not the case here.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("unrecognized parameter for 'skip': %s", skip),
+ errhint("please choose from 'all visible', 'all frozen', "
+ "or NULL")));

I think it would be better if we had three string values selecting the
different behaviors, and made the parameter NOT NULL but with a
default. It seems like that would be easier to understand. Right now,
I can tell that my options for what to skip are "all visible", "all
frozen", and, uh, some other thing that I don't know what it is. I'm
gonna guess the third option is to skip nothing, but it seems best to
make that explicit. Also, should we maybe consider spelling this
'all-visible' and 'all-frozen' with dashes, instead of using spaces?
Spaces in an option value seems a little icky to me somehow.

+ int64 startblock = -1;
+ int64 endblock = -1;
...
+ if (!PG_ARGISNULL(3))
+ startblock = PG_GETARG_INT64(3);
+ if (!PG_ARGISNULL(4))
+ endblock = PG_GETARG_INT64(4);
...
+ if (startblock < 0)
+ startblock = 0;
+ if (endblock < 0 || endblock > ctx.nblocks)
+ endblock = ctx.nblocks;
+
+ for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)

So, the user can specify a negative value explicitly and it will be
treated as the default, and an endblock value that's larger than the
relation size will be treated as the relation size. The way pg_prewarm
does the corresponding checks seems superior: null indicates the
default value, and any non-null value must be within range or you get
an error. Also, you seem to be treating endblock as the first block
that should not be checked, whereas pg_prewarm takes what seems to me
to be the more natural interpretation: the end block is the last block
that IS checked. If you do it this way, then someone who specifies the
same start and end block will check no blocks -- silently, I think.

+ if (skip_all_frozen || skip_all_visible)

Since you can't skip all frozen without skipping all visible, this
test could be simplified. Or you could introduce a three-valued enum
and test that skip_pages != SKIP_PAGES_NONE, which might be even
better.

+ /* We must unlock the page from the prior iteration, if any */
+ Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);

I don't understand this assertion, and I don't understand the comment,
either. I think ctx.blkno can never be equal to InvalidBlockNumber
because we never set it to anything outside the range of 0..(endblock
- 1), and I think ctx.buffer must always be unequal to InvalidBuffer
because we just initialized it by calling ReadBufferExtended(). So I
think this assertion would still pass if we wrote && rather than ||.
But even then, I don't know what that has to do with the comment or
why it even makes sense to have an assertion for that in the first
place.

+ /*
+ * Open the relation. We use ShareUpdateExclusive to prevent concurrent
+ * vacuums from changing the relfrozenxid, relminmxid, or advancing the
+ * global oldestXid to be newer than those. This protection
saves us from
+ * having to reacquire the locks and recheck those minimums for every
+ * tuple, which would be expensive.
+ */
+ ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);

I don't think we'd need to recheck for every tuple, would we? Just for
cases where there's an apparent violation of the rules. I guess that
could still be expensive if there's a lot of them, but needing
ShareUpdateExclusiveLock rather than only AccessShareLock is a little
unfortunate.

It's also unclear to me why this concerns itself with relfrozenxid and
the cluster-wide oldestXid value but not with datfrozenxid. It seems
like if we're going to sanity-check the relfrozenxid against the
cluster-wide value, we ought to also check it against the
database-wide value. Checking neither would also seem like a plausible
choice. But it seems very strange to only check against the
cluster-wide value.

+ StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
+ "InvalidOffsetNumber
increments to FirstOffsetNumber");

If you are going to rely on this property, I agree that it is good to
check it. But it would be better to NOT rely on this property, and I
suspect the code can be written quite cleanly without relying on it.
And actually, that's what you did, because you first set ctx.offnum =
InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
the loop initializer. So AFAICS the first initializer, and the static
assert, are pointless.

+ if (ItemIdIsRedirected(ctx.itemid))
+ {
+ uint16 redirect = ItemIdGetRedirect(ctx.itemid);
+ if (redirect <= SizeOfPageHeaderData
|| redirect >= ph->pd_lower)
...
+ if ((redirect - SizeOfPageHeaderData)
% sizeof(uint16))

I think that ItemIdGetRedirect() returns an offset, not a byte
position. So the expectation that I would have is that it would be any
integer >= 0 and <= maxoff. Am I confused? BTW, it seems like it might
be good to complain if the item to which it points is LP_UNUSED...
AFAIK that shouldn't happen.

+ errmsg("\"%s\" is not a heap AM",

I think the correct wording would be just "is not a heap." The "heap
AM" is the thing in pg_am, not a specific table.

+confess(HeapCheckContext * ctx, char *msg)
+TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
+check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)

This is what happens when you pgindent without adding all the right
things to typedefs.list first ... or when you don't pgindent and have
odd ideas about how to indent things.

+ /*
+ * In principle, there is nothing to prevent a scan over a large, highly
+ * corrupted table from using workmem worth of memory building up the
+ * tuplestore. Don't leak the msg argument memory.
+ */
+ pfree(msg);

Maybe change the second sentence to something like: "That should be
OK, else the user can lower work_mem, but we'd better not leak any
additional memory."

+/*
+ * check_tuphdr_xids
+ *
+ * Determine whether tuples are visible for verification. Similar to
+ * HeapTupleSatisfiesVacuum, but with critical differences.
+ *
+ * 1) Does not touch hint bits. It seems imprudent to write hint bits
+ * to a table during a corruption check.
+ * 2) Only makes a boolean determination of whether verification should
+ * see the tuple, rather than doing extra work for vacuum-related
+ * categorization.
+ *
+ * The caller should already have checked that xmin and xmax are not out of
+ * bounds for the relation.
+ */

First, check_tuphdr_xids() doesn't seem like a very good name. If you
have a function with that name and, like this one, it returns Boolean,
what does true mean? What does false mean? Kinda hard to tell. And
also, check the tuple header XIDs *for what*? If you called it, say,
tuple_is_visible(), that would be self-evident.

Second, consider that we hold at least AccessShareLock on the relation
- actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
cannot be a concurrent modification to the tuple descriptor in
progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
potentially using a non-current schema. If the tuple is
HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
inserting transaction, or that transaction committed before we got our
lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
before we got our lock. Or if it's both inserted and deleted in the
same transaction, say, then that transaction committed before we got
our lock or else contains no relevant DDL. IOW, I think you can check
everything but dead tuples here.

Capitalization and punctuation for messages complaining about problems
need to be consistent. verify_heapam() has "Invalid redirect line
pointer offset %u out of bounds" which starts with a capital letter,
but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
case, but in any event it should be the same. Also,
check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
debugging leftover or a very unclear complaint. I think some real work
needs to be put into the phrasing of these messages so that it's more
clear exactly what is going on and why it's bad. For example the first
example in this paragraph is clearly a problem of some kind, but it's
not very clear exactly what is happening: is %u the offset of the
invalid line redirect or the value to which it points? I don't think
the phrasing is very grammatical, which makes it hard to tell which is
meant, and I actually think it would be a good idea to include both
things.

Project policy is generally against splitting a string across multiple
lines to fit within 80 characters. We like to fit within 80
characters, but we like to be able to grep for strings more, and
breaking them up like this makes that harder.

+ confess(ctx,
+ pstrdup("corrupt toast chunk va_header"));

This is another message that I don't think is very clear. There's two
elements to that. One is that the phrasing is not very good, and the
other is that there are no % escapes. What's somebody going to do when
they see this message? First, they're probably going to have to look
at the code to figure out in which circumstances it gets generated;
that's a sign that the message isn't phrased clearly enough. That will
tell them that an unexpected bit pattern has been found, but not what
that unexpected bit pattern actually was. So then, they're going to
have to try to find the relevant va_header by some other means and
fish out the relevant bit so that they can see what actually went
wrong.

+ * Checks the current attribute as tracked in ctx for corruption. Records
+ * any corruption found in ctx->corruption.
+ *
+ *

Extra blank line.

+ Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
+
ctx->attnum);

Maybe you could avoid the line wrap by declaring this without
initializing it, and then initializing it as a separate statement.

+ confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
+
ctx->tuphdr->t_hoff, ctx->offset,
+ ctx->lp_len));

Uggh! This isn't even remotely an English sentence. I don't think
formulas are the way to go here, but I like the idea of formulas in
some places and written-out messages in others even less. I guess the
complaint here in English is something like "tuple attribute %d should
start at offset %u, but tuple length is only %u" or something of that
sort. Also, it seems like this complaint really ought to have been
reported on the *preceding* loop iteration, either complaining that
(1) the fixed length attribute is more than the number of remaining
bytes in the tuple or (2) the varlena header for the tuple specifies
an excessively high length. It seems like you're blaming the wrong
attribute for the problem.

BTW, the header comments for this function (check_tuple_attribute)
neglect to document the meaning of the return value.

+ confess(ctx, psprintf("tuple xmax = %u
precedes relation "
+
"relfrozenxid = %u",

This is another example of these messages needing work. The
corresponding message from heap_prepare_freeze_tuple() is "found
update xid %u from before relfrozenxid %u". That's better, because we
don't normally include equals signs in our messages like this, and
also because "relation relfrozenxid" is redundant. I think this should
say something like "tuple xmax %u precedes relfrozenxid %u".

+ confess(ctx, psprintf("tuple xmax = %u is in
the future",
+ xmax));

And then this could be something like "tuple xmax %u follows
last-assigned xid %u". That would be more symmetric and more
informative.

+ if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
ctx->tuphdr->t_hoff)

I think we should be able to predict the exact value of t_hoff and
complain if it isn't precisely equal to the expected value. Or is that
not possible for some reason?

Is there some place that's checking that lp_len >=
SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
the header? If not, there should be.

+$node->command_ok(

+ [
+ 'pg_amcheck', '-p', $port, 'postgres'
+ ],
+ 'pg_amcheck all schemas and tables implicitly');
+
+$node->command_ok(
+ [
+ 'pg_amcheck', '-i', '-p', $port, 'postgres'
+ ],
+ 'pg_amcheck all schemas, tables and indexes');

I haven't really looked through the btree-checking and pg_amcheck
parts of this much yet, but this caught my eye. Why would the default
be to check tables but not indexes? I think the default ought to be to
check everything we know how to check.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2020-07-16 19:46:03 Re: Encoding of src/timezone/tznames/Europe.txt
Previous Message Tom Lane 2020-07-16 19:29:45 NaN divided by zero should yield NaN