From: | Amul Sul <sulamul(at)gmail(dot)com> |
---|---|
To: | Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: new heapcheck contrib module |
Date: | 2020-07-21 06:50:52 |
Message-ID: | CAAJ_b94neVOO4PkJLFWDFN3oJL3B02a29m6_A2sJ+w127J3Hfg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
>
> Hi Mark,
>
> I think new structures should be listed in src/tools/pgindent/typedefs.list,
> otherwise, pgindent might disturb its indentation.
>
> Regards,
> Amul
>
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> >
> >
> >
> > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> > >> The v10 patch without these ideas is here:
> > >
> > > Along the lines of what Alvaro was saying before, I think this
> > > definitely needs to be split up into a series of patches. The commit
> > > message for v10 describes it doing three pretty separate things, and I
> > > think that argues for splitting it into a series of three patches. I'd
> > > argue for this ordering:
> > >
> > > 0001 Refactoring existing amcheck btree checking functions to optionally
> > > return corruption information rather than ereport'ing it. This is
> > > used by the new pg_amcheck command line tool for reporting back to
> > > the caller.
> > >
> > > 0002 Adding new function verify_heapam for checking a heap relation and
> > > associated toast relation, if any, to contrib/amcheck.
> > >
> > > 0003 Adding new contrib module pg_amcheck, which is a command line
> > > interface for running amcheck's verifications against tables and
> > > indexes.
> > >
> > > It's too hard to review things like this when it's all mixed together.
> >
> > The v11 patch series is broken up as you suggest.
> >
> > > +++ b/contrib/amcheck/t/skipping.pl
> > >
> > > The name of this file is inconsistent with the tree's usual
> > > convention, which is all stuff like 001_whatever.pl, except for
> > > src/test/modules/brin, which randomly decided to use two digits
> > > instead of three. There's no precedent for a test file with no leading
> > > numeric digits. Also, what does "skipping" even have to do with what
> > > the test is checking? Maybe it's intended to refer to the new error
> > > handling "skipping" the actual error in favor of just reporting it
> > > without stopping, but that's not really what the word "skipping"
> > > normally means. Finally, it seems a bit over-engineered: do we really
> > > need 183 test cases to check that detecting a problem doesn't lead to
> > > an abort? Like, if that's the purpose of the test, I'd expect it to
> > > check one corrupt relation and one non-corrupt relation, each with and
> > > without the no-error behavior. And that's about it. Or maybe it's
> > > talking about skipping pages during the checks, because those pages
> > > are all-visible or all-frozen? It's not very clear to me what's going
> > > on here.
> >
> > The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks. I have renamed it 001_verify_heapam.pl, since it tests that function.
> >
> > >
> > > + TransactionId nextKnownValidXid;
> > > + TransactionId oldestValidXid;
> > >
> > > Please add explanatory comments indicating what these are intended to
> > > mean.
> >
> > Done.
> >
> > > For most of the the structure members, the brief comments
> > > already present seem sufficient; but here, more explanation looks
> > > necessary and less is provided. The "Values for returning tuples"
> > > could possibly also use some more detail.
> >
> > Ok, I've expanded the comments for these.
> >
> > > +#define HEAPCHECK_RELATION_COLS 8
> > >
> > > I think this should really be at the top of the file someplace.
> > > Sometimes people have adopted this style when the #define is only used
> > > within the function that contains it, but that's not the case here.
> >
> > Done.
> >
> > >
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > > + errhint("please choose from 'all visible', 'all frozen', "
> > > + "or NULL")));
> > >
> > > I think it would be better if we had three string values selecting the
> > > different behaviors, and made the parameter NOT NULL but with a
> > > default. It seems like that would be easier to understand. Right now,
> > > I can tell that my options for what to skip are "all visible", "all
> > > frozen", and, uh, some other thing that I don't know what it is. I'm
> > > gonna guess the third option is to skip nothing, but it seems best to
> > > make that explicit. Also, should we maybe consider spelling this
> > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > > Spaces in an option value seems a little icky to me somehow.
> >
> > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. I did not mark the function as strict, as I think NULL is a reasonable value (and the default) for startblock and endblock.
> >
> > > + int64 startblock = -1;
> > > + int64 endblock = -1;
> > > ...
> > > + if (!PG_ARGISNULL(3))
> > > + startblock = PG_GETARG_INT64(3);
> > > + if (!PG_ARGISNULL(4))
> > > + endblock = PG_GETARG_INT64(4);
> > > ...
> > > + if (startblock < 0)
> > > + startblock = 0;
> > > + if (endblock < 0 || endblock > ctx.nblocks)
> > > + endblock = ctx.nblocks;
> > > +
> > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> > >
> > > So, the user can specify a negative value explicitly and it will be
> > > treated as the default, and an endblock value that's larger than the
> > > relation size will be treated as the relation size. The way pg_prewarm
> > > does the corresponding checks seems superior: null indicates the
> > > default value, and any non-null value must be within range or you get
> > > an error. Also, you seem to be treating endblock as the first block
> > > that should not be checked, whereas pg_prewarm takes what seems to me
> > > to be the more natural interpretation: the end block is the last block
> > > that IS checked. If you do it this way, then someone who specifies the
> > > same start and end block will check no blocks -- silently, I think.
> >
> > Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block", and for relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception. I don't like that too much, but I'm happy to defer to precedent. Since you say pg_prewarm works this way (I did not check), I have changed verify_heapam to do likewise.
> >
> > > + if (skip_all_frozen || skip_all_visible)
> > >
> > > Since you can't skip all frozen without skipping all visible, this
> > > test could be simplified. Or you could introduce a three-valued enum
> > > and test that skip_pages != SKIP_PAGES_NONE, which might be even
> > > better.
> >
> > It works now with a three-valued enum.
> >
> > > + /* We must unlock the page from the prior iteration, if any */
> > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
> > >
> > > I don't understand this assertion, and I don't understand the comment,
> > > either. I think ctx.blkno can never be equal to InvalidBlockNumber
> > > because we never set it to anything outside the range of 0..(endblock
> > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> > > because we just initialized it by calling ReadBufferExtended(). So I
> > > think this assertion would still pass if we wrote && rather than ||.
> > > But even then, I don't know what that has to do with the comment or
> > > why it even makes sense to have an assertion for that in the first
> > > place.
> >
> > Yes, it is vestigial. Removed.
> >
> > > + /*
> > > + * Open the relation. We use ShareUpdateExclusive to prevent concurrent
> > > + * vacuums from changing the relfrozenxid, relminmxid, or advancing the
> > > + * global oldestXid to be newer than those. This protection
> > > saves us from
> > > + * having to reacquire the locks and recheck those minimums for every
> > > + * tuple, which would be expensive.
> > > + */
> > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> > >
> > > I don't think we'd need to recheck for every tuple, would we? Just for
> > > cases where there's an apparent violation of the rules.
> >
> > It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and arbitrarily much. It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how often you check. Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really defensible choice for that. I removed the offending sentence.
> >
> > > I guess that
> > > could still be expensive if there's a lot of them, but needing
> > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > > unfortunate.
> >
> > I welcome strategies that would allow for taking a lesser lock.
> >
> > > It's also unclear to me why this concerns itself with relfrozenxid and
> > > the cluster-wide oldestXid value but not with datfrozenxid. It seems
> > > like if we're going to sanity-check the relfrozenxid against the
> > > cluster-wide value, we ought to also check it against the
> > > database-wide value. Checking neither would also seem like a plausible
> > > choice. But it seems very strange to only check against the
> > > cluster-wide value.
> >
> > If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid. Otherwise, each row needs to be compared against some other minimum xid value.
> >
> > Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at least as old as any valid row in the table and no older than the oldest valid xid for the cluster.
> >
> > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly, the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade. This is awful. If I compare the xid of a row in a table against the oldest xid value for the database, and the xid of the row is older, what can I do? I don't have a principled basis for determining which one of them is wrong.
> >
> > The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if it does report a row as corrupt, you can bank on that, bugs in verify_heapam itself not withstanding. I think this is a good choice; a tool with only false negatives is much more useful than one with both false positives and false negatives.
> >
> > I have added a comment about my reasoning to verify_heapam.c. I'm happy to be convinced of a better strategy for handling this situation.
> >
> > >
> > > + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > > + "InvalidOffsetNumber
> > > increments to FirstOffsetNumber");
> > >
> > > If you are going to rely on this property, I agree that it is good to
> > > check it. But it would be better to NOT rely on this property, and I
> > > suspect the code can be written quite cleanly without relying on it.
> > > And actually, that's what you did, because you first set ctx.offnum =
> > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > > the loop initializer. So AFAICS the first initializer, and the static
> > > assert, are pointless.
> >
> > Ah, right you are. Removed.
> >
> > >
> > > + if (ItemIdIsRedirected(ctx.itemid))
> > > + {
> > > + uint16 redirect = ItemIdGetRedirect(ctx.itemid);
> > > + if (redirect <= SizeOfPageHeaderData
> > > || redirect >= ph->pd_lower)
> > > ...
> > > + if ((redirect - SizeOfPageHeaderData)
> > > % sizeof(uint16))
> > >
> > > I think that ItemIdGetRedirect() returns an offset, not a byte
> > > position. So the expectation that I would have is that it would be any
> > > integer >= 0 and <= maxoff. Am I confused?
> >
> > I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive. I have updated the checks.
> >
> > > BTW, it seems like it might
> > > be good to complain if the item to which it points is LP_UNUSED...
> > > AFAIK that shouldn't happen.
> >
> > Thanks for mentioning that. It now checks for that.
> >
> > > + errmsg("\"%s\" is not a heap AM",
> > >
> > > I think the correct wording would be just "is not a heap." The "heap
> > > AM" is the thing in pg_am, not a specific table.
> >
> > Fixed.
> >
> > > +confess(HeapCheckContext * ctx, char *msg)
> > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> > >
> > > This is what happens when you pgindent without adding all the right
> > > things to typedefs.list first ... or when you don't pgindent and have
> > > odd ideas about how to indent things.
> >
> > Hmm. I don't see the three lines of code you are quoting. Which patch is that from?
> >
> > >
> > > + /*
> > > + * In principle, there is nothing to prevent a scan over a large, highly
> > > + * corrupted table from using workmem worth of memory building up the
> > > + * tuplestore. Don't leak the msg argument memory.
> > > + */
> > > + pfree(msg);
> > >
> > > Maybe change the second sentence to something like: "That should be
> > > OK, else the user can lower work_mem, but we'd better not leak any
> > > additional memory."
> >
> > It may be a little wordy, but I went with
> >
> > /*
> > * In principle, there is nothing to prevent a scan over a large, highly
> > * corrupted table from using workmem worth of memory building up the
> > * tuplestore. That's ok, but if we also leak the msg argument memory
> > * until the end of the query, we could exceed workmem by more than a
> > * trivial amount. Therefore, free the msg argument each time we are
> > * called rather than waiting for our current memory context to be freed.
> > */
> >
> > > +/*
> > > + * check_tuphdr_xids
> > > + *
> > > + * Determine whether tuples are visible for verification. Similar to
> > > + * HeapTupleSatisfiesVacuum, but with critical differences.
> > > + *
> > > + * 1) Does not touch hint bits. It seems imprudent to write hint bits
> > > + * to a table during a corruption check.
> > > + * 2) Only makes a boolean determination of whether verification should
> > > + * see the tuple, rather than doing extra work for vacuum-related
> > > + * categorization.
> > > + *
> > > + * The caller should already have checked that xmin and xmax are not out of
> > > + * bounds for the relation.
> > > + */
> > >
> > > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > > have a function with that name and, like this one, it returns Boolean,
> > > what does true mean? What does false mean? Kinda hard to tell. And
> > > also, check the tuple header XIDs *for what*? If you called it, say,
> > > tuple_is_visible(), that would be self-evident.
> >
> > Changed.
> >
> > > Second, consider that we hold at least AccessShareLock on the relation
> > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > > cannot be a concurrent modification to the tuple descriptor in
> > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > > potentially using a non-current schema. If the tuple is
> > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > > inserting transaction, or that transaction committed before we got our
> > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > > before we got our lock. Or if it's both inserted and deleted in the
> > > same transaction, say, then that transaction committed before we got
> > > our lock or else contains no relevant DDL. IOW, I think you can check
> > > everything but dead tuples here.
> >
> > Ok, I have changed tuple_is_visible to return true rather than false for those other cases.
> >
> > > Capitalization and punctuation for messages complaining about problems
> > > need to be consistent. verify_heapam() has "Invalid redirect line
> > > pointer offset %u out of bounds" which starts with a capital letter,
> > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > > case, but in any event it should be the same.
> >
> > I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone.
> >
> > > Also,
> > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > > debugging leftover or a very unclear complaint.
> >
> > Right. That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation".
> >
> > > I think some real work
> > > needs to be put into the phrasing of these messages so that it's more
> > > clear exactly what is going on and why it's bad. For example the first
> > > example in this paragraph is clearly a problem of some kind, but it's
> > > not very clear exactly what is happening: is %u the offset of the
> > > invalid line redirect or the value to which it points? I don't think
> > > the phrasing is very grammatical, which makes it hard to tell which is
> > > meant, and I actually think it would be a good idea to include both
> > > things.
> >
> > Beware that every row returned from amcheck has more fields than just the error message.
> >
> > blkno OUT bigint,
> > offnum OUT integer,
> > lp_off OUT smallint,
> > lp_flags OUT smallint,
> > lp_len OUT smallint,
> > attnum OUT integer,
> > chunk OUT integer,
> > msg OUT text
> >
> > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to remove these things from messages that include them. For the specific message under consideration, I've converted the text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u". That avoids duplicating the offset information of the referring item, while reporting to offset of the referred item.
> >
> > > Project policy is generally against splitting a string across multiple
> > > lines to fit within 80 characters. We like to fit within 80
> > > characters, but we like to be able to grep for strings more, and
> > > breaking them up like this makes that harder.
> >
> > Thanks for clarifying the project policy. I joined these message strings back together.
In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
be joined.
e.g:
+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot "
+ "accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("materialize mode required, but it is not allowed "
+ "in this context")));
> >
> > > + confess(ctx,
> > > + pstrdup("corrupt toast chunk va_header"));
> > >
> > > This is another message that I don't think is very clear. There's two
> > > elements to that. One is that the phrasing is not very good, and the
> > > other is that there are no % escapes
> >
> > Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x". I think all the other information about where the corruption was found is already present in the other returned columns.
> >
> > > What's somebody going to do when
> > > they see this message? First, they're probably going to have to look
> > > at the code to figure out in which circumstances it gets generated;
> > > that's a sign that the message isn't phrased clearly enough. That will
> > > tell them that an unexpected bit pattern has been found, but not what
> > > that unexpected bit pattern actually was. So then, they're going to
> > > have to try to find the relevant va_header by some other means and
> > > fish out the relevant bit so that they can see what actually went
> > > wrong.
> >
> > Right.
> >
> > >
> > > + * Checks the current attribute as tracked in ctx for corruption. Records
> > > + * any corruption found in ctx->corruption.
> > > + *
> > > + *
> > >
> > > Extra blank line.
> >
> > Fixed.
> >
> > > + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
> > > +
> > > ctx->attnum);
> > >
> > > Maybe you could avoid the line wrap by declaring this without
> > > initializing it, and then initializing it as a separate statement.
> >
> > Yes, I like that better. I did not need to do the same with infomask, but it looks better to me to break the declaration and initialization for both, so I did that.
> >
> > >
> > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
> > > +
> > > ctx->tuphdr->t_hoff, ctx->offset,
> > > + ctx->lp_len));
> > >
> > > Uggh! This isn't even remotely an English sentence. I don't think
> > > formulas are the way to go here, but I like the idea of formulas in
> > > some places and written-out messages in others even less. I guess the
> > > complaint here in English is something like "tuple attribute %d should
> > > start at offset %u, but tuple length is only %u" or something of that
> > > sort. Also, it seems like this complaint really ought to have been
> > > reported on the *preceding* loop iteration, either complaining that
> > > (1) the fixed length attribute is more than the number of remaining
> > > bytes in the tuple or (2) the varlena header for the tuple specifies
> > > an excessively high length. It seems like you're blaming the wrong
> > > attribute for the problem.
> >
> > Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute to blame it on. I've changed it to report as you suggest, although it also still complains if the first attribute starts outside the bounds of the tuple. The two error messages now read as "tuple attribute should start at offset %u, but tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u".
> >
> > > BTW, the header comments for this function (check_tuple_attribute)
> > > neglect to document the meaning of the return value.
> >
> > Fixed.
> >
> > > + confess(ctx, psprintf("tuple xmax = %u
> > > precedes relation "
> > > +
> > > "relfrozenxid = %u",
> > >
> > > This is another example of these messages needing work. The
> > > corresponding message from heap_prepare_freeze_tuple() is "found
> > > update xid %u from before relfrozenxid %u". That's better, because we
> > > don't normally include equals signs in our messages like this, and
> > > also because "relation relfrozenxid" is redundant. I think this should
> > > say something like "tuple xmax %u precedes relfrozenxid %u".
> > >
> > > + confess(ctx, psprintf("tuple xmax = %u is in
> > > the future",
> > > + xmax));
> > >
> > > And then this could be something like "tuple xmax %u follows
> > > last-assigned xid %u". That would be more symmetric and more
> > > informative.
> >
> > Both of these have been changed.
> >
> > > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> > > ctx->tuphdr->t_hoff)
> > >
> > > I think we should be able to predict the exact value of t_hoff and
> > > complain if it isn't precisely equal to the expected value. Or is that
> > > not possible for some reason?
> >
> > That is possible, and I've updated the error message to match. There are cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected. That sidesteps the problem of not knowing exactly which value to blame.
> >
> > > Is there some place that's checking that lp_len >=
> > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> > > the header? If not, there should be.
> >
> > Good catch. check_tuple() now does that before reading the header.
> >
> > > +$node->command_ok(
> > >
> > > + [
> > > + 'pg_amcheck', '-p', $port, 'postgres'
> > > + ],
> > > + 'pg_amcheck all schemas and tables implicitly');
> > > +
> > > +$node->command_ok(
> > > + [
> > > + 'pg_amcheck', '-i', '-p', $port, 'postgres'
> > > + ],
> > > + 'pg_amcheck all schemas, tables and indexes');
> > >
> > > I haven't really looked through the btree-checking and pg_amcheck
> > > parts of this much yet, but this caught my eye. Why would the default
> > > be to check tables but not indexes? I think the default ought to be to
> > > check everything we know how to check.
> >
> > I have changed the default to match your expectations.
> >
> >
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >
From | Date | Subject | |
---|---|---|---|
Next Message | @usernamedt | 2020-07-21 07:51:56 | WAL segment switch on pg_start_backup() |
Previous Message | Michael Paquier | 2020-07-21 05:32:31 | Re: OpenSSL 3.0.0 compatibility |