Quick Links

Re: new heapcheck contrib module

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: new heapcheck contrib module
Date:	2021-01-07 07:05:27
Message-ID:	B5CF5BFF-0A92-4792-868E-3DB4E8F1C389@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> On Nov 19, 2020, at 9:06 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Oct 26, 2020 at 12:12 PM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase. (0001 was already committed last week.)
>>
>> Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming. There are no substantial changes.
>
> Here's a review of 0002. I basically like the direction this is going
> but I guess nobody will be surprised that there are some things in
> here that I think could be improved.

Thanks for the review!

The tools pg_dump and pg_amcheck both need to allow the user to specify which schemas, tables, and indexes either to dump or to check. There are command line options in pg_dump for this purpose, and functions for compiling lists of corresponding database objects. In prior versions of the pg_amcheck patch, I did some copy-and-pasting of this logic, and then had to fix up the copied functions a bit, given that pg_dump has its own ecosystem with things like fatal() and exit_nicely() and such.

In hindsight, it would have been better to factor these functions out into a shared location. I have done that, factoring them into fe_utils, and am attaching a series of patches that accomplishes that refactoring. Here are some brief explanations of what these are for. See also the commit comments in each patch:

v3-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch

pg_dump allows on-exit callbacks to be registered, which it expects to get called when exit_nicely() is invoked. It doesn't work to factor functions out of pg_dump without having this infrastructure, as the functions being factored out include facilities for logging and exiting on error. Therefore, moving these functions into fe_utils.

v3-0002-Refactoring-ExecuteSqlQuery-and-related-functions.patch

pg_dump has functions for running queries, but those functions take a pg_dump specific argument of type Archive rather than PGconn, with the expectation that the Archive's connection will be used. This has to be cleaned up a bit before these functions can be moved out of pg_dump to a shared location. Also, pg_dump has a fixed expectation that when a query fails, specific steps will be taken to print out the error information and exit. That's reasonable behavior, but not all callers will want that. Since the ultimate goal of this refactoring is to have higher level functions that translate shell patterns into oid lists, it's reasonable to imagine that not all callers will want to exit if the query fails. In particular, pg_amcheck won't want errors to automatically trigger exit() calls, given that pg_amcheck tries to continue in the face of errors. Therefore, adding a default error handler that does what pg_dump expects, but with an eye towards other callers being able to define handlers that behave differently.

v3-0003-Creating-query_utils-frontend-utility.patch

Moving the refactored functions to the shared location in fe_utils. This is kept separate from 0002 for ease of review.

v3-0004-Adding-CurrentQueryHandler-logic.patch

Extending the query error handling logic begun in the 0002 patch. It wasn't appropriate in the pg_dump project, but now the logic is in fe_utils.

v3-0005-Refactoring-pg_dumpall-functions.patch

Refactoring some remaining functions in the pg_dump project to use the new fe_utils facilities.

v3-0006-Refactoring-expand_schema_name_patterns-and-frien.patch

Refactoring functions in pg_dump that expand a list of patterns into a list of matching database objects. Specifically, changing them to no take pg_dump specific argument types, just as was done in 0002.

v3-0007-Moving-pg_dump-functions-to-new-file-option_utils.patch

Moving the functions refactored in 0006 into a new location fe_utils/option_utils

v3-0008-Normalizing-option_utils-interface.patch

Reworking the functions moved in 0007 to have a more general purpose interface. The refactoring in 0006 only went so far as to make the functions moveable out of pg_dump. This refactoring is intentionally kept separate for ease of review.

v3-0009-Adding-contrib-module-pg_amcheck.patch

Adding contrib/pg_amcheck project, about which your review comments below apply.

Not included in this patch set, but generated during the development of this patch set, I refactored processSQLNamePattern. string_utils mixes the logic for converting a shell-style pattern into a SQL style regex with the logic of performing the sql query to look up matching database objects. That makes it hard to look up multiple patterns in a single query, something that an intermediate version of this patch set was doing. I ultimately stopped doing that, as the code was overly complex, but the refactoring of processSQLNamePattern is not over-complicated and probably has some merit in its own right. Since it is not related to the pg_amcheck code, I expect that I will be posting that separately.

Also not included in this patch set, but likely to be in the next rev, is a patch that adds more interesting table and index corruption via PostgresNode, creating torn pages and such. That work is complete so far as I know, but I don't have all the regression tests that use it written yet, so I'll hold off posting it for now.

Not yet written but still needed is the parallelization of the checking. I'll be working on that for the next patch set.

There is enough work here in need of review that I'm posting this now, hoping to get feedback on the general direction I'm going with this.

To your review....

>
> +const char *usage_text[] = {
> + "pg_amcheck is the PostgreSQL command line frontend for the
> amcheck database corruption checker.",
> + "",
>
> This looks like a novel approach to the problem of printing out the
> usage() information, and I think that it's inferior to the technique
> used elsewhere of just having a bunch of printf() statements, because
> unless I misunderstand, it doesn't permit localization.

Since contrib modules are not localized, it seemed not to be a problem, but you've raised the question of whether pg_amcheck might be moved into core. I've changed it as suggested so that such a move would incur less code churn. The advantage to how I had it before was that each line was a bit shorter, making it fit better into the 80 column limit.

> + " -b, --startblock begin checking table(s) at the
> given starting block number",
> + " -e, --endblock check table(s) only up to the
> given ending block number",
> + " -B, --toast-startblock begin checking toast table(s)
> at the given starting block",
> + " -E, --toast-endblock check toast table(s) only up
> to the given ending block",
>
> I am not very convinced by this. What's the use case? If you're just
> checking a single table, you might want to specify a start and end
> block, but then you don't need separate options for the TOAST and
> non-TOAST cases, do you? If I want to check pg_statistic, I'll say
> pg_amcheck -t pg_catalog.pg_statistic. If I want to check the TOAST
> table for pg_statistic, I'll say pg_amcheck -t pg_toast.pg_toast_2619.
> In either case, if I want to check just the first three blocks, I can
> add -b 0 -e 2.

Removed -B, --toast-startblock and -E, --toast-endblock.

>
> + " -f, --skip-all-frozen do NOT check blocks marked as
> all frozen",
> + " -v, --skip-all-visible do NOT check blocks marked as
> all visible",
>
> I think this is using up too many one character option names for too
> little benefit on things that are too closely related. How about, -s,
> --skip=all-frozen|all-visible|none?

I'm already using -s for "strict-names', but I implemented your suggestion with -S, --skip

> And then -v could mean verbose,
> which could trigger things like printing all the queries sent to the
> server, setting PQERRORS_VERBOSE, etc.

I added -v, --verbose as you suggest.

> + " -x, --check-indexes check btree indexes associated
> with tables being checked",
> + " -X, --skip-indexes do NOT check any btree indexes",
> + " -i, --index=PATTERN check the specified index(es) only",
> + " -I, --exclude-index=PATTERN do NOT check the specified index(es)",
>
> This is a lotta controls for something that has gotta have some
> default. Either the default is everything, in which case I don't see
> why I need -x, or it's nothing, in which case I don't see why I need
> -X.

I removed -x, --check-indexes and instead made that the default.

>
> + " -c, --check-corrupt check indexes even if their
> associated table is corrupt",
> + " -C, --skip-corrupt do NOT check indexes if their
> associated table is corrupt",
>
> Ditto. (I think the default be to check corrupt, and there can be an
> option to skip it.)

Likewise, I removed -c, --check-corrupt and made that the default.

> + " -a, --heapallindexed check index tuples against the
> table tuples",
> + " -A, --no-heapallindexed do NOT check index tuples
> against the table tuples",
>
> Ditto. (Not sure what the default should be, though.)

I removed -A, --no-heapallindexed and made that the default.

>
> + " -r, --rootdescend search from the root page for
> each index tuple",
> + " -R, --no-rootdescend do NOT search from the root
> page for each index tuple",
>
> Ditto. (Again, not sure about the default.)

I removed -R, --no-rootdescend and made that the default. Peter argued elsewhere for removing this altogether, but as I recall you argued against that, so for now I'm keeping the --rootdescend option.

> I'm also not sure if these descriptions are clear enough, but it may
> also be hard to do a good job in a brief space.

Yes. Better verbiage welcome.

> Still, comparing this
> to the documentation of heapallindexed makes me rather nervous. This
> is only trying to verify that the index contains all the tuples in the
> heap, not that the values in the heap and index tuples actually match.

This is complicated. The most reasonable approach from the point of view of somebody running pg_amcheck is to have the scan of the table and the scan of the index cooperate so that work is not duplicated. But from the point of view of amcheck (not pg_amcheck), there is no assumption that the table is being scanned just because the index is being checked. I'm not sure how best to resolve this, except that I'd rather punt this to a future version rather than require the first version of pg_amcheck to deal with it.

> +typedef struct
> +AmCheckSettings
> +{
> + char *dbname;
> + char *host;
> + char *port;
> + char *username;
> +} ConnectOptions;
>
> Making the struct name different from the type name seems not good,
> and the struct name also shouldn't be on a separate line.

Fixed.

> +typedef enum trivalue
> +{
> + TRI_DEFAULT,
> + TRI_NO,
> + TRI_YES
> +} trivalue;
>
> Ugh. It's not this patch's fault, but we really oughta move this to
> someplace more centralized.

Not changed in this patch.

> +typedef struct
> ...
> +} AmCheckSettings;
>
> I'm not sure I consider all of these things settings, "db" in
> particular. But maybe that's nitpicking.

It is definitely nitpicking, but I agree with it. This next patch uses a static variable named "conn" rather than "settings.db".

> +static void expand_schema_name_patterns(const SimpleStringList *patterns,
> +
> const SimpleOidList *exclude_oids,
> +
> SimpleOidList *oids
> +
> bool strict_names);
>
> This is copied from pg_dump, along with I think at least one other
> function from nearby. Unlike the trivalue case above, this would be
> the first duplication of this logic. Can we push this stuff into
> pgcommon, perhaps?

Yes, these functions were largely copied from pg_dump. I have moved them out of pg_dump and into fe_utils, but that was a large enough effort that it deserves its own thread, so I'm creating a thread for that work independent of this thread.

> + /*
> + * Default behaviors for user settable options. Note that these default
> + * to doing all the safe checks and none of the unsafe ones,
> on the theory
> + * that if a user says "pg_amcheck mydb" without specifying
> any additional
> + * options, we should check everything we know how to check without
> + * risking any backend aborts.
> + */
>
> This to me seems too conservative. The result is that by default we
> check only tables, not indexes. I don't think that's going to be what
> users want.

Checking indexes has been made the default, as discussed above.

> I don't know whether they want the heapallindexed or
> rootdescend behaviors for index checks, but I think they want their
> indexes checked. Happy to hear opinions from actual users on what they
> want; this is just me guessing that you've guessed wrong. :-)

The heapallindexed and rootdescend options still exist but are false by default.

> + if (settings.db == NULL)
> + {
> + pg_log_error("no connection to server after
> initial attempt");
> + exit(EXIT_BADCONN);
> + }
>
> I think this is documented as meaning out of memory, and reported that
> way elsewhere. Anyway I am going to keep complaining until there are
> no cases where we tell the user it broke without telling them what
> broke. Which means this bit is a problem too:
>
> + if (!settings.db)
> + {
> + pg_log_error("no connection to server");
> + exit(EXIT_BADCONN);
> + }
>
> Something went wrong, good luck figuring out what it was!

I have changed this to more closely follow the behavior in scripts/common.c:connectDatabase. If pg_amcheck were moved into src/bin/scripts, I could just use that function outright.

> + /*
> + * All information about corrupt indexes are returned via
> ereport, not as
> + * tuples. We want all the details to report if corruption exists.
> + */
> + PQsetErrorVerbosity(settings.db, PQERRORS_VERBOSE);
>
> Really? Why? If I need the source code file name, function name, and
> line number to figure out what went wrong, that is not a great sign
> for the quality of the error reports it produces.

Yeah, you are right about that. In any event, the user can now specifiy --verbose if they like and get that extra information (not that they need it). I have removed this offending bit of code.

> + /*
> + * The btree checking logic which optionally
> checks the contents
> + * of an index against the corresponding table
> has not yet been
> + * sufficiently hardened against corrupt
> tables. In particular,
> + * when called with heapallindexed true, it
> segfaults if the file
> + * backing the table relation has been
> erroneously unlinked. In
> + * any event, it seems unwise to reconcile an
> index against its
> + * table when we already know the table is corrupt.
> + */
> + old_heapallindexed = settings.heapallindexed;
> + if (corruptions)
> + settings.heapallindexed = false;
>
> This seems pretty lame to me. Even if the btree checker can't tolerate
> corruption to the extent that the heap checker does, seg faulting
> because of a missing file seems like a bug that we should just fix
> (and probably back-patch). I'm not very convinced by the decision to
> override the user's decision about heapallindexed either. Maybe I lack
> imagination, but that seems pretty arbitrary. Suppose there's a giant
> index which is missing entries for 5 million heap tuples and also
> there's 1 entry in the table which has an xmin that is less than the
> pg_clas.relfrozenxid value by 1. You are proposing that because I have
> the latter problem I don't want you to check for the former one. But
> I, John Q. Smartuser, do not want you to second-guess what I told you
> on the command line that I wanted. :-)

I've removed this bit. I'm not sure what I was seeing back when I first wrote this code, but I no longer see any segfaults for missing relation files.

> I think in general you're worrying too much about the possibility of
> this tool causing backend crashes. I think it's good that you wrote
> the heapcheck code in a way that's hardened against that, and I think
> we should try to harden other things as time permits. But I don't
> think that the remote possibility of a crash due to the lack of such
> hardening should dictate the design behavior of this tool. If the
> crash possibilities are not remote, then I think the solution is to
> fix them, rather than cutting out important checks.

Right. I've been worrying a bit less about this lately, in part because you and Peter are less concerned about it than I was, and in part because I've been banging away with various test cases and don't see all that much worth worrying about.

> It doesn't seem like great design to me that get_table_check_list()
> gets just the OID of the table itself, and then later if we decide to
> check the TOAST table we've got to run a separate query for each table
> we want to check to fetch the TOAST OID, when we could've just fetched
> both in get_table_check_list() by including two columns in the query
> rather than one and it would've been basically free. Imagine if some
> user wrote a query that fetched the primary key value for all their
> rows and then had their application run a separate query to fetch the
> entire contents of each of those rows, said contents consisting of one
> more integer. And then suppose they complained about performance. We'd
> tell them they were doing it wrong, and so here.

Good points. I've changed get_table_check_list to query both the main table and toast table oids as you suggest.

> + if (settings.db == NULL)
> + fatal("no connection on entry to check_table");
>
> Uninformative. Is this basically an Assert? If so maybe just make it
> one. If not maybe fail somewhere else with a better message?

Looking at this again, I don't think it is even worth making it into an Assert, so I just removed it, along with similar useless checks of the same type elsewhere.

>
> + if (startblock == NULL)
> + startblock = "NULL";
> + if (endblock == NULL)
> + endblock = "NULL";
>
> It seems like it would be more elegant to initialize
> settings.startblock and settings.endblock to "NULL." However, there's
> also a related problem, which is that the startblock and endblock
> values can be anything, and are interpolated with quoting. I don't
> think that it's good to ship a tool with SQL injection hazards built
> into it. I think that you should (a) check that these values are
> integers during argument parsing and error out if they are not and
> then (b) use either a prepared query or PQescapeLiteral() anyway.

I've changed the logic to use strtol to parse these, and I'm storing them as long rather than as strings.

> + stop = (on_error_stop) ? "true" : "false";
> + toast = (check_toast) ? "true" : "false";
>
> The parens aren't really needed here.

True. Removed.

> +
> printf("(relname=%s,blkno=%s,offnum=%s,attnum=%s)\n%s\n",
> + PQgetvalue(res, i, 0), /* relname */
> + PQgetvalue(res, i, 1), /* blkno */
> + PQgetvalue(res, i, 2), /* offnum */
> + PQgetvalue(res, i, 3), /* attnum */
> + PQgetvalue(res, i, 4)); /* msg */
>
> I am not quite sure how to format the output, but this looks like
> something designed by an engineer who knows too much about the topic.
> I suspect users won't find the use of things like "relname" and
> "blkno" too easy to understand. At least I think we should say
> "relation, block, offset, attribute" instead of "relname, blkno,
> offnum, attnum". I would probably drop the parenthesis and add spaces,
> so that you end up with something like:
>
> relation "%s", block "%s", offset "%s", attribute "%s":
>
> I would also define variant strings so that we entirely omit things
> that are NULL. e.g. have four strings:
>
> relation "%s":
> relation "%s", block "%s":(
> relation "%s", block "%s", offset "%s":
> relation "%s", block "%s", offset "%s", attribute "%s":
>
> Would it make it more readable if we indented the continuation line by
> four spaces or something?

I tried it that way and agree it looks better, including having the msg line indented four spaces. Changed.

> + corruption_cnt++;
> + printf("%s\n", error);
> + pfree(error);
>
> Seems like we could still print the relation name in this case, and
> that it would be a good idea to do so, in case it's not in the message
> that the server returns.

We don't know the relation name in this case, only the oid, but I agree that would be useful to have, so I added that.

> The general logic in this part of the code looks a bit strange to me.
> If ExecuteSqlQuery() returns PGRES_TUPLES_OK, we print out the details
> for each returned row. Otherwise, if error = true, we print the error.
> But, what if neither of those things are the case? Then we'd just
> print nothing despite having gotten back some weird response from the
> server. That actually can't happen, because ExecuteSqlQuery() always
> sets *error when the return code is not PGRES_TUPLES_OK, but you
> wouldn't know that from looking at this code.
>
> Honestly, as written, ExecSqlQuery() seems like kind of a waste. The
> OrDie() version is useful as a notational shorthand, but this version
> seems to add more confusion than clarity. It has only three callers:
> the ones in check_table() and check_indexes() have the problem
> described above, and the one in get_toast_oid() could just as well be
> using the OrDie() version. And also we should probably get rid of it
> entirely by fetching the toast OIDs the first time around, as
> mentioned above.

These functions have been factored out of pg_dump into fe_utils, so this bit of code review doesn't refer to anything now.

> check_indexes() lacks a function comment. It seems to have more or
> less the same problem as get_toast_oid() -- an extra query per table
> to get the list of indexes. I guess it has a better excuse: there
> could be lots of indexes per table, and we're fetching multiple
> columns of data for each one, whereas in the TOAST case we are issuing
> an extra query per table to fetch a single integer. But, couldn't we
> fetch information about all the indexes we want to check in one go,
> rather than fetching them separately for each table being checked? I'm
> not sure if that would create too much other complexity, but it seems
> like it would be quicker.

If the --skip-corrupt option is given, we need to only check the indexes associated with a table once the table has been found to be non-corrupt. Querying for all the indexes upfront, we'd need to keep information about which table the index came from, and check that against lists of tables that have been checked, etc. It seems pretty messy, even more so when considering the limited list facilities available to frontend code.

I have made no changes in this version, though I'm not rejecting your idea here. Maybe I'll think of a clean way to do this for a later patch?

> + if (settings.db == NULL)
> + fatal("no connection on entry to check_index");
> + if (idxname == NULL)
> + fatal("no index name on entry to check_index");
> + if (tblname == NULL)
> + fatal("no table name on entry to check_index");
>
> Again, probably these should be asserts, or if they're not, the error
> should be reported better and maybe elsewhere.
>
> Similarly in some other places, like expand_schema_name_patterns().

I removed these checks entirely.

> + * The loop below runs multiple SELECTs might sometimes result in
> + * duplicate entries in the Oid list, but we don't care.
>
> This is missing a which, like the place you copied it from, but the
> version in pg_dumpall.c is better.
>
> expand_table_name_patterns() should be reformatted to not gratuitously
> exceed 80 columns. Ditto for expand_index_name_patterns().

Refactoring into fe_utils, as mentioned above.

> I sort of expected that this patch might use threads to allow parallel
> checking - seems like it would be a useful feature.

Yes, I think that makes sense, but I'm going to work on that in the next patch.

> I originally intended to review the docs and regression tests in the
> same email as the patch itself, but this email has gotten rather long
> and taken rather longer to get together than I had hoped, so I'm going
> to stop here for now and come back to that stuff.

Attachment	Content-Type	Size
v3-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch	application/octet-stream	8.7 KB
v3-0002-Refactoring-ExecuteSqlQuery-and-related-functions.patch	application/octet-stream	46.7 KB
v3-0003-Creating-query_utils-frontend-utility.patch	application/octet-stream	10.4 KB
v3-0004-Adding-CurrentQueryHandler-logic.patch	application/octet-stream	5.2 KB
v3-0005-Refactoring-pg_dumpall-functions.patch	application/octet-stream	2.0 KB
v3-0006-Refactoring-expand_schema_name_patterns-and-frien.patch	application/octet-stream	6.6 KB
v3-0007-Moving-pg_dump-functions-to-new-file-option_utils.patch	application/octet-stream	17.0 KB
v3-0008-Normalizing-option_utils-interface.patch	application/octet-stream	14.5 KB
v3-0009-Adding-contrib-module-pg_amcheck.patch	application/octet-stream	80.1 KB

In response to

Re: new heapcheck contrib module at 2020-11-19 17:06:36 from Robert Haas

Responses

Re: new heapcheck contrib module at 2021-01-07 17:32:53 from Mark Dilger

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bharath Rupireddy	2021-01-07 07:15:13	Re: CheckpointLock needed in CreateCheckPoint()?
Previous Message	Masahiko Sawada	2021-01-07 06:53:52	Re: [PATCH] Feature improvement for CLOSE, FETCH, MOVE tab completion