Re: amcheck hardening

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: amcheck hardening
Date: 2021-03-13 19:29:49
Message-ID: 308614D2-1852-4659-A566-4C18DF8ECFE9@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mar 13, 2021, at 11:06 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Sat, Mar 13, 2021 at 10:35 AM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> The testing strategy I'm using is to corrupt heap and btree pages in schema "public" of the "regression" database created by `make installcheck`, by overwriting random bytes in randomly selected locations on pages after the page header. Then I run `pg_amcheck regression` and see if anything segfaults. Doing this repeatedly, with random bytes and locations within files not the same from one run to the next, I can find the locations of segfaults that are particularly common.
>
> That seems like a reasonable starting point to me.
>
>> The first common problem [1] happens at verify_nbtree.c:1422 when a corruption report is being generated. The generation does not seem entirely safe, and the problematic bit can be avoided
>
>> The temPointerGetBlockNumberNoCheck(tid) seems to be unsafe here.
>
> I think that the best way to further harden verify_nbtree.c at this
> point is to do the most basic validation of IndexTuples in our own new
> variant of a core bufpage.h macro: PageGetItemCareful(). In other
> words, we should invent the IndexTuple equivalent of the existing
> PageGetItemIdCareful() function (which already does the same thing for
> item pointers), and then replace all current PageGetItem() calls with
> PageGetItemCareful() calls -- we ban raw PageGetItem() calls from
> verify_nbtree.c forever.
>
> Here is some of the stuff that could go in PageGetItemCareful(), just
> off the top of my head:
>
> * The existing "if (tupsize != ItemIdGetLength(itemid))" check at the
> top of bt_target_page_check() -- make sure the length from the
> caller's line pointer agrees with IndexTupleSize().
>
> * Basic validation against the index's tuple descriptor -- in
> particular, that varlena headers are basically sane, and that the
> apparent range of datums is safely within the space on the page for
> the tuple.
>
> * Similarly, BTreeTupleGetHeapTID() should not be able to return a
> pointer that doesn't actually point somewhere inside the space that
> the target page has for the IndexTuple.
>
> * No external TOAST pointers, since this is an index AM, and so
> doesn't allow that.
>
> In general this kind of very basic validation should be pushed down to
> the lowest level code, so that it detects the problem as early as
> possible, before slightly higher level code has the opportunity to
> run. Higher level code is always going to be at risk of making
> assumptions about the data not being corrupt, because there is so much
> more of it, and also because it tends to roughly look like idiomatic
> AM code.

Excellent write-up! Thanks!

I'll work on this and get a patch set going if time allows.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Soumyadeep Chakraborty 2021-03-13 20:29:06 Re: PITR promote bug: Checkpointer writes to older timeline
Previous Message Noah Misch 2021-03-13 19:18:41 Re: pg_amcheck contrib application