Re: Perl modules for testing/viewing/corrupting/repairing your heap files

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Perl modules for testing/viewing/corrupting/repairing your heap files
Date: 2020-04-15 14:22:48
Message-ID: 913D6F73-8337-4FDA-B11E-EFFCA20E1A44@enterprisedb.com
Lists: pgsql-hackers

> On Apr 14, 2020, at 6:17 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, Apr 8, 2020 at 3:51 PM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> Recently, as part of testing something else, I had need of a tool to create
>> surgically precise corruption within heap pages. I wanted to make the
>> corruption from within TAP tests, so I wrote the tool as a set of perl modules.
>
> There is also pg_hexedit:
>
> https://github.com/petergeoghegan/pg_hexedit

I steered away from software released under the GPL, such as pg_hexedit, because of the difficulty of getting anything I develop accepted. (That's a hard enough problem without licensing issues.) I'm not taking a political stand for or against the GPL here, just the pragmatic position that I wouldn't be able to integrate pg_hexedit into a postgres submission.

(Thanks for writing pg_hexedit, BTW. I'm not criticizing it.)

The purpose of these perl modules is not the viewing of files, but the intentional and targeted corruption of files from within TAP tests. There are limited examples of tests in the postgres source tree that intentionally corrupt files, and as I read them, they employ a blunt force trauma approach:

In src/bin/pg_basebackup/t/010_pg_basebackup.pl:

> # induce corruption
> system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
> open $file, '+<', "$pgdata/$file_corrupt1";
> seek($file, $pageheader_size, 0);
> syswrite($file, "\0\0\0\0\0\0\0\0\0");
> close $file;
> system_or_bail 'pg_ctl', '-D', $pgdata, 'start';

In src/bin/pg_checksums/t/002_actions.pl:

> # Time to create some corruption
> open my $file, '+<', "$pgdata/$file_corrupted";
> seek($file, $pageheader_size, 0);
> syswrite($file, "\0\0\0\0\0\0\0\0\0");
> close $file;

These blunt force trauma tests are fine, as far as they go. But I wanted to be able to do things like

# Corrupt the tuple to look like it has lots of attributes, some of
# them null. This falsely creates the impression that the t_bits
# array is longer than just one byte, but t_hoff still says otherwise.
$tup->{HEAP_HASNULL} = 1;
$tup->{HEAP_NATTS_MASK} = 0x3FF;
$tup->{t_bits} = 0xAA;

or

# Same as above, but this time t_hoff plays along
$tup->{HEAP_HASNULL} = 1;
$tup->{HEAP_NATTS_MASK} = 0x3FF;
$tup->{t_bits} = 0xAA;
$tup->{t_hoff} = 32;

That's hard to do from a TAP test without modules like this, as you have to calculate by hand the offsets where you're going to write the corruption, and the bit pattern you are going to write to each location. Even if you do all that, nobody else is likely to be able to read or maintain your tests.
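
To illustrate the hand calculation involved, here is a rough sketch of what the first example might look like without any helper modules. It is not taken from the submitted modules. It assumes the usual on-disk heap layout (a 24-byte page header, 4-byte line pointers, t_infomask2 at byte 18 of the tuple header), a little-endian machine whose compiler places lp_off in the low 15 bits of the line pointer, and that assigning HEAP_NATTS_MASK means overwriting the attribute-count bits of t_infomask2. The function name corrupt_tuple_by_hand is hypothetical.

```perl
use strict;
use warnings;

# Layout constants: assumptions about the standard heap page format.
use constant {
    PAGE_HEADER_SIZE => 24,        # bytes before the line pointer array
    ITEMID_SIZE      => 4,         # one line pointer (ItemIdData)
    OFF_INFOMASK2    => 18,        # offsets within the tuple header
    OFF_INFOMASK     => 20,
    OFF_BITS         => 23,
    HEAP_HASNULL     => 0x0001,    # flag bit in t_infomask
    HEAP_NATTS_MASK  => 0x07FF,    # attribute-count bits in t_infomask2
};

# Hypothetical helper: takes a page image (a byte string) and a 0-based
# line pointer index, and returns a corrupted copy of the page image.
sub corrupt_tuple_by_hand
{
    my ($page, $lp_index) = @_;

    # Read the line pointer; lp_off is assumed to be its low 15 bits.
    my $lp_pos = PAGE_HEADER_SIZE + ITEMID_SIZE * $lp_index;
    my $itemid = unpack('V', substr($page, $lp_pos, ITEMID_SIZE));
    my $lp_off = $itemid & 0x7FFF;

    # t_infomask2 and t_infomask are adjacent little-endian uint16 fields.
    my ($infomask2, $infomask) =
      unpack('v v', substr($page, $lp_off + OFF_INFOMASK2, 4));

    # Claim nulls are present and claim 0x3FF attributes, while keeping
    # the flag bits of t_infomask2 that lie outside the natts mask.
    $infomask |= HEAP_HASNULL;
    $infomask2 = ($infomask2 & (0xFFFF ^ HEAP_NATTS_MASK)) | 0x3FF;

    # Write the fields back, then the first byte of t_bits.
    # t_hoff (byte 22) is deliberately left untouched.
    substr($page, $lp_off + OFF_INFOMASK2, 4) =
      pack('v v', $infomask2, $infomask);
    substr($page, $lp_off + OFF_BITS, 1) = pack('C', 0xAA);

    return $page;
}
```

Even this short version hard-codes five layout facts and two bit masks, and it still leaves out reading the block from the relation file and writing it back while the server is stopped.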

I'd like an easy way from within TAP tests to selectively corrupt files, to test whether various parts of the system fail gracefully in the presence of corruption. What happens when a child partition is corrupted? Does that impact queries that only access other partitions? What kinds of corruption cause pg_upgrade to fail? ...to expand the scope of the corruption? What happens to logical replication when there is corruption on the primary? ...on the standby? What kinds of corruption cause a query to return data from neighboring tuples that the querying role does not have permission to view? What happens when a NAS is only intermittently corrupt?

The modules I've submitted thus far are incomplete for this purpose. They don't yet handle toast tables, btree, hash, gist, gin, fsm, or vm, and I might be forgetting a few other things in the list. Before I go and implement all of that, I thought perhaps others would express preferences about how this should all work, even stuff like, "Don't bother implementing that in perl, as I'm reimplementing the entire testing structure in COBOL", or similarly unexpected feedback.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
