Re: [CORE] Restore-reliability mode

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pgsql-core <pgsql-core(at)postgresql(dot)org>
Subject: Re: [CORE] Restore-reliability mode
Date: 2015-06-04 23:53:12
Message-ID: CAMsr+YEA_YwGMTeG-zGDNL1_RwxN4dJYQ9xQfg8np3CoF4bQ1A@mail.gmail.com
Lists: pgsql-hackers

On 4 June 2015 at 22:43, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Josh,
>
> * Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> > I would argue that if we delay 9.5 in order to do a 100% manual review
> > of code, without adding any new automated tests or other non-manual
> > tools for improving stability, then it's a waste of time; we might as
> > well just release the beta, and our users will find more issues than we
> > will. I am concerned that if we declare a cleanup period, especially in
> > the middle of the summer, all that will happen is that the project will
> > go to sleep for an extra three months.
>
> This is the exact same concern that I have. A delay just to have a
> delay is not useful. I completely agree that we need more automated
> testing, etc, though getting all of that set up and running could be
> done at any time too- there's no reason to wait, nor do I believe
> delaying 9.5 would make such automated testing appear.
>
>
In terms of specific testing improvements, things I think we need to have
covered and runnable on the buildfarm are:

* pg_dump and pg_restore testing (because it's scary we don't do this)
* WAL archiving based warm standby testing with promotion
* Two node streaming replication with promotion, both with a slot and with
archive fallback
* Three node cascading streaming replication with middle node promotion
then tail end node promotion
* Logical decoding streaming testing, comparing to expected decoded output
* DDL deparse test coverage for all operations
* pg_basebackup + start up from backup
* hard-kill the postmaster, start up from crashed datadir
* pg_start_backup, rsync, pg_stop_backup, start up in hot standby
* disk exhaustion tests both for pg_xlog and for the main datadir, showing
we can recover OK when disk is filled then space is freed
* Tests of crash recovery during various DDL operations

Obviously some of these overlap, so one test can cover more than one item.

Implementing these requires stepping outside the comfort zone of
pg_regress and the isolationtester and having something that can manage
multiple data directories. It's also hard to be sure you're testing the
same thing each time - for example, when using streaming replication with
archive fallback, it might be tricky to ensure that your replica falls
behind and falls back to WAL archive each time. There's always SIGSTOP I
guess.
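
As a rough illustration of the SIGSTOP idea, here is a minimal Python sketch. The helper name `paused` is hypothetical, and which process to pause (for example the standby's WAL receiver) and how to find its PID are assumptions left to the test harness; only the standard `os.kill` and `signal` calls are real. It is POSIX-only.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def paused(pid: int):
    """Freeze process `pid` for the duration of the with-block (hypothetical helper).

    SIGSTOP cannot be caught or ignored, so the target stops making
    progress until SIGCONT is delivered.
    """
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Always resume the process, even if the body raised.
        os.kill(pid, signal.SIGCONT)
```

A test could generate WAL on the upstream node inside `with paused(pid):` so that the frozen replica is guaranteed to be behind when it resumes. Whether it then falls back to the archive still depends on the upstream having recycled the needed segments, which this sketch does not arrange.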

While these are multi-node tests, at least in PostgreSQL we can just run on
different ports, so there's no need to muck about with containers or VMs.
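
To make the "different ports, different data directories" point concrete, here is a minimal Python sketch that only builds the command lines for a small cluster on one host. The `Node` class, `plan_cluster`, the node names, and the base port 5440 are all hypothetical; the `initdb`, `pg_ctl`, and `pg_basebackup` options used are the standard ones. Nothing is executed here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Node:
    """One PostgreSQL instance in a test cluster (hypothetical helper)."""
    name: str
    datadir: Path
    port: int

    def initdb_cmd(self) -> List[str]:
        return ["initdb", "-D", str(self.datadir)]

    def start_cmd(self) -> List[str]:
        # -o passes options through to the postmaster; -w waits for startup.
        return ["pg_ctl", "-D", str(self.datadir),
                "-o", f"-p {self.port}", "-w", "start"]

    def promote_cmd(self) -> List[str]:
        return ["pg_ctl", "-D", str(self.datadir), "promote"]

    def basebackup_from_cmd(self, upstream: "Node") -> List[str]:
        # Clone this node's data directory from a running upstream node.
        return ["pg_basebackup", "-D", str(self.datadir),
                "-h", "localhost", "-p", str(upstream.port), "-X", "stream"]


def plan_cluster(root: Path, names: List[str], base_port: int = 5440) -> List[Node]:
    """Give each node its own data directory and its own port."""
    return [Node(name, root / name, base_port + i)
            for i, name in enumerate(names)]
```

A harness would run each list with something like `subprocess.run(cmd, check=True)`. For the three-node cascading case, `plan_cluster(root, ["master", "middle", "tail"])` gives the tail node a base backup command pointing at the middle node's port.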

I already run some of these tests using Ansible for BDR, but I don't
imagine that'd be acceptable in core. It's Python, and it's not especially
well suited to use as a regression testing framework; it's just what I had
to hand and already needed for other automation tasks.

Is pg_tap a reasonable starting point for this sort of testing?

Am I missing obvious and important tests?

How would a test that would've caught the multixact issues look?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
