Re: Plug-pull testing worked, diskchecker.pl failed

From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Chris Angelico <rosuav(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Plug-pull testing worked, diskchecker.pl failed
Date: 2012-10-24 16:18:53
Message-ID: CAOR=d=3XFcVgMu9Eyd0nFehW1x=qzrewxDzPcC0MVy3J5MD6XQ@mail.gmail.com
Lists: pgsql-general

On Wed, Oct 24, 2012 at 8:04 AM, Chris Angelico <rosuav(at)gmail(dot)com> wrote:
> On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
>> On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav(at)gmail(dot)com> wrote:
>>> After reading the comments last week about SSDs, I did some testing of
>>> the ones we have at work - each of my test-boxes (three with SSDs, one
>>> with HDD) subjected to multiple stand-alone plug-pull tests, using
>>> pgbench to provide load. So far, there've been no instances of
>>> PostgreSQL data corruption, but diskchecker.pl reported huge numbers
>>> of errors.
>>
>> Try starting pgbench, then about halfway through the checkpoint_timeout
>> interval issue a manual checkpoint, and WHILE that checkpoint is still
>> running pull the plug.
>>
>> Then after bringing the server up (assuming pg starts up) see if
>> pg_dump generates any errors.
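
Roughly, that sequence could be driven by something like the sketch below
(not a polished tool; it assumes a scratch database named "bench" already
initialised with "pgbench -i bench", the default 5-minute
checkpoint_timeout, and someone standing by to pull the plug):

# Sketch of the checkpoint-in-flight plug-pull test described above.
import subprocess
import time

DB = "bench"           # hypothetical scratch database
HALF_INTERVAL = 150    # half of the default 300s checkpoint_timeout

# 1. Generate steady write load for ten minutes.
load = subprocess.Popen(["pgbench", "-c", "8", "-T", "600", DB])

# 2. Partway through the checkpoint interval, force a checkpoint.
time.sleep(HALF_INTERVAL)
subprocess.Popen(["psql", "-d", DB, "-c", "CHECKPOINT;"])

# 3. While that checkpoint is still running, pull the plug by hand.
print("CHECKPOINT issued; pull the plug now.")

# 4. After power-up and crash recovery, check that a full dump works:
#        pg_dump bench > /dev/null && echo "dump OK"
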
>
> Thanks for the tip. I've been flat-out at work these past few days and
> haven't gotten around to testing in the middle of a checkpoint, but I
> have done something that might also be of interest. It's inspired by a
> combination of diskchecker and pgbench: a harness that puts the
> database under load and retains a record of what's been done.
>
> In brief: Create a table with N (eg 100) rows, then spin as fast as
> possible, incrementing a counter against one random row and also
> incrementing the "Total" counter. When the database goes down, wait
> for it to come up again; when it does, check against the local copy of
> the counters and report any discrepancies.
>
> The code's written in Pike, using the same database connection logic
> that we use in our actual application (well, some of our code is C++
> and some is PHP, so this corresponds to one part of our app), so this
> is roughly representative of real usage.
>
> It's about a page or two of code: http://pastebin.com/UNTj642Y

Very cool. Nice little project.
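
For anyone following along who doesn't read Pike, the counter-tracking
loop described above comes out to roughly this in Python (a simplified
sketch, not a translation of the actual script: the table name,
connection string and error handling are invented, the "Total" counter
is left out, and it needs psycopg2):

import random
import time
import psycopg2

DSN = "dbname=bench"     # hypothetical connection string
NROWS = 100              # number of counter rows

def connect():
    # Retry until the server is reachable again (e.g. after the plug-pull).
    while True:
        try:
            return psycopg2.connect(DSN)
        except psycopg2.OperationalError:
            time.sleep(1)

conn = connect()
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS counters")
cur.execute("CREATE TABLE counters (id int PRIMARY KEY, n int DEFAULT 0)")
cur.execute("INSERT INTO counters (id) SELECT generate_series(0, %s)",
            (NROWS - 1,))
conn.commit()

local = [0] * NROWS      # counts we believe have been committed
while True:
    row = random.randrange(NROWS)
    try:
        cur.execute("UPDATE counters SET n = n + 1 WHERE id = %s", (row,))
        conn.commit()
        local[row] += 1  # count it only once the commit has returned
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        # Server went away: wait for it to come back, then compare.
        conn = connect()
        cur = conn.cursor()
        cur.execute("SELECT id, n FROM counters ORDER BY id")
        for rid, n in cur.fetchall():
            if n < local[rid]:
                print("row %d: committed %d, database has %d"
                      % (rid, local[rid], n))
            local[rid] = n   # resynchronise either way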

> Currently, all the key parameters (database connection info (which has
> been censored for the pastebin version), pool size, thread count, etc)
> are just variables visible in the script, which is simpler than parsing
> command-line arguments.
>
> Is this a useful and plausible testing methodology? It's definitely
> shown up some failures. On a hard-disk, all is well as long as the
> write-back cache is disabled; on the SSDs, I can't make them reliable.

Yes, it seems to be quite a good idea, actually.

> Is a single table enough to test for corruption with?

If it fails, definitely; if it passes, maybe.
