Re: corrupt pages detected by enabling checksums

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-05-13 13:49:22
Message-ID: 20130513134922.GB27618@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2013-05-13 08:45:41 -0500, Jon Nelson wrote:
> On Mon, May 13, 2013 at 8:32 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-05-12 19:41:26 -0500, Jon Nelson wrote:
> >> On Sun, May 12, 2013 at 3:46 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> >> > On 5/10/13 1:06 PM, Jeff Janes wrote:
> >> >>
> >> >> Of course the paranoid DBA could turn off restart_after_crash and do a
> >> >> manual investigation on every crash, but in that case the database would
> >> >> refuse to restart even in the case where it perfectly clear that all the
> >> >> following WAL belongs to the recycled file and not the current file.
> >> >
> >> >
> >> > Perhaps we should also allow for zeroing out WAL files before reuse (or just
> >> > disable reuse). I know there's a performance hit there, but the reuse idea
> >> > happened before we had bgWriter. Theoretically the overhead creating a new
> >> > file would always fall to bgWriter and therefore not be a big deal.
> >>
> >> For filesystems like btrfs, re-using a WAL file is suboptimal to
> >> simply creating a new one and removing the old one when it's no longer
> >> required. Using fallocate (or posix_fallocate) (I have a patch for
> >> that!) to create a new one is - by my tests - 28 times faster than the
> >> currently-used method.
> >
> > I don't think the comparison between just fallocate()ing and what we
> > currently do is fair. fallocate() doesn't guarantee that the file is the
> > same size after a crash, so you would still need an fsync() or we
> > couldn't use fdatasync() anymore. And I'd guess the benefits aren't all
> > that big anymore in that case?
>
> fallocate (16MB) + fsync is still almost certainly faster than
> write+write+write... + fsync.
> The test I performed at the time did exactly that .. posix_fallocate + pg_fsync.
Sure, the initial file creation will be faster. But are the actual
individual wal writes (small, frequently fdatasync()ed) still faster?
That's the critical path currently.
Whether it is pretty much depends on how the filesystem manages
allocated but not initialized blocks...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2013-05-13 13:52:55 Re: Re: [GENERAL] pg_upgrade fails, "mismatch of relation OID" - 9.1.9 to 9.2.4
Previous Message Mark Salter 2013-05-13 13:48:37 Re: lock support for aarch64