From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Greg Smith <greg(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com> |
Subject: | Re: BBU Cache vs. spindles |
Date: | 2010-12-23 02:12:23 |
Message-ID: | 201012230212.oBN2CNs22947@momjian.us |
Lists: | pgsql-performance pgsql-www |
Bruce Momjian wrote:
> Greg Smith wrote:
> > Kevin Grittner wrote:
> > > I assume that we send a full
> > > 8K to the OS cache, and the file system writes disk sectors
> > > according to its own algorithm. With either platters or BBU cache,
> > > the data is persisted on fsync; why do you see a risk with one but
> > > not the other?
> >
> > I'd like a 10 minute argument please. I started to write something to
> > refute this, only to clarify in my head the sequence of events that
> > leads to the most questionable result, where I feel a bit less certain
> > than I did before of the safety here. Here is the worst case I believe
> > you're describing:
> >
> > 1) Transaction is written to the WAL and sync'd; client receives
> > COMMIT. Since full_page_writes is off, the data in the WAL consists
> > only of the delta of what changed on the page.
> > 2) 8K database page is written to OS cache
> > 3) PG calls fsync to force the database block out
> > 4) OS writes first 4K block of the change to the BBU write cache. Worst
> > case, this fills the cache, and it takes a moment for some random writes
> > to process before it has space to buffer again (makes this more likely
> > to happen, but it's not required to see the failure case here)
> > 5) Sudden power interruption, second half of the page write is lost
> > 6) Server restarts
> > 7) That 4K write is now replayed from the battery's cache
> >
> > At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
>
> Based on this report, I think we need to update our documentation and
> backpatch removal of text that says that BBU users can safely turn off
> full-page writes. Patch attached.
>
> I think we have fallen into a trap I remember from the late 1990s, where
> I was assuming that an 8k-block based file system would write to the
> disk atomically in 8k segments, which of course it cannot. My bet is
> that even if you write to the kernel in 8k pages, and have an 8k file
> system, the disk is still accessed via 512-byte blocks, even with a BBU.
Doc patch applied.
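
For anyone following the thread, here is a rough sketch of the sequence Greg
describes above. It is a toy model, not PostgreSQL code: the helper names are
made up for illustration, the 512-byte sector size is the guess from my earlier
message, and the "skip replay when the page LSN is already current" rule is a
simplified stand-in for how recovery decides whether a page needs a WAL record
applied. The point is only that a half-written page can carry the new LSN in
its first 4K while the second 4K still holds old data, so a delta-only record
does not repair it, whereas a full-page image does.

```python
import struct

PAGE_SIZE = 8192
SECTOR_SIZE = 512   # assumed device write unit
LSN_BYTES = 8       # toy header: the first 8 bytes hold the page LSN


def make_page(lsn: int, fill: bytes) -> bytes:
    """Build a toy 8K page: an 8-byte LSN followed by a one-byte fill pattern."""
    return struct.pack(">Q", lsn) + fill * (PAGE_SIZE - LSN_BYTES)


def page_lsn(page: bytes) -> int:
    """Read the LSN stored at the start of a toy page."""
    return struct.unpack(">Q", page[:LSN_BYTES])[0]


def apply_change(page: bytes, lsn: int, offset: int, data: bytes) -> bytes:
    """Overwrite `data` at `offset` and stamp the page with the new LSN."""
    patched = page[:offset] + data + page[offset + len(data):]
    return struct.pack(">Q", lsn) + patched[LSN_BYTES:]


def interrupted_write(old: bytes, new: bytes, sectors_persisted: int) -> bytes:
    """What is on disk if only the first N sectors of `new` were persisted."""
    cut = sectors_persisted * SECTOR_SIZE
    return new[:cut] + old[cut:]


def replay_delta(page: bytes, lsn: int, offset: int, data: bytes) -> bytes:
    """Delta-only replay (full_page_writes off): skipped if the page looks current."""
    if page_lsn(page) >= lsn:
        return page
    return apply_change(page, lsn, offset, data)


def replay_full_page_image(page: bytes, image: bytes) -> bytes:
    """Full-page replay (full_page_writes on): the WAL image replaces the page."""
    return image


# Steps 1-2: the change (LSN 200) lands in the second half of the page.
old_page = make_page(100, b"A")
new_page = apply_change(old_page, 200, 6000, b"B" * 100)

# Steps 4-7: only the first 4K (8 sectors) survives the power loss.
on_disk = interrupted_write(old_page, new_page, 8)

after_delta = replay_delta(on_disk, 200, 6000, b"B" * 100)
after_fpi = replay_full_page_image(on_disk, new_page)

print("torn after crash:        ", on_disk not in (old_page, new_page))
print("repaired by delta replay:", after_delta == new_page)
print("repaired by full page:   ", after_fpi == new_page)
```

In this model the torn page is stamped with LSN 200, so the delta replay is
skipped and the stale second half stays in place; only the full-page image
restores a consistent page.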
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachment | Content-Type | Size |
---|---|---|
/pgpatches/bbu | text/x-diff | 1.4 KB |