Re: Worries about delayed-commit semantics

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Worries about delayed-commit semantics
Date: 2007-06-22 08:49:47
Message-ID: 1182502188.9276.106.camel@silverbirch.site
Lists: pgsql-hackers

On Thu, 2007-06-21 at 18:15 -0400, Tom Lane wrote:
> I've been reflecting a bit about whether the notion of deferred fsync
> for transaction commits is really safe. The proposed patch tries to
> ensure that no consequences of a committed transaction can reach disk
> before the commit WAL record is fsync'd, but ISTM there are potential
> holes in what it's doing. In particular the path that concerns me is
>
> (1) transaction A commits with deferred fsync;
>
> (2) transaction B observes some effect of A (eg, a committed-good tuple);
>
> (3) transaction B makes a change that is contingent on the observation.
>
> If B's changes were to reach disk in advance of A's commit record, we'd
> have a risk of logical inconsistency.

B's changes cannot reach disk before B's commit record. That is the
existing WAL-before-data rule implemented by the buffer manager.
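
As a rough illustration of the WAL-before-data rule mentioned above, here
is a toy model in Python. It is not PostgreSQL code: ToyWAL, ToyPage and
write_page are hypothetical names, and LSNs are modelled as simple record
counters. The only point it shows is that a page cannot be written until
the log has been flushed at least as far as the page's last WAL record.

```python
class ToyWAL:
    """Append-only log; an LSN is just the 1-based index of a record."""

    def __init__(self):
        self.records = []
        self.flushed_lsn = 0  # everything at or below this LSN is durable

    def append(self, record):
        self.records.append(record)
        return len(self.records)  # LSN of the record just appended

    def flush(self, upto_lsn):
        # A flush is sequential: it makes all records up to upto_lsn durable.
        target = min(upto_lsn, len(self.records))
        self.flushed_lsn = max(self.flushed_lsn, target)


class ToyPage:
    """A data page that remembers the LSN of the last change made to it."""

    def __init__(self):
        self.lsn = 0
        self.on_disk = False


def write_page(wal, page):
    # WAL-before-data: flush the log up to the page's LSN before the page
    # itself is allowed to reach disk.
    if wal.flushed_lsn < page.lsn:
        wal.flush(page.lsn)
    page.on_disk = True


wal = ToyWAL()
page = ToyPage()
page.lsn = wal.append("B: update tuple")
write_page(wal, page)
```

In this sketch the page write itself forces the log flush; the assumption
is only that some such interlock exists, not that it has this exact shape.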

If B can see A's changes, then A has already written a commit record to
the log, and that record is definitely before B's commit record. So when
B flushes WAL at end of transaction (EOX), that flush also makes A's
commit record durable. So whether A is a guaranteed transaction or not,
B can always rely on those changes.
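
The ordering argument above can be sketched with a toy log in Python. This
is an illustration under stated assumptions, not the patch's code: ToyLog
and run_scenario are hypothetical names, the log is a plain list, and a
flush is assumed to make every record up to the given position durable.

```python
class ToyLog:
    """Sequential log: flushing position N makes records 1..N durable."""

    def __init__(self):
        self.records = []
        self.flushed_upto = 0

    def append(self, record):
        self.records.append(record)
        return len(self.records)  # 1-based position of the new record

    def flush(self, position):
        self.flushed_upto = max(self.flushed_upto, position)

    def durable_records(self):
        # What would survive a crash right now.
        return self.records[:self.flushed_upto]


def run_scenario():
    log = ToyLog()
    # (1) A commits with deferred fsync: record appended, nothing flushed.
    a_commit = log.append("A: commit")
    # (2)+(3) B observes A's effect and makes a contingent change. B can
    # only have seen A as committed after A's commit record was appended.
    log.append("B: change contingent on A")
    b_commit = log.append("B: commit")
    # B commits normally and flushes the log through its own commit record.
    log.flush(b_commit)
    return log, a_commit, b_commit


log, a_commit, b_commit = run_scenario()
```

After run_scenario(), A's commit record is among the durable records
because it sits earlier in the same sequential log. If the crash happened
before B's flush instead, neither commit record would be durable, so the
two transactions are lost together rather than B surviving without A.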

I agree this feels unsafe when you first think about it, which is why I
took months before publishing the idea.

> The patch is doing what it can
> to prevent *direct* effects of A from reaching disk before the commit
> record does, but it doesn't (and I think cannot) extend this to indirect
> effects perpetrated by other transactions. An example of the sort of
> risk I'm worried about is a REINDEX omitting an index entry for a tuple
> that it sees as committed dead by A.
>
> Now this may be safe anyway, but it requires analysis that I don't
> recall anyone having put forward. The cases that I can see are:
>
> 1. Ordinary WAL-logged change in a shared buffer page. The change will
> not be allowed to reach disk before the associated WAL record does, and
> that WAL record must follow A's commit, so we're safe.
>
> 2. Non-WAL-logged change in a temp table. Could reach disk in advance
> of A's commit, but we don't care since temp table contents don't survive
> crashes anyway.
>
> 3. Non-WAL-logged change made via one of the paths we have introduced
> to avoid WAL overhead for bulk updates. In these cases it's entirely
> possible for the data to reach disk before A's commit, because B will
> fsync it down to disk without any sort of interlock, as soon as it
> finishes the bulk update. However, I believe it's the case that all
> these paths are designed to write data that no other transaction can see
> until after B commits. That commit must follow A's in the WAL log,
> so until it has reached disk, the contents of the bulk-updated file
> are unimportant after a crash.
>
> So I think it's probably all OK, but this is a sufficiently long chain
> of reasoning that it had better be checked over by multiple people and
> recorded as part of the design implications of the patch. Does anyone
> think any of this is wrong, or too fragile to survive future code
> changes? Are there cases I've missed?

I've done the analysis, but perhaps I should finish the docs now to aid
with review of the patch on the points you make.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
