Re: Two-phase commit issues

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Two-phase commit issues
Date: 2005-05-19 06:30:15
Message-ID: Pine.OSF.4.61.0505190907400.219440@kosh.hut.fi
Lists: pgsql-hackers

On Wed, 18 May 2005, Tom Lane wrote:

> * The major missing issue that I've come across so far is that
> subtransaction and multixact state isn't preserved across a crash.
> Assuming that we want to store only top-level XIDs in the shared-memory
> list of prepared XIDs (which I think is important), it is essential that
> crash restart rebuild the pg_subxact status for prepared transactions.
> The subxacts of a prepared xact have to be seen as still running, and
> they won't be unless the subxact links are there. Since subxact.c is
> designed to wipe all its state on restart, we need to recreate those
> entries. Fortunately this doesn't seem hard: the state file for a
> prepared xact will include all of its subxact XIDs, and we can just
> do SubTransSetParent() on them while rereading the state file. (AFAICS
> it's sufficient to make each subxact link directly to the top XID, even
> if there was a more complex hierarchy originally.) Similarly, we've got
> to reconstruct MultiXactIds that any prepared xacts are members of, else
> row-level locks taken out by prepared xacts won't be enforced correctly.
> I think this can be handled if we add to the state files a list of all
> MultiXactIds that each prepared xact belongs to, and then during restart
> forcibly recreate those MultiXactIds. (They would only be rebuilt with
> prepared XIDs, not any ordinary XIDs that might originally have been
> members.) This seems to require some new code in multixact.c, but not
> anything fundamentally difficult --- Alvaro, do you see any likely
> problems in this stuff?

The subtransaction part is in fact there already, and it's done just like
you described. The RecoverPreparedTransactions function reads the subxids
from the state file and calls SubTransSetParent for each of them.
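
As a rough illustration of that step, here is a toy Python model. It is not
the patch code: the state-file layout, the dict standing in for the subxact
status, and the names rebuild_subtrans and top_level_xid are all made up for
this sketch. It only shows that linking each subxid directly to the
top-level XID is enough to answer "which top-level xact does this XID
belong to?".

```python
# Toy model only: every name here is hypothetical, not PostgreSQL source.

def rebuild_subtrans(state_files):
    """Return {subxid: top_xid}, linking each subxact straight to its top XID.

    The dict stands in for the subxact status, which is empty after restart.
    """
    parent = {}
    for state in state_files:
        for subxid in state["subxids"]:
            parent[subxid] = state["xid"]
    return parent


def top_level_xid(parent, xid):
    """Follow parent links until reaching an XID that has no parent."""
    while xid in parent:
        xid = parent[xid]
    return xid


# Two prepared transactions as they might be read back from state files.
state_files = [
    {"gid": "tx-a", "xid": 100, "subxids": [101, 102, 105]},
    {"gid": "tx-b", "xid": 110, "subxids": []},
]
parent = rebuild_subtrans(state_files)
```

The original nesting (say 105 under 102 under 100) is not needed for this
lookup, which is why the flat links suffice.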

As Alvaro pointed out elsewhere, the multixacts are harder because a
backend doesn't know which multixactids it belongs to. AFAICS, the most
straightforward solution is to xlog every CreateMultixact call, so that
the multixact slru files can be completely reconstructed on recovery.
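
A minimal sketch of that idea, again as a toy Python model rather than real
code: the record format, the MultiXactLog class and the replay function are
assumptions made for illustration. Every creation appends a record to the
log, and recovery rebuilds the membership table purely by replaying it.

```python
# Toy model only: record layout and names are hypothetical.

class MultiXactLog:
    def __init__(self):
        self.wal = []        # stands in for the write-ahead log
        self.members = {}    # stands in for the multixact slru files
        self.next_id = 1

    def create(self, xids):
        """Create a multixact and log its full member list."""
        mxid = self.next_id
        self.next_id += 1
        record = ("multixact_create", mxid, tuple(sorted(xids)))
        self.wal.append(record)           # log first
        self.members[mxid] = record[2]    # then update the in-memory table
        return mxid


def replay(wal):
    """Rebuild the membership table from the log alone."""
    members = {}
    for kind, mxid, xids in wal:
        if kind == "multixact_create":
            members[mxid] = xids
    return members


log = MultiXactLog()
first = log.create([100, 205])
second = log.create([100, 205, 310])

# Simulate a crash: throw away the in-memory table and replay the log.
recovered = replay(log.wal)
```

Because the member list is in the log record itself, no backend has to
remember which multixacts it belongs to.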

> * The patch is designed to dump state files into WAL as well as onto
> disk. Why? Wouldn't it be better just to write and fsync the state
> file before reporting successful prepare? That would get rid of the
> need for checkpoint-time fsyncs.

Performance and correctness. There mustn't be a valid state file on
disk before the WAL entries of that transaction are on disk. Otherwise,
recovery might resurrect a transaction that in fact aborted right after
it wrote the state file.

If we fsync the WAL prepare record first, and state file second, a crash
in between would make it impossible to recover the transaction though the
WAL says it's prepared.

Writing the complete state file into WAL saves us one fsync. The state
files are usually small, say under 1 kB, so the tradeoff of writing the
data twice to save one fsync is probably well worth it.
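
To make the three orderings concrete, here is a small Python model of what
recovery finds after a crash. It is a sketch under simplifying assumptions:
each step is treated as an atomic, durable write, and the scheme names and
the crash_outcome function are invented for this example.

```python
# Toy model only: scheme names and outcomes are hypothetical labels.

def crash_outcome(scheme, steps_completed):
    """Return what recovery sees if we crash after `steps_completed` steps."""
    plans = {
        "file_then_wal": ["file", "wal"],   # fsync state file, then WAL record
        "wal_then_file": ["wal", "file"],   # fsync WAL record, then state file
        "state_in_wal": ["wal+state"],      # one WAL record carrying the state
    }
    wal_prepared = False   # durable prepare record exists in WAL
    state_file = False     # durable, valid state file exists
    state_in_wal = False   # the prepare record carries the state data

    for step in plans[scheme][:steps_completed]:
        if step == "file":
            state_file = True
        elif step == "wal":
            wal_prepared = True
        else:
            wal_prepared = True
            state_in_wal = True

    state_available = state_file or state_in_wal
    if state_available and not wal_prepared:
        return "phantom"    # state file for a xact the WAL never prepared
    if wal_prepared and not state_available:
        return "lost"       # WAL says prepared, but nothing to recover from
    if wal_prepared:
        return "prepared"
    return "aborted"
```

In this model only the single-record scheme has no bad intermediate state:
a crash leaves the transaction either cleanly aborted or fully prepared.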

Third, we have to cater for PITR. I haven't given it much thought, but if
we want to do log shipping and PITR, I believe we must have everything in
the WAL.

> * I'm inclined to think that the "gid" identifiers for prepared
> transactions ought to be SQL identifiers (names), not string literals.
> Was there a particular reason for making them strings?

Sure, identifiers are fine; there was no particular reason. While you're at
it, do you think it's possible to make them unlimited in size? I couldn't
think of a simple way.

> * What are we going to do with GUC variables? My feeling is that
> the only sane answer is that PREPARE is the same as COMMIT as far as
> local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
> on GUC state. Otherwise it's really unclear what to do. Consider
> SET myvar = foo;
> BEGIN;
> SET myvar = bar;
> PREPARE gid;
> SHOW myvar; -- what do you see ... foo or bar?
> SET myvar = baz; -- is this even legal?
> ROLLBACK PREPARED gid;
> SHOW myvar; -- now what do you see ... foo or baz?
> Since local GUC changes aren't going to be saved/restored across a
> crash anyway, I can't see a point in doing anything really complex.
>
> * There are some fairly ugly cases associated with creation and deletion
> of temporary tables as well. I think we might want to just decree that
> you can't PREPARE a transaction that included creating or dropping a
> temp table. Does anyone have much of a problem with that?

I think the safest way to handle the GUC case as well is to just refuse to
prepare a transaction that has changed local GUC variables.

Another possibility is to rethink the contract of PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION put the backend into a
state where you can't do anything other than COMMIT/ROLLBACK the prepared
transaction, we could do more sensible things with GUC and temp tables.
That would have complications of its own, though. What would happen if
another backend then tried to COMMIT/ROLLBACK the transaction the original
backend is still tied to?

- Heikki
