From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Two-phase commit |
Date: | 2004-10-06 21:46:10 |
Message-ID: | 354.1097099170@sss.pgh.pa.us |
Lists: | pgsql-hackers pgsql-patches |
Quite some time ago, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I haven't received any comments and there hasn't been any discussion on
> the implementation, I suppose that nobody has given it a try. :(
I finally got around to taking a close look at this. There's a good bit
undone, as you well know, but it seems like it can be the basis for a
workable feature. I do have a few comments to make.
At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
think you have missed a bet in that it needs to be possible to issue
"COMMIT PREPARED gid" for the same gid several times without error.
Consider a scenario where the transaction monitor crashes during the
commit phase. When it recovers, it will be aware that it had committed
to commit, but it won't know which nodes were successfully committed.
So it will need to resend the COMMIT commands. It would be bad for the
nodes to simply say "yes boss" if they are told to COMMIT a gid they
have no record of. So I think the gids have to stick around after
COMMIT PREPARED or ROLLBACK PREPARED, and there needs to be a fourth
command (RELEASE PREPARED?) to actually remove the state data when the
transaction monitor is satisfied that everything's done. RELEASE of
an unknown gid is okay to be a no-op.
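To make the proposed semantics concrete, here is a minimal sketch of how a node might track gids under this scheme. It is an illustration only, not code from the patch: the `Node` class, the in-memory dict, and the choice to raise an error when COMMIT is requested for a gid that was already rolled back are all assumptions. RELEASE PREPARED is the command name floated above.

```python
from enum import Enum


class State(Enum):
    PREPARED = "prepared"
    COMMITTED = "committed"
    ROLLED_BACK = "rolled back"


class UnknownGid(Exception):
    """Raised when a command names a gid the node has no record of."""


class Node:
    """Hypothetical in-memory model of one node's prepared-transaction table."""

    def __init__(self):
        self.gids = {}

    def prepare(self, gid):
        if gid in self.gids:
            raise ValueError(f"gid {gid} is already in use")
        self.gids[gid] = State.PREPARED

    def _finish(self, gid, outcome):
        state = self.gids.get(gid)
        if state is None:
            # Never say "yes boss" for a gid we have no record of.
            raise UnknownGid(gid)
        if state not in (State.PREPARED, outcome):
            # Assumption: a conflicting outcome is an error, not a no-op.
            raise ValueError(f"gid {gid} is already {state.value}")
        # Repeating the same outcome is harmless, so a resend succeeds.
        self.gids[gid] = outcome

    def commit_prepared(self, gid):
        self._finish(gid, State.COMMITTED)

    def rollback_prepared(self, gid):
        self._finish(gid, State.ROLLED_BACK)

    def release_prepared(self, gid):
        # Releasing an unknown gid is a no-op.
        self.gids.pop(gid, None)
```

With this model, a transaction monitor that crashed mid-commit can resend COMMIT PREPARED to every node and then issue RELEASE PREPARED once all of them have answered, without having to know which nodes had already committed.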
Implementation-wise, I really dislike storing the info in a shared hash
table, because I don't see any reasonable bound on the size of the hash
table (your existing code uses 100, which is about as arbitrary as it
gets). Plus the actual content of each entry is not fixed-size either.
This is not very workable given our fixed-size shared memory mechanism.
The idea that occurs to me instead is to not use WAL or shared memory at
all for keeping the prepared-transaction state info. Instead, suppose
that we store the status information in a file named after the GID,
"$PGDATA/pg_twophase/gid". We could write the file with a CRC similarly
to what's done for pg_control. Once such a file is written and fsync'd,
it's as reliable as a WAL record would be, so it seems safe
enough to me to report the PREPARE as done. COMMIT, ROLLBACK, and the
pg_prepared_xacts system view would look into the pg_twophase directory
to find out all about active prepared transactions; RELEASE PREPARED
would simply delete the appropriate file. (Note: commit or rollback
would need to take the transaction XID from the GID file and then look
in pg_clog to find out whether the transaction was already committed. These
operations do not change the pg_twophase file, but they do write a
normal transaction-commit or -abort WAL record and update pg_clog.)
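The write-then-fsync idea can be sketched roughly as follows. This is not the backend implementation: the payload layout, the function names, and the use of zlib's CRC-32 (a stand-in for whatever CRC pg_control actually uses) are assumptions made for illustration. A real implementation would presumably also need to fsync the directory so that the new directory entry itself survives a crash; that step is omitted here.

```python
import os
import struct
import zlib


def write_state_file(directory, gid, payload):
    """Write payload plus a trailing CRC to directory/gid and force it to disk."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, str(gid))
    data = payload + struct.pack("<I", zlib.crc32(payload))
    # Mode "xb" fails if the file already exists, so a gid cannot be reused
    # while its state file is still around.
    with open(path, "xb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Only after fsync returns would it be safe to report PREPARE as done.
    return path


def read_state_file(path):
    """Return the payload, or raise ValueError if the CRC does not match."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError(f"{path}: file too short")
    payload = data[:-4]
    (stored_crc,) = struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) != stored_crc:
        raise ValueError(f"{path}: CRC mismatch")
    return payload


def release_state_file(path):
    """RELEASE PREPARED: delete the file; a missing file is a no-op."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

Listing the directory then gives the set of known gids, which is all the pg_prepared_xacts view would need to enumerate them.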
I think this would offer better performance as well as being more
scalable, because the implementation you have looks like it would have
some contention for the shared GID hashtable.
I would be inclined to require GIDs to be numbers (probably int8's)
instead of strings, so that we don't have any problems with funny
characters in the file names. That's negotiable though, as we could
certainly uuencode the strings or something to avoid that trap.
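If string GIDs were kept, the encoding step could look something like the sketch below. It uses plain hex encoding rather than uuencode, simply because hex output contains only the characters 0-9 and a-f; the function names are hypothetical, and the length limit a real file system would impose is not handled.

```python
def gid_to_filename(gid):
    """Map an arbitrary GID string to a file name made only of hex digits."""
    if not gid:
        raise ValueError("GID must not be empty")
    return gid.encode("utf-8").hex()


def filename_to_gid(name):
    """Invert gid_to_filename."""
    return bytes.fromhex(name).decode("utf-8")
```

A GID such as "../x" would then be stored under the name "2e2e2f78", so path separators and other funny characters never reach the file system.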
You were concerned about how to mark prepared transactions in pg_clog,
given that Alvaro had already commandeered state '11' for
subtransactions. Since only a toplevel transaction can be prepared,
it might work to allow state '11' with a zero pg_subtrans parent link
to mean a prepared transaction. This would imply factoring prepared
XIDs into GlobalXmin (so that pg_subtrans entries don't get recycled
too soon) but we probably have to do that anyway. AFAICS, prepared
but uncommitted XIDs have to be considered still InProgress, so if
they are less than GlobalXmin we'd lose.
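As a rough sketch of the pg_clog idea: a reader would look at the two-bit state and, for '11', consult the pg_subtrans parent link to tell the two cases apart. Only state '11' and the zero-parent convention come from the discussion above; the other three state codes, the function names, and the string results are assumed for illustration, and XID wraparound is ignored.

```python
IN_PROGRESS = 0b00      # assumed code
COMMITTED = 0b01        # assumed code
ABORTED = 0b10          # assumed code
SUB_OR_PREPARED = 0b11  # the state '11' discussed above
INVALID_XID = 0         # a zero parent link means "no parent"


def classify(clog_state, subtrans_parent):
    """Interpret a pg_clog state together with the pg_subtrans parent link."""
    if clog_state == SUB_OR_PREPARED:
        if subtrans_parent == INVALID_XID:
            return "prepared"       # toplevel transaction, prepared
        return "subcommitted"       # subtransaction; outcome follows the parent
    return {
        IN_PROGRESS: "in progress",
        COMMITTED: "committed",
        ABORTED: "aborted",
    }[clog_state]


def global_xmin(running_xids, prepared_xids):
    """Oldest XID still to be treated as InProgress, prepared XIDs included."""
    return min(list(running_xids) + list(prepared_xids), default=None)
```

For example, with running XIDs 120 and 130 and a prepared XID 95, `global_xmin` returns 95, which keeps the pg_subtrans entries needed for that prepared transaction from being recycled.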
regards, tom lane