Re: Synchronous replication patch built on SR

From:	Boszormenyi Zoltan <zb(at)cybertec(dot)at>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject:	Re: Synchronous replication patch built on SR
Date:	2010-05-14 13:33:49
Message-ID:	4BED513D.3030507@cybertec.at
Lists:	pgsql-hackers

Fujii Masao wrote:
> 2010/4/29 Boszormenyi Zoltan <zb(at)cybertec(dot)at>:
>
>> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
>> I have updated it to today's CVS after the "wal_level" GUC went in.
>>
>
> I'm planning to create the synchronous replication patch for 9.0, too.
> My design is outlined in the wiki. Let's work together to do the design
> of it.
> http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability
>
> The log-shipping replication has some synchronization levels as follows.
> Which are you going to work on?
>
> The transaction commit on the master
> #1 doesn't wait for replication (already supported in 9.0)
> #2 waits for WAL to be received by the standby
> #3 waits for WAL to be received and flushed by the standby
> #4 waits for WAL to be received, flushed and replayed by the standby
> ..etc?
>
> I'm planning to add #2 and #3 into 9.1. #4 is useful but is outside
> the scope of my development for at least 9.1. In #4, a read-only query
> can easily block recovery through a lock conflict and make the
> transaction commit on the master get stuck. This problem is difficult
> to address within 9.1, I think. But the design and implementation
> of #2 and #3 need to be easily extensible to #4.
>
>
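
To make the four levels listed above concrete, here is a minimal sketch in Python. It is a model only, not PostgreSQL code. The names SyncLevel, StandbyProgress and commit_may_return are hypothetical, and WAL positions are simplified to plain integers.

```python
from dataclasses import dataclass
from enum import IntEnum


class SyncLevel(IntEnum):
    ASYNC = 1    # #1: commit does not wait for the standby
    RECEIVE = 2  # #2: wait until the standby has received the WAL
    FLUSH = 3    # #3: wait until the standby has flushed the WAL
    REPLAY = 4   # #4: wait until the standby has replayed the WAL


@dataclass
class StandbyProgress:
    """Hypothetical progress report; positions are plain integers here."""
    received: int
    flushed: int
    replayed: int


def commit_may_return(level, commit_position, progress):
    """Return True if a commit at commit_position may be reported to the client."""
    if level == SyncLevel.ASYNC:
        return True
    reached = {
        SyncLevel.RECEIVE: progress.received,
        SyncLevel.FLUSH: progress.flushed,
        SyncLevel.REPLAY: progress.replayed,
    }[level]
    return reached >= commit_position
```

With a standby that has received up to 300, flushed up to 200 and replayed up to 100, a commit at position 250 could return under #1 and #2 but would still wait under #3 and #4.
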
>> How does it work?
>>
>> First, the walreceiver and the walsender are now able to communicate
>> in a duplex way on the same connection, so while COPY OUT is
>> in progress from the primary server, the standby server is able to
>> issue PQputCopyData() to pass the transaction IDs that were seen
>> with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
>> signatures. I did this by adding a new protocol message type, with letter
>> 'x', that is only acknowledged by the walsender process. The regular
>> backend was intentionally left unchanged, so an SQL client gets a protocol
>> error. A new libpq call, PQsetDuplexCopy(), sends this
>> new message before sending START_REPLICATION. The primary
>> makes a note of it in the walsender process' entry.
>>
>> I had to move the TransactionIdLatest(xid, nchildren, children) call
>> that computes latestXid earlier in RecordTransactionCommit(), so
>> it's in the critical section now, just before the
>> XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata)
>> call. Otherwise, there was a race condition between the primary
>> and the standby server, where the standby server might have seen
>> the XLOG_XACT_COMMIT record for some XIDs before the
>> transaction in the primary server marked itself waiting for this XID,
>> resulting in stuck transactions.
>>
>
> You seem to have chosen #4 as synchronization level. Right?
>

Yes.

> In your design, the transaction commit on the master waits for its XID
> to be read from the XLOG_XACT_COMMIT record and reported back by the standby.
> Right? This design seems not to be extensible to #2 and #3 since
> walreceiver cannot read XID from the XLOG_XACT_COMMIT record.

Yes, this was my problem, too. I would have had to
implement a custom interpreter in walreceiver to
process the WAL records and extract the XIDs.

But at least the supporting details, i.e. not opening another
connection and instead doing duplex COPY operations in
a server-acknowledged way, are acceptable, no? :-)

> How about
> using LSN instead of XID? That is, the transaction commit waits until
> the standby has reached its LSN. LSN is easier for walreceiver
> and startup process to use, I think.
>

Indeed, using the LSN seems more appropriate for
the walreceiver, but how would you extract the information
that a certain LSN means a COMMITted transaction? Or
could we release a waiting transaction once the master receives
an LSN greater than or equal to the transaction's own LSN?
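
The second alternative can be sketched as follows. This is a Python model under stated assumptions, not the patch: CommitWaiters, wait_for and standby_reported are hypothetical names, and LSNs are plain integers. In this model the standby never needs to know which LSN belongs to a commit record. It reports a single position, and the master compares that position with the commit LSN each waiting backend registered.

```python
import heapq


class CommitWaiters:
    """Backends waiting for the standby to reach their commit LSN (illustrative)."""

    def __init__(self):
        self._heap = []  # min-heap of (commit_lsn, backend_id)

    def wait_for(self, commit_lsn, backend_id):
        """Register a backend that must wait until commit_lsn is reported."""
        heapq.heappush(self._heap, (commit_lsn, backend_id))

    def standby_reported(self, reported_lsn):
        """Release every waiter whose commit LSN is <= reported_lsn."""
        released = []
        while self._heap and self._heap[0][0] <= reported_lsn:
            released.append(heapq.heappop(self._heap)[1])
        return released
```

For example, with waiters registered at LSNs 100, 250 and 180, a report of 200 releases the waiters at 100 and 180, and the one at 250 stays blocked until a later report reaches it.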

Sending back all the LSNs in case of long transactions would
increase the network traffic compared to sending back only the
XIDs, but the amount is not clear to me. What I am more
worried about is the contention on the ProcArrayLock.
XIDs are rarer than LSNs, no?

> What if the "synchronous" standby starts up from a very old backup?
> Does the transaction on the master need to wait until a large amount of
> outstanding WAL has been applied? I think that synchronous replication
> should start with *asynchronous* replication, and should switch to the
> sync level after the gap between servers has become small enough.
> What's your opinion?
>

It's certainly one option, which I think is partly addressed
by the "strict_sync_replication" knob below.
If strict_sync_replication = off, then the master doesn't make
its transactions wait for the synchronous reports, and the standby(s)
can work through their WAL. IIRC, the walreceiver connects
to the master only very late in the recovery process, no?

It would be nicer if it could be made automatic. I simply thought
that there may be situations where the "strict" behaviour may be
desired. I was thinking about the transactions executed on the
master between the standby startup and walreceiver connection.
Someone may want to ensure the synchronous behaviour
for every xact, no matter the amount of time it needs. Someone
else will prefer synchronous behaviour whenever possible but
also ensure quick enough response time even if standbys aren't
started up yet. This dilemma calls for such a GUC; it cannot be
decided automatically.

>> I have added 3 new options, two GUCs in postgresql.conf and one
>> setting in recovery.conf. These options are:
>>
>> 1. min_sync_replication_clients = N
>>
>> where N is the number of reports for a given transaction before it's
>> released as committed synchronously. 0 means completely asynchronous;
>> the value is capped at max_wal_senders. Anything
>> strictly between 0 and max_wal_senders means different levels of partially
>> synchronous replication.
>>
>> 2. strict_sync_replication = boolean
>>
>> where the expected number of synchronous reports from standby
>> servers is further limited to the actual number of connected synchronous
>> standby servers if the value of this GUC is false. This means that if
>> no standby servers are connected yet then the replication is asynchronous
>> and transactions are allowed to finish without waiting for synchronous
>> reports. If the value of this GUC is true, then transactions wait until
>> enough synchronous standbys connect and report back.
>>
>
> Why are these options necessary?
>
> Can these options cover more than three synchronization levels?
>

I think I explained it in my mail.

If min_sync_replication_clients == 0, then the replication is async.
If min_sync_replication_clients == max_wal_senders then the
replication is fully synchronous.
If 0 < min_sync_replication_clients < max_wal_senders then
the replication is partially synchronous, i.e. the master can wait
for only, say, 50% of the standbys to report back before the commit is
considered synchronous and the relevant transactions get released from the wait.
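
As a rough illustration of how the two GUCs combine, here is a small Python model. It is a sketch of the rules as described in this mail, not code from the patch. The function names required_reports and replication_mode are hypothetical, and connected_sync_standbys stands for the number of synchronous standbys currently connected.

```python
def required_reports(min_sync_replication_clients, strict_sync_replication,
                     connected_sync_standbys, max_wal_senders):
    """Number of standby reports a commit waits for (illustrative model)."""
    # The setting cannot exceed max_wal_senders.
    wanted = min(min_sync_replication_clients, max_wal_senders)
    if not strict_sync_replication:
        # Non-strict: never wait for more standbys than are connected.
        wanted = min(wanted, connected_sync_standbys)
    return wanted


def replication_mode(min_sync_replication_clients, max_wal_senders):
    """Classify the configuration as described in the three cases above."""
    if min_sync_replication_clients == 0:
        return "async"
    if min_sync_replication_clients >= max_wal_senders:
        return "fully synchronous"
    return "partially synchronous"
```

With min_sync_replication_clients = 2, max_wal_senders = 4 and no standby connected yet, the non-strict setting yields 0 required reports, so commits proceed asynchronously. The strict setting yields 2, so commits wait until two standbys connect and report back.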

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics

----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/

In response to

Re: Synchronous replication patch built on SR at 2010-05-14 11:56:11 from Fujii Masao

Responses

Re: Synchronous replication patch built on SR at 2010-05-14 19:15:24 from Robert Haas
Re: Synchronous replication patch built on SR at 2010-05-18 11:30:46 from Fujii Masao
