Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
Date: 2009-07-07 11:51:27
Message-ID: 3f0b79eb0907070451n37811f28v5a6671c5f9886caa@mail.gmail.com
Lists: pgsql-hackers

Hi,

Thanks for the comment!

On Tue, Jul 7, 2009 at 5:07 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> pg_read_xlogfile() feels like a quite hacky way to implement that. Do we
> require the master to always have read access to the PITR archive? And
> indeed, to have a PITR archive configured to begin with. If you need to
> set up archiving just because of the standby server, how do old files
> that are no longer required by the standby get cleaned up?
>
> I feel that the master needs to explicitly know what is the oldest WAL
> file the standby might still need, and refrain from deleting files the
> standby might still need. IOW, keep enough history in pg_xlog. Then we
> have the risk of running out of disk space on pg_xlog if the connection
> to the standby is lost for a long time, so we'll need some cap on that,
> after which the master declares the standby as dead and deletes the old
> WAL anyway. Nevertheless, I think that would be much simpler to
> implement, and simpler for admins. And if the standby can read old WAL
> segments from the PITR archive, in addition to requesting them from the
> primary, it is just as safe.

I'm thinking of making pg_read_xlogfile() read the XLOG files from pg_xlog
when restore_command is not specified or returns a non-zero code (i.e.
failure). So, pg_read_xlogfile() with the following settings might already
cover the case you described.

- checkpoint_segments = N (big number)
- restore_command = ''

In this case, we can expect the XLOG files required by the standby to
exist in pg_xlog because checkpoint_segments is large, and
pg_read_xlogfile() reads them only from pg_xlog. checkpoint_segments
would act as the cap and determine the maximum disk size of
pg_xlog. The overflow files, which might no longer be required by the
standby, are removed safely by postgres. OTOH, if there is not enough
disk space for pg_xlog, we can specify restore_command and decrease
checkpoint_segments. This is a more flexible approach, I think.
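
To put a number on that cap: the WAL configuration documentation states
that there will normally be no more than
(2 + checkpoint_completion_target) * checkpoint_segments + 1 segment files,
each 16 MB by default. The sketch below only does that arithmetic. The
function names are made up for illustration, and rounding up to a whole
segment is an assumption chosen to stay on the conservative side.

```python
import math

# Default WAL segment size (16 MB); a different compile-time size would change this.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def max_pg_xlog_segments(checkpoint_segments, checkpoint_completion_target=0.5):
    """Usual upper bound on the number of segment files kept in pg_xlog."""
    # Rounding up is an assumption made here, not part of the documented formula.
    return math.ceil((2 + checkpoint_completion_target) * checkpoint_segments) + 1


def max_pg_xlog_bytes(checkpoint_segments, checkpoint_completion_target=0.5):
    """The same bound expressed in bytes."""
    segments = max_pg_xlog_segments(checkpoint_segments, checkpoint_completion_target)
    return segments * WAL_SEGMENT_BYTES


# Example: checkpoint_segments = 64 with the default completion target.
segments = max_pg_xlog_segments(64)
size_mib = max_pg_xlog_bytes(64) // (1024 * 1024)
print(segments, "segments,", size_mib, "MiB")
```

This is a "normally" bound, not a hard limit, so some headroom should be
left on the pg_xlog volume.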

But if the primary should never restore archived files, do I only need
to remove the code with which pg_read_xlogfile() restores them?

> I'd like to see a description of the proposed master/slave protocol for
> replication. If I understood correctly, you're proposing that the
> standby server connects to the master with libpq like any client,
> authenticates as usual, and then sends a message indicating that it
> wants to switch to "replication mode". In replication mode, normal FE/BE
> messages are not accepted, but there's a different set of message types
> for transferring XLOG data.

http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com
> I don't think we need or should
> allow running regular queries before entering "replication mode". the
> backend should become a walsender process directly after authentication.

I changed the protocol according to your suggestion.
Here is the current protocol:

On start-up, the standby calls PQstartReplication(), which is a new libpq
function. It sends the startup packet with a special code for replication
to the primary, like a cancel request. The backend which receives this
code becomes a walsender directly. Authentication is performed as
normal. Then, walsender switches the XLOG file and sends the
ReplicationStart message 'l', which includes the timeline ID and the
replication start XLOG position.

ReplicationStart (B)
Byte1('l'): Identifies the message as a replication-start indicator.
Int32(17): Length of message contents in bytes, including self.
Int32: The timeline ID
Int32: The start log file of replication
Int32: The start byte offset of replication

After that, walsender sends the XLogData message 'w', which includes
the XLOG records, the flag (e.g. indicating whether the records should
be fsynced or not), and the XLOG position, in real time. The standby
receives the message using PQgetXLogData(), which is a new libpq
function. OTOH, after writing or fsyncing the records, the standby
sends the XLogResponse message 'r', which includes the flag and the
position of the written/fsynced records, using PQputXLogRecPtr(),
which is a new libpq function.

XLogData (B)
Byte1('w'): Identifies the message as XLOG records.
Int32: Length of message contents in bytes, including self.
Int8: Flag bits indicating how the records should be treated.
Int32: The log file number of the records.
Int32: The byte offset of the records.
Byte n: The XLOG records.

XLogResponse (F)
Byte1('r'): Identifies the message as ACK for XLOG records.
Int32: Length of message contents in bytes, including self.
Int8: Flag bits indicating how the records were treated.
Int32: The log file number of the records.
Int32: The byte offset of the records.
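
The two layouts above can be sketched as a pair of encoders and a decoder.
This is only an illustration under stated assumptions: network byte order,
unsigned fields, a length that counts itself but not the type byte (the
convention of the existing FE/BE messages), and a made-up flag bit
FLAG_FSYNC, since the actual flag values are not listed here. Which
position the standby acknowledges is also left to the caller.

```python
import struct

FLAG_FSYNC = 0x01  # hypothetical flag bit, for illustration only

# Type byte, length, flag byte, log file number, byte offset.
_HEADER = struct.Struct("!cIBII")


def build_xlog_data(flags: int, log_file: int, offset: int, records: bytes) -> bytes:
    """Encode an XLogData ('w') message carrying raw XLOG records."""
    length = _HEADER.size - 1 + len(records)  # excludes the type byte (assumed)
    return _HEADER.pack(b"w", length, flags, log_file, offset) + records


def parse_xlog_data(buf: bytes):
    """Decode an XLogData message into (flags, log_file, offset, records)."""
    msg_type, length, flags, log_file, offset = _HEADER.unpack_from(buf)
    if msg_type != b"w":
        raise ValueError("not an XLogData message: %r" % msg_type)
    if length != len(buf) - 1:
        raise ValueError("length field does not match message size")
    return flags, log_file, offset, buf[_HEADER.size:]


def build_xlog_response(flags: int, log_file: int, offset: int) -> bytes:
    """Encode an XLogResponse ('r') message acknowledging a position."""
    return _HEADER.pack(b"r", _HEADER.size - 1, flags, log_file, offset)


# Example round trip: the primary sends 32 bytes of records, the standby acks.
msg = build_xlog_data(FLAG_FSYNC, 0, 0x2000000, b"\x00" * 32)
flags, log_file, offset, records = parse_xlog_data(msg)
ack = build_xlog_response(flags, log_file, offset)
```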

On normal exit (e.g. by smart shutdown), walsender sends the
ReplicationEnd message 'z'. OTOH, on normal exit, walreceiver
sends the existing Terminate message 'X'.

The above protocol is used between walsender and walreceiver.

> I'd like to see a more formal description of that protocol and the new
> message types. Some examples of how they would be in different
> scenarios, like when standby server connects to the master for the first
> time and needs to catch up.

If an XLOG file required for recovery is missing, the startup
process connects to the primary as a normal client and receives
the binary contents of the file by using the following SQL.
This is independent of the above protocol, so the transfer of
missing files and synchronous XLOG streaming are performed
concurrently.

COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY

If no missing files are found (i.e. recovery of the standby has
reached the replication start position), the file transfer is no
longer used.
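
For illustration, here is a small sketch of how the statement above could
be built for a given XLOG position. It assumes that 'filename' is an
ordinary WAL segment file name (24 hex digits: timeline ID, log file
number, segment number) and that segments are the default 16 MB; the
helper names are hypothetical, and pg_read_xlogfile() is the function
proposed in this patch, not an existing one.

```python
import re

XLOG_SEG_SIZE = 16 * 1024 * 1024  # default segment size; an assumption


def xlog_file_name(timeline_id: int, log_file: int, byte_offset: int) -> str:
    """Name of the WAL segment file containing the given position."""
    segment = byte_offset // XLOG_SEG_SIZE
    return "%08X%08X%08X" % (timeline_id, log_file, segment)


def missing_file_query(file_name: str) -> str:
    """Build the COPY statement used to fetch one missing file."""
    # Accept only plain segment names so nothing else ends up in the SQL text.
    if not re.fullmatch(r"[0-9A-F]{24}", file_name):
        raise ValueError("unexpected WAL file name: %r" % file_name)
    return ("COPY (SELECT pg_read_xlogfile('%s', true)) "
            "TO STDOUT WITH BINARY" % file_name)


# Example: timeline 1, log file 0, byte offset 0x2000000 (third segment).
name = xlog_file_name(1, 0, 0x2000000)
query = missing_file_query(name)
```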

> Looking at the patch briefly, it seems to assume that there is only one
> WAL sender active at any time. What happens when a new WAL sender
> connects and one is active already?

The new request is refused because a walsender already exists.

> While supporting multiple slaves
> isn't a priority, I think we should support multiple WAL senders right
> from the start. It shouldn't be much harder, and otherwise we need to
> ensure that the switch from old WAL sender to a new one is clean, which
> seems non-trivial. Or not accept a new WAL sender while old one is still
> active,

Yeah, the current patch doesn't accept a new walsender while the old
one is still active.

> but then a dead WAL sender process (because the standby suddenly
> crashed, for example) would inhibit a new standby from connecting,
> possibly for several minutes.

Yes, a new standby cannot start walsender until the old walsender detects
the death of the old standby. You can shorten the detection time by setting
some timeouts (replication_timeout and some keepalive parameters).
I don't think it's a problem that walsender cannot start for a short
time. Do you think that walsender must *always* be able to start?
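
To make the behaviour concrete, here is a toy model of a single walsender
slot guarded by a timeout. It is only a sketch of the idea, not the code
in the patch: the class and method names are hypothetical, and it assumes
the old walsender is considered dead once nothing has been heard from the
standby for replication_timeout seconds.

```python
class WalSenderSlot:
    """Toy model: at most one walsender, freed after a period of silence."""

    def __init__(self, replication_timeout: float):
        self.replication_timeout = replication_timeout
        self.last_activity = None  # None means no walsender is running

    def note_activity(self, now: float) -> None:
        """Record that the standby was heard from at time `now`."""
        self.last_activity = now

    def release(self) -> None:
        """Normal exit of the walsender frees the slot immediately."""
        self.last_activity = None

    def try_start(self, now: float) -> bool:
        """Accept a new walsender unless a live one still holds the slot."""
        busy = (self.last_activity is not None
                and now - self.last_activity < self.replication_timeout)
        if busy:
            return False  # refused: the existing walsender is not yet known dead
        self.last_activity = now
        return True


# Example with replication_timeout = 30 seconds.
slot = WalSenderSlot(30)
first = slot.try_start(now=0)          # accepted, the slot was free
too_early = slot.try_start(now=10)     # refused, old walsender still assumed alive
after_timeout = slot.try_start(now=30) # accepted, the old standby timed out
```

With a shorter timeout the refusal window shrinks accordingly.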

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
