Design for Synchronous Replication/ WAL Streaming

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Design for Synchronous Replication/ WAL Streaming
Date: 2008-08-19 03:13:40
Message-ID: 1219115620.5343.896.camel@ebony.2ndQuadrant
Lists: pgsql-hackers

For various reasons others have not been able to discuss detailed
designs in public. In an attempt to provide assistance with that I'm
providing my design notes here - not hugely detailed, just rough
sketches of how it can work. This may also help identify coordination
points and help to avert the code equivalent of a traffic jam later in
this release cycle.

Sync rep consists of 3 main parts:
* WAL sending
* WAL transmitting
* WAL receiving
WAL apply is essentially the same, so isn't discussed here.

WAL sending - would be achieved by having WAL writer issue calls to
transmit data. Individual backends would perform XLogInsert() to insert
a commit WAL record, then queue themselves up to wait for WAL writer to
perform the transmit up to the desired LSN (according to parameter
settings for synchronous_commit etc). The local WAL write and WAL
transmit would be performed together by the WAL writer, who would then
wake up backends once the log has been written as far as the requested
LSN. Very similar code to LWLocks, but queued on LSN rather than lock
arrival. Should be possible to keep the queue in strict LSN order to
avoid complexity on wake-up. This then provides a Group Commit feature
at the same time as ensuring efficient WAL transmit.

WAL transmit - network layer is handled by plugin, as suggested by
Itagaki/Koichi. Requirements are efficient transfer of WAL, similar
configurability to other aspects of Postgres, including security.
Various approaches are possible:
* direct connect using new protocol
* implement slight protocol changes into standard PostgreSQL client,
similar to COPY streaming, just with slightly different initiation.
Allows us to use same config, security options as now with postmaster
handling initial connection.
Plugin architecture allows integration with various vendor supplied
options. Hopefully Postgres gets working functionality as default.

WAL receiving - separate process on standby server. Started by an option
in recovery.conf to receive streaming WAL rather than use files.
Separation of Startup process from WALReceiver process required to
ensure fast response to incoming network packets without slowing down
WAL apply, which needs to go fast to keep up with the stream. The
WALReceiver process would receive WAL and write it both to the WAL
buffers and to disk in the normal WAL files. Data buffered in WAL
buffers allows the Startup process to read data within ReadRecord() from
shared memory rather than from files, so minimising changes required for
the Startup process. Writing to WAL buffers also allows addition of a WAL bgreader
process that can pre-fetch buffers required later for WAL apply. (That
was a point of discussion previously, but it's not a huge part of the
design and can be added as a performance feature fairly late, if we need
it.) Data is written to disk to ensure the standby node can restart from
last restartpoint if it should crash, re-reading all WAL files and then
beginning to receive WAL from the remote primary again. Files are written
and cleaned up in exactly the same way as on a normal server: keep the
last two restartpoints' worth of xlogs, then clean up at restartpoint time.

Integration point between this and Hot Standby is around postmaster
states and when the WALReceiver starts. That is the same time I expect
the bgwriter to start, so I will submit patch in next few days to get
that aspect sorted out.

If anybody is going to refactor xlog.c to avoid collisions, it had
better happen in next couple of weeks. Probably has to be Tom that does
this. Suggested splits:

* xlog stuff that happens in normal backends (some changes for WAL
streaming)
* recovery architecture stuff StartupXlog etc, checkpoints
* redo apply (major changes for WAL streaming)
* xlog rmgr stuff

Also need to consider how the primary node acts when the standby is not
available. Should it hang, waiting a certain time for recovery, or should
it continue to run in degraded mode? Probably another parameter.

Anyway, all of the above is a strawman design to assist everybody begin
to understand how this might all fit together. No doubt there are other
possible approaches. My personal concerns are that we minimise things
that prevent various developers from working alongside each other on
related features. So if the above design doesn't match what is being
worked on, then at least let's examine where the integration points are,
please. I hope and expect others are working on the WAL streaming design,
and if something occurs to prevent that then I will provide time, singly
or as part of a team, to ensure this happens for 8.4.

I'll be posting more design stuff over the next few weeks on Hot Standby
also.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
