From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: pgsql-hackers(at)postgresql(dot)org, Zoltan Boszormenyi <zb(at)cybertec(dot)at>, hs(at)cybertec(dot)at
Subject: draft RFC: concept for partial, wal-based replication
Date: 2009-10-30 16:19:03
Message-ID: 4AEB11F7.1090600@cybertec.at
Lists: pgsql-hackers
hello ...
as my day has worked out quite nicely up to now, i thought to f... it up
and post a new concept which has been requested by a customer. the goal
is to collect some feedback, ideas and so on (not to be mixed up with
"flames").
we have funding for this and we are trying to sort out how to do it the
best way. comments are welcome ...
note, this is a first draft which i want to refine based on your comments.
here we go ...
Partial WAL Replication for PostgreSQL:
---------------------------------------
As of now the PostgreSQL community has provided patches and functionality
which allow full WAL-based replication as well as hot standby. To extend this
functionality and to make PostgreSQL even more suitable for "enterprise"
computing than it is today, we have the commitment of a sponsor to fund
partial replication for PostgreSQL 8.5 / 8.6.
This is the first draft of a proposal to make partial WAL-based replication
work and to provide an additional set of fancy features to the community,
which has been waiting for real in-core replication for a decade or more.
Why partial replication?
------------------------
In some cases people have master servers which contain enormous amounts of
data (XX TB or so). If more than just one replica of this data is needed, it
might happen that different slaves are used for different purposes. This
implies that not all data will be used by all machines.
An example: Consider a server at a phone company collecting phone calls,
billing data, and maybe network routing data. The data is used by different
departments, and one machine is not enough to serve all three of them. With
the new functionality proposed here we could create 3 replicas, each holding
just the group of tables needed for a specific task, thus allowing people to
buy cheaper hardware for the slaves and use more machines instead.
Current status:
---------------
Hot standby and streaming replication have been a huge leap forward for the
community, and what is proposed here will be an extension to those patches
and functionalities. This concept is NOT aimed at replacing anything - it is
mainly an add-on.
Nodes and replication filters:
------------------------------
As of 8.4, standby systems are set up by creating an archive_command along
with a base backup. Although this is easy to do, some users have still
reported difficulties due to a misunderstanding of PITR.
The idea is to add functionality to register slaves like this:
CREATE REPLICA node_name
CONNECT FROM SLAVE 'connect_string'
TRANSFER COMMAND 'command'
[ USING replication_filter ];
'command' would be any shell script copying data from the local master to the
new database node called node_name. Replication filters can be used to make
any number of replicas contain the same set of tables. Filter sets can be
created like this:
CREATE REPLICATION FILTER filter_name
[ EMPTY | FULL ] [ INCLUDE | EXCLUDE CHANGES ];
Replication filters can be modified ...
ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;
ALTER REPLICATION FILTER filter_name
{ ADD | REMOVE } { TABLE | INDEX | SEQUENCE } object;
Filter sets can be dropped like this ...
DROP REPLICATION FILTER filter_name;
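To illustrate how these commands could fit together, here is a minimal sketch
of a possible session based on the syntax proposed above. The filter, table,
index and node names as well as the transfer script are made up for
illustration only, and the exact semantics of the EMPTY / INCLUDE options are
still open:
-- build a filter set containing only the billing tables
CREATE REPLICATION FILTER billing_only EMPTY;
ALTER REPLICATION FILTER billing_only ADD TABLE billing;
ALTER REPLICATION FILTER billing_only ADD TABLE billing_items;
ALTER REPLICATION FILTER billing_only ADD INDEX billing_items_idx;
-- register a new slave which only receives objects matching the filter;
-- 'copy_to_billing_slave.sh' stands for a user-supplied shell script which
-- ships the base backup files to the new node
CREATE REPLICA billing_node
    CONNECT FROM SLAVE 'connect_string'
    TRANSFER COMMAND 'copy_to_billing_slave.sh'
    USING billing_only;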
Internally, CREATE REPLICA would initiate a base backup to the new slave
server, just like we would do it manually otherwise. The server would
automatically use the user-defined 'command' to copy one file after the other
to the slave box. The idea is basically stolen from archive_command and
friends. At this stage we either copy the entire instance, as we would do
with a normal base backup, or just what is needed (as defined by the
replication filter). Only data which is really needed on a slave, and which
matches its filter configuration, would be copied there. Once the copy is
done, we can register the new node inside a system table and commit the
transaction. Also, we can automatically create a useful recovery.conf setup -
we know how to connect from the slave to the master (we can use the
information given in ' CONNECT FROM SLAVE [ USING ] ' to write a proper
recovery.conf file).
Tables can easily be added to or removed from a replication filter with
ALTER REPLICATION FILTER.
Replicas can be removed easily:
DROP REPLICA node_name;
Why SQL to add a node? We are convinced that this is the simplest and most
intuitive way of doing things, and we believe it gives users a real feeling
of simplicity. The current way of doing base backups should stay in place as
it is - it has proven to be useful for countless tasks. However, it is not
suitable for managing 10 or more replicas easily, especially not when they
are not full-blown copies of the master.
Technical ideas:
----------------
System tables:
We suggest always replicating the entire system catalog. It would be a total
disaster to try some other implementation. The same applies to other tables -
we always replicate entire tables; no WHERE clauses are allowed when it comes
to replicating any table.
How can a query on the slave figure out whether a table is around? The slave
just has to know "who it is". Then it can easily look up, from the
replication filter it is using, whether a table is actually physically in
place or not. If a table is not in place, we can easily error out.
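As a rough sketch of the kind of lookup the slave could perform - the catalog
name pg_replication_filter_entry, its columns, and the setting used to tell
the node which filter it runs under are purely hypothetical and not part of
this proposal:
-- does the filter this node is using contain the table in question?
SELECT 1
  FROM pg_replication_filter_entry e
 WHERE e.filter_name = current_setting('replication.node_filter')
   AND e.object_type = 'TABLE'
   AND e.object_name = 'billing';
If no row comes back, the executor can raise an error telling the user that
the table is not replicated to this node.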
Remove a table from the slave:
This is not too hard; the master receives the command to remove a table from
the slave. We will send a request to remove all storage files related to the
table and adjust the replication filter to make sure that the slave will not
replay content for this table anymore.
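From the user's perspective, this would presumably be triggered by a single
filter change like the one below (names are again only illustrative); the
file removal and the adjustment of WAL replay happen behind the scenes:
-- the master requests removal of the table's storage files on the slave and
-- makes sure that WAL for this table is no longer replayed there
ALTER REPLICATION FILTER billing_only REMOVE TABLE call_records;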
Add a table to a slave:
This is slightly more tricky. We start collecting WAL for the table, stop
shipping WAL, use the TRANSFER COMMAND to copy the files related to the newly
added table, and resume recovery / WAL sending once the storage files are on
the slave.
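Again, the visible trigger would presumably be a single filter change
(illustrative names), with the WAL buffering, the file transfer and the
resumed replay handled internally:
-- 1. the master starts collecting WAL for network_routes
-- 2. WAL shipping to the slave is paused
-- 3. the TRANSFER COMMAND copies the table's storage files to the slave
-- 4. recovery / WAL sending resumes once the files are in place
ALTER REPLICATION FILTER billing_only ADD TABLE network_routes;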
Additional stuff:
Of course there are many more consistency considerations here. We cannot
replicate an index if the table is not present, etc.
many thanks,
hans
--
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de