From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [PATCH 16/16] current version of the design document
Date: 2012-06-13 11:28:47
Message-ID: 1339586927-13156-16-git-send-email-andres@2ndquadrant.com
Lists: pgsql-hackers
From: Andres Freund <andres(at)anarazel(dot)de>
---
src/backend/replication/logical/DESIGN | 209 ++++++++++++++++++++++++++++++++
1 file changed, 209 insertions(+)
create mode 100644 src/backend/replication/logical/DESIGN
diff --git a/src/backend/replication/logical/DESIGN b/src/backend/replication/logical/DESIGN
new file mode 100644
index 0000000..2cf08ff
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN
@@ -0,0 +1,209 @@
+=== Design goals for logical replication ===
+- in core
+- fast
+- async
+- robust
+- multi-master
+- modular
+- as unintrusive as possible, implementation-wise
+- basis for other technologies (sharding, replication into other DBMSs, ...)
+
+For reasons why we think this is an important set of features please check out
+the presentation from the in-core replication summit at pgcon:
+http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf
+
+While you may argue that most of the above design goals are already provided by
+various trigger-based replication solutions like Londiste or Slony, we think
+that that is not enough, for several reasons:
+
+- not in core (and thus less trustworthy)
+- duplication of writes due to an additional log
+- performance in general (check the end of the above presentation)
+- complex to use because there is no native administration interface
+
+We want to emphasize that this proposed architecture is based on the experience
+of developing a minimal prototype with the above goals in mind. While we
+obviously hope that a good part of it is reusable for the community, we
+definitely do *not* expect the community to accept this as-is. It is intended
+to be the basis upon which we, the community, can design and build future
+logical replication.
+
+=== Basic architecture ===
+Very broadly speaking there are several major pieces common to most approaches
+to replication:
+1. Source data generation
+2. Transportation of that data
+3. Applying the changes
+4. Conflict resolution
+
+
+1.:
+
+We need a change stream that contains all required changes in the correct
+order. The requirement for this stream to reflect changes across multiple
+concurrent backends raises concurrency and scalability issues. Reusing the WAL
+stream for this seems a good choice since it is needed anyway and already
+addresses those issues, and it further means that we do not duplicate the
+writes and locks already performed for its maintenance. Any other
+stream-generating component would introduce additional scalability issues.
+
+Unfortunately, in this case, the WAL is mostly a physical representation of the
+changes and thus does not, by itself, contain the necessary information in a
+convenient format to create logical changesets.
+
+The biggest problem is that interpreting tuples in the WAL stream requires an
+up-to-date system catalog and needs to be done in a compatible backend and
+architecture. The requirement of an up-to-date catalog could be solved by
+adding more data to the WAL stream, but that would likely require relatively
+intrusive and complex changes. Instead we chose to require a synchronized
+catalog at the decoding site. That adds some complexity to use cases like
+replicating into a different database or cross-version replication. For those
+it is relatively straightforward to develop a proxy pg instance that only
+contains the catalog and does the transformation to textual changes.
+
+This also is the solution to the other big problem, the need to work around
+architecture- and version-specific binary formats. The alternative of always
+producing cross-version, cross-architecture compatible binary changes, or even
+textual changes, seems prohibitively expensive, both from a CPU and storage
+point of view and in terms of implementation effort.
+
+The catalog on the site where changes originate can *not* be used for the
+decoding because at the time we decode the WAL the catalog may have changed
+from the state it was in when the WAL was generated. A possible solution would
+be a fully versioned catalog, but that again seems rather complex and
+intrusive.
+
+For some operations (UPDATE, DELETE) and corner cases (e.g. full page writes)
+additional data needs to be logged, but the additional amount of data isn't
+that big. Requiring a primary key for any change other than INSERT seems
+sensible for now. The required changes are fully contained in heapam.c and are
+pretty simple so far.
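+
+As a purely illustrative sketch (the struct and field names below are made up;
+they are *not* the structures used by this patch series), the extra
+information an UPDATE or DELETE record has to carry boils down to something
+like:
+
+    #include <stdint.h>
+
+    /*
+     * Hypothetical illustration only: the additional, logical-decoding
+     * specific payload of an UPDATE/DELETE record, so that the old row can
+     * later be identified from the WAL plus the synchronized catalog alone.
+     */
+    typedef struct LogicalOldKey
+    {
+        uint16_t    natts;      /* number of primary key columns */
+        /* the old primary key column values follow, in their on-disk format */
+    } LogicalOldKey;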
+
+2.:
+
+For transport of the non-decoded data from the originating site to the decoding
+site we decided to reuse the infrastructure already provided by
+walsender/walreceiver. We introduced a new command, START_LOGICAL_REPLICATION,
+analogous to START_REPLICATION, which streams out all xlog records that pass
+through a filter.
+
+The on-the-wire format stays the same. The filter currently simply drops all
+records which are not interesting for logical replication (indexes, freezing,
+...) and records that did not originate on the local system.
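+
+To illustrate how a consumer is supposed to use this, here is a minimal sketch
+of a client built on plain libpq (the connection string, the start-position
+argument and the omission of any feedback/keepalive handling are simplifying
+assumptions, not something this document fixes):
+
+    #include <stdio.h>
+    #include <libpq-fe.h>
+
+    int
+    main(void)
+    {
+        PGresult   *res;
+        PGconn     *conn;
+
+        /* open a replication connection, as pg_receivexlog does */
+        conn = PQconnectdb("host=localhost user=repl replication=true");
+        if (PQstatus(conn) != CONNECTION_OK)
+        {
+            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
+            return 1;
+        }
+
+        /* start streaming; the exact argument syntax here is a guess */
+        res = PQexec(conn, "START_LOGICAL_REPLICATION 0/0");
+        if (PQresultStatus(res) != PGRES_COPY_BOTH)
+        {
+            fprintf(stderr, "could not start: %s", PQerrorMessage(conn));
+            PQclear(res);
+            PQfinish(conn);
+            return 1;
+        }
+        PQclear(res);
+
+        /* consume the filtered xlog records sent as CopyData messages */
+        for (;;)
+        {
+            char   *buf;
+            int     len = PQgetCopyData(conn, &buf, 0);
+
+            if (len < 0)
+                break;          /* stream ended or connection was lost */
+            printf("received %d bytes of xlog data\n", len);
+            PQfreemem(buf);
+        }
+
+        PQfinish(conn);
+        return 0;
+    }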
+
+The requirement of filtering by the 'origin' of a WAL record comes from the
+planned multi-master support. Changes replayed locally that originate from
+another site must not be replayed there again; using the WAL plainly, without
+such a filter, would cause loops. Instead we tag every WAL record with the
+"node id" of the site that caused the change to happen, and changes carrying a
+node's own "node id" won't get applied again.
+
+Currently, filtered records simply get replaced by NOOP records and runs of
+zeroes, which obviously is not a sensible solution. The difficulty with
+actually removing the records is that doing so would change the LSNs, which we
+currently rely on.
+
+The filtering might very well be expanded to support partial replication and
+similar features in the future.
+
+
+3.:
+
+To sensibly apply changes out of the WAL stream we need to solve two things:
+Reassemble transactions and apply them to the target database.
+
+The logical stream from 1. via 2. consists of individual changes identified by
+the relfilenode of the table and the xid of the transaction. Given
+transactions, subtransactions, rollbacks, crash recovery and the like, those
+changes obviously cannot be applied individually without completely losing any
+pretense of consistency. To solve that we introduced a module, dubbed
+ApplyCache, which does the reassembly. This module is *independent* of the
+data source and of the method of applying changes, so it can be reused for
+replicating into a foreign system or similar.
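+
+To make the reassembly idea concrete, here is a heavily simplified toy model
+(illustration only; all names are made up, and the real module additionally
+has to handle subtransactions, spilling large transactions to disk, memory
+management and more):
+
+    #include <stdio.h>
+
+    typedef unsigned int TransactionId;
+
+    /* the three callbacks an apply/output module plugs in */
+    typedef struct ApplyCallbacks
+    {
+        void (*begin)(TransactionId xid);
+        void (*apply_change)(TransactionId xid, const char *change);
+        void (*commit)(TransactionId xid);
+    } ApplyCallbacks;
+
+    #define MAX_TXNS    8
+    #define MAX_CHANGES 16
+
+    typedef struct TxnBuffer
+    {
+        TransactionId xid;                  /* 0 means "slot unused" */
+        int           nchanges;
+        char          changes[MAX_CHANGES][64];
+    } TxnBuffer;
+
+    static TxnBuffer txns[MAX_TXNS];        /* in-progress transactions */
+
+    /* called for every decoded change, in WAL order, xids interleaved */
+    static void
+    add_change(TransactionId xid, const char *change)
+    {
+        int     i, freeslot = -1;
+
+        for (i = 0; i < MAX_TXNS; i++)
+        {
+            if (txns[i].xid == xid)
+                break;
+            if (txns[i].xid == 0 && freeslot < 0)
+                freeslot = i;
+        }
+        if (i == MAX_TXNS)                  /* first change of this xid */
+        {
+            if (freeslot < 0)
+                return;                     /* toy model: no eviction */
+            i = freeslot;
+            txns[i].xid = xid;
+        }
+        if (txns[i].nchanges < MAX_CHANGES)
+            snprintf(txns[i].changes[txns[i].nchanges++], 64, "%s", change);
+    }
+
+    /* called once the commit record of a transaction is decoded */
+    static void
+    commit_txn(TransactionId xid, const ApplyCallbacks *cb)
+    {
+        int     i, j;
+
+        for (i = 0; i < MAX_TXNS; i++)
+        {
+            if (txns[i].xid != xid)
+                continue;
+            cb->begin(xid);
+            for (j = 0; j < txns[i].nchanges; j++)
+                cb->apply_change(xid, txns[i].changes[j]);
+            cb->commit(xid);
+            txns[i].xid = 0;                /* release the slot */
+            txns[i].nchanges = 0;
+            return;
+        }
+    }
+
+    /* a trivial "apply module" that just prints what it would do */
+    static void print_begin(TransactionId xid) { printf("BEGIN %u\n", xid); }
+    static void print_change(TransactionId xid, const char *c)
+    {
+        printf("  xid %u: %s\n", xid, c);
+    }
+    static void print_commit(TransactionId xid) { printf("COMMIT %u\n", xid); }
+
+    int
+    main(void)
+    {
+        ApplyCallbacks cb = {print_begin, print_change, print_commit};
+
+        add_change(100, "INSERT INTO a ...");
+        add_change(101, "UPDATE b ...");
+        add_change(100, "DELETE FROM a ...");
+        commit_txn(101, &cb);               /* xid 101 commits first */
+        commit_txn(100, &cb);
+        return 0;
+    }
+
+Feeding it the interleaved changes of two transactions makes it replay them
+strictly in commit order (xid 101 before xid 100), which is the property the
+apply side needs.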
+
+Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
+we benchmarked!) we decided against statement generation for apply. Even when
+using prepared statements the overhead is rather noticeable.
+
+Instead we decided to use relatively low-level heapam.h/genam.h accesses to do
+the apply. For now we decided to use only one process to do the applying;
+parallelizing that seems too complex for the introduction of an already complex
+feature.
+In our tests the apply process could keep up with the changes generated by
+pgbench -c/-j 20+. This will obviously depend heavily on the workload; a fully
+seek-bound workload will definitely not scale that well.
+
+Just to reiterate: plugging in another method to do the apply should be a
+relatively simple matter of providing a different set of the three callbacks
+(begin, apply_change, commit), as in the toy sketch above.
+
+Another complexity in this is how to synchronize the catalogs. We plan to use
+command/event triggers and the oid-preserving features from pg_upgrade to keep
+the catalogs in sync. We have not started working on that yet.
+
+
+4.:
+
+While we have started to think about conflict resolution/avoidance, we have not
+started to work on it. We currently *cannot* handle conflicts. We think that
+the base features/architecture should be agreed upon before starting with it.
+
+Multi-master tests were done with sequences set up with INCREMENT 2 and
+different start values on the two nodes.
+
+=== Current Prototype ===
+
+The current prototype consists of a series of patches that are split into
+hopefully sensible and coherent parts to make reviewing of individual parts
+possible.
+
+It's also available in the 'cabal-rebasing' branch on
+git.postgresql.org/users/andresfreund/postgres.git . Note, though, that the
+branch's history will be rewritten.
+
+01: wakeup handling: reduces replication lag, not very interesting in this context
+
+02: Add zeroRecPtr: not very interesting either
+
+03: new syscache for relfilenode. This would benefit from some syscache-experienced eyes
+
+04: embedded lists: This is a general facility, general review appreciated
+
+05: preliminary bgworker support: This is not ready and is only posted because
+    it is preliminary work for the other patches. Simon will post a real patch
+    soon
+
+06: XLogReader: Review definitely appreciated
+
+07: logical data additions for WAL: Review definitely appreciated, I do not expect fundamental changes
+
+08: ApplyCache: Important infrastructure for the patch, review definitely appreciated
+
+09: Wal Decoding: Decode WAL generated with wal_level=logical into an ApplyCache
+
+10: WAL with 'origin node': This is another important base-piece for logical rep
+
+11: WAL segment handling changes: If the basic idea of adding a node_id to the
+    functions and adding a pg_lcr directory is acceptable, the rest of the
+    patch is fairly boring/mechanical
+
+12: walsender/walreceiver changes: Implement transport/filtering of logical
+ changes. Very relevant
+
+13: shared memory/crash recovery state handling for logical rep: Very relevant
+ minus the TODO's in the commit message
+
+14: apply module: review appreciated
+
+15: apply process: somewhat dependent on the preliminary changes in 05; the
+    general direction is visible, but loads of detail work are needed as soon
+    as some design decisions are agreed upon.
+
+16: this document. Not very interesting after you've read it ;)
--
1.7.10.rc3.3.g19a6c.dirty