From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | logical changeset generation v4 |
Date: | 2013-01-15 01:38:45 |
Message-ID: | 20130115013845.GE22155@awork2.anarazel.de |
Lists: | pgsql-hackers |
Hi everyone,
Here is the newest version of logical changeset generation.
Changes since last time round:
* loads and loads of bugfixes
* crash-safe persistence of in-memory structures across crashes and restarts
* very large transaction support (spooling to disk)
* rebased onto the newest version of xlogreader
Overview of the patches:
Xlogreader (separate patch):
[01] Centralize Assert* macros into c.h so it's common between backend/frontend
[02] Provide common malloc wrappers and palloc et al. emulation for frontend'ish environs
[03] Split out xlog reading into its own module called xlogreader
[04] Add pg_xlogdump contrib module
Those seem to be ready, barring some infrastructure work around common
backend/frontend code for xlogdump.
Add capability to map from (tablespace, relfilenode) to pg_class.oid:
[05]: Add a new RELFILENODE syscache to fetch a pg_class entry via (reltablespace, relfilenode)
[06]: Add RelationMapFilenodeToOid function to relmapper.c
[07]: Add pg_relation_by_filenode to look up a relation by (tablespace, filenode)
Imo those are pretty solid, although there were some doubts about the
correctness of [05], which I think are all addressed in this version:
The fundamental problem of adding a (tablespace, relfilenode) syscache
is that no unique index exists in pg_class over (relfilenode,
reltablespace) because relfilenode is set to '0' (aka InvalidOid) when
the table is either a shared table or a nailed table. This cannot really
be changed as pg_class.relfilenode is not authoritative for those and
possibly cannot even be accessed (different table, early startup). We
also don't want to rely on the null bitmap, so we can't set it to NULL.
The reason why I think it is safe to use the added RELFILENODE syscache
as I have in those patches is that when looking up a (tablespace,
filenode) pair, none of those duplicate '0' values will get looked up, as
there is no point in looking up an invalid relfilenode. Instead the
shared/nailed relfilenodes will have to get mapped via
RelationMapFilenodeToOid.
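To make the lookup order above concrete, here is a minimal standalone
sketch. It is not the code from the patches: the Sketch* types, the
sketch_* functions and the sample rows are all made up for illustration,
and the real functions take different arguments. It only shows that an
invalid relfilenode is never searched for, and that shared/nailed
relations are resolved through a separate mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for a pg_class row. */
typedef struct SketchClassRow
{
    Oid oid;
    Oid reltablespace;
    Oid relfilenode;            /* 0 for shared and nailed relations */
} SketchClassRow;

/* Hypothetical stand-in for one relation mapper entry. */
typedef struct SketchMapEntry
{
    Oid oid;
    Oid filenode;
} SketchMapEntry;

/* Made-up sample data; the OIDs and filenodes are illustrative only. */
static const SketchClassRow sketch_class_rows[] = {
    {1259, 0, 0},               /* mapped: real filenode is in the map */
    {1262, 1664, 0},            /* mapped: real filenode is in the map */
    {16384, 0, 16384},
    {16390, 0, 16402},
};

static const SketchMapEntry sketch_map_entries[] = {
    {1259, 12000},
    {1262, 12001},
};

#define SKETCH_LENGTH(a) (sizeof(a) / sizeof((a)[0]))

/* Stand-in for the relmapper lookup used for shared/nailed relations. */
static Oid
sketch_map_filenode_to_oid(Oid filenode)
{
    /* The caller filters out invalid filenodes before getting here. */
    assert(filenode != InvalidOid);

    for (size_t i = 0; i < SKETCH_LENGTH(sketch_map_entries); i++)
    {
        if (sketch_map_entries[i].filenode == filenode)
            return sketch_map_entries[i].oid;
    }
    return InvalidOid;
}

/* Resolve (tablespace, filenode) to a relation OID, or InvalidOid. */
static Oid
sketch_relation_by_filenode(Oid tablespace, Oid filenode)
{
    /* The duplicate '0' entries are never searched for. */
    if (filenode == InvalidOid)
        return InvalidOid;

    /* Stand-in for the (reltablespace, relfilenode) cache lookup. */
    for (size_t i = 0; i < SKETCH_LENGTH(sketch_class_rows); i++)
    {
        if (sketch_class_rows[i].reltablespace == tablespace &&
            sketch_class_rows[i].relfilenode == filenode)
            return sketch_class_rows[i].oid;
    }

    /* Not found in pg_class: it may be a shared/nailed relation. */
    return sketch_map_filenode_to_oid(filenode);
}
```

Because the '0' rows can never match a valid filenode, the duplicates in
the sample data do not make the lookup ambiguous.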
The alternative here seems to be to invent our own attoptcache-style
cache, but given that this lookup is fairly performance critical and has
to handle invalidations in a sensible manner, that seems to be an
unnecessary amount of code.
Any opinions here?
[08] wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement
It's useful to be able to represent values that are not a valid
CommandId. Add such a representation.
Imo this is straightforward and easy.
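A minimal standalone sketch of the idea, assuming CommandId is a 32-bit
unsigned integer and the all-ones value is reserved as the invalid
marker. The sketch_command_counter_increment function is hypothetical: it
returns false where the backend would raise an error, and it is not the
patch's CommandCounterIncrement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t CommandId;

#define FirstCommandId   ((CommandId) 0)
#define InvalidCommandId (~(CommandId) 0)   /* reserved, never a real id */

/*
 * Advance the command counter. Returns false and leaves the counter
 * unchanged if the next value would be the reserved InvalidCommandId.
 */
static bool
sketch_command_counter_increment(CommandId *current)
{
    assert(current != NULL);

    *current += 1;
    if (*current == InvalidCommandId)
    {
        *current -= 1;          /* undo, the counter is exhausted */
        return false;
    }
    return true;
}
```

Reserving the top value costs one usable command id per transaction and
lets code store "no command id" without a separate flag.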
[09] Adjust all *Satisfies routines to take a HeapTuple instead of a HeapTupleHeader
For timetravel access to the catalog we need to be able to look up (cmin,
cmax) pairs of catalog rows when we're 'inside' that TX. This patch just
adapts the signature of the *Satisfies routines to expect a HeapTuple
instead of a HeapTupleHeader. The amount of changes for that is fairly
low as the HeapTupleSatisfiesVisibility macro already expected the
former.
It also makes sure the HeapTuple fields are set up in the few places that
didn't already do so.
[10] wal_decoding: Allow walsenders to connect to a specific database
For logical decoding we need to be able to access the schema of a
database - for that we need to be connected to a database. Thus allow
recovery connections to connect to a specific database.
This patch currently has the disadvantage that it's not possible anymore
to connect to a database that's actually named "replication", as the
decision whether a connection goes to a database or not is made based
upon dbname != replication.
Better ideas?
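The limitation is easy to see in a standalone sketch of that decision.
The function name is hypothetical and the real check sits in the
connection startup code, not in a helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether a replication connection should attach to a database.
 * Any dbname other than the literal "replication" is treated as a real
 * database, so a database actually called "replication" is unreachable.
 */
static bool
sketch_walsender_connects_to_database(const char *dbname)
{
    assert(dbname != NULL);

    return strcmp(dbname, "replication") != 0;
}
```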
[11] wal_decoding: Add alreadyLocked parameter to GetOldestXminNoLock
Pretty boring preparatory work for being able to nail a certain xid as
the global horizon. I don't think there is much to be said about this
anymore, it has already been somewhat discussed.
[12] wal_decoding: Log xl_running_xacts at a higher frequency than checkpoints are done
Make the bgwriter emit an xl_running_xacts record every 15s if there is
xlog activity in the system.
Imo this isn't too complicated and is already beneficial for HS, so it
could be applied separately.
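A rough standalone sketch of that rule, assuming "xlog activity" means
the insert position has moved since the last record was logged. The
struct, the function and the millisecond clock are illustrative
assumptions, not the bgwriter code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LOG_INTERVAL_MS 15000    /* the 15s from the description */

/* Hypothetical bookkeeping kept between bgwriter loop iterations. */
typedef struct SketchLogState
{
    int64_t  last_log_ms;       /* when a record was last emitted */
    uint64_t last_logged_lsn;   /* insert position seen at that time */
} SketchLogState;

/*
 * Return true if a running-xacts record should be emitted now, and
 * update the bookkeeping. A real implementation would remember the
 * position after writing its own record, so that the record itself
 * does not count as activity.
 */
static bool
sketch_should_log_running_xacts(SketchLogState *st, int64_t now_ms,
                                uint64_t insert_lsn)
{
    assert(st != NULL);

    if (now_ms - st->last_log_ms < SKETCH_LOG_INTERVAL_MS)
        return false;           /* too soon */
    if (insert_lsn == st->last_logged_lsn)
        return false;           /* idle system, nothing new in the xlog */

    st->last_log_ms = now_ms;
    st->last_logged_lsn = insert_lsn;
    return true;
}
```

Skipping the record on an idle system keeps it from generating WAL only
because of this logging.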
[13] copydir: make fsync_fname public
This should probably go to some other file, fd.[ch]? Otherwise it's
pretty trivial.
[14] wal decoding: Add information about a table's primary key to struct RelationData
Back when discussing the first prototype of BDR, Heikki was concerned
about doing a search for the primary key during heap_delete. I agree that
that isn't really a good idea.
So remember the primary key (or a candidate key) when looking through
the available indexes in RelationGetIndexList().
I don't really like the name rd_primary as it also contains candidate
keys (i.e. indimmediate, indisunique, no expressions, not null). Better
suggestions?
I don't think there is much that is debatable in here, but there is no
independent benefit of applying it separately.
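The selection rule can be sketched standalone as below. The
SketchIndexInfo struct is a made-up flattening of the relevant index
properties, and preferring a real primary key over the first other
candidate is an assumption of this sketch, not something taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical summary of the index properties that matter here. */
typedef struct SketchIndexInfo
{
    Oid  indexoid;
    bool indisprimary;
    bool indisunique;
    bool indimmediate;
    bool has_expressions;       /* any expression columns? */
    bool all_columns_not_null;  /* every key column is NOT NULL? */
} SketchIndexInfo;

/* Can this index serve as a row identity (candidate key)? */
static bool
sketch_is_candidate_key(const SketchIndexInfo *idx)
{
    return idx->indisunique && idx->indimmediate &&
           !idx->has_expressions && idx->all_columns_not_null;
}

/*
 * Walk the index list once and remember a usable key: the primary key
 * if there is one, else the first other candidate, else InvalidOid.
 */
static Oid
sketch_pick_key_index(const SketchIndexInfo *indexes, size_t n)
{
    Oid candidate = InvalidOid;

    assert(indexes != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (!sketch_is_candidate_key(&indexes[i]))
            continue;
        if (indexes[i].indisprimary)
            return indexes[i].indexoid;
        if (candidate == InvalidOid)
            candidate = indexes[i].indexoid;
    }
    return candidate;
}
```

Doing this while the index list is built anyway means nothing has to
search for the key at delete time.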
[15] wal decoding: Introduce wal decoding via catalog timetravel
The heart of changeset generation.
Built out of several parts:
* snapshot building infrastructure
* transaction reassembly
* shared memory state for replication slots
* new wal_level=logical that catches more data
* (local) output plugin interface
* (external) walsender interface
[16] wal decoding: Add a simple decoding module in contrib named 'test_decoding'
An example output plugin also used in regression tests
[17] wal decoding: Introduce pg_receivellog, the pg_receivexlog equivalent for logical changes
An application to receive changes over the walsender/replication=1
interface.
[18] wal_decoding: Add test_logical_replication extension for easier testing of logical decoding
An extension that allows using logical decoding from SQL. This isn't
really suitable for production, high performance use, but it's useful
for development and, more importantly, it makes it relatively easy to
write regression tests without new infrastructure.
I am starting to be happy about the state of this!
Open issues & questions:
1) testing infrastructure
2) Determination of replication slots
3) Options for output plugins
4) the general walsender interface
5) Additional catalog tables
1) Currently all the tests are in patch [18] which is a contrib
module. There are two reasons for putting them there:
First, the tests need wal_level=logical, which isn't the case with the
current regression tests.
Second, I don't think the test_logical_replication functions should live
in core as they shouldn't be used for a production replication scenario
(causes long-running transactions, requires polling), but I have failed
to find a neat way to include a contrib extension in the plain
regression tests.
2) Currently the logical replication infrastructure assigns a 'slot-id'
when a new replica is set up. That slot id isn't really nice
(e.g. "id-321578-3"). It also requires that [18] keeps state in a global
variable to make writing regression tests easy.
I think it would be better to make the user specify those replication
slot ids, but I am not really sure about it.
3) Currently no options can be passed to an output plugin. I am thinking
about making "INIT_LOGICAL_REPLICATION 'plugin'" accept the now widely
used ('option' ['value'], ...) syntax and pass that to the output
plugin's initialization function.
4) Does anybody object to:
-- allocate a permanent replication slot
INIT_LOGICAL_REPLICATION 'plugin' 'slotname' (options);
-- stream data
START_LOGICAL_REPLICATION 'slotname' 'recptr';
-- deallocate a permanent replication slot
FREE_LOGICAL_REPLICATION 'slotname';
5) Currently it's only allowed to access catalog tables. It's fairly
trivial to extend this to additional tables if you can accept some
(noticeable but not too big) overhead for modifications on those tables.
I was thinking of making that an option for tables; that would be useful
for replication solutions' configuration tables.
Further todo:
* don't reread so much WAL after a restart/crash
* better TOAST support, the current one can fail after a DROP TABLE
* only peg a new "catalog xmin" instead of the global xmin
* more docs about the internals
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
outside. There have been improvements vs 3.1 but not enough
* abort too old replication slots
Puh.
The current git tree is at:
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf4
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
The xlogreader development happens in the xlogreader_4 branch.
Input?
Greetings,
Andres Freund
PS: Thanks for the input & help so far!
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services