From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Catalog/Metadata consistency during changeset extraction from wal |
Date: | 2012-06-21 11:41:25 |
Message-ID: | 201206211341.25322.andres@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Robert, Hi all,
Robert and I talked quite a bit about different methods of providing enough
information to extract tuples from wal. I don't think either of us is yet
really convinced of any individual method, so I want to recap our discussion
in one email so others can chime in without reading the already huge thread.
I hope I am not misrepesenting Roberts opinion here, but I am sure he will
correct me if I do ;)
To satisfy the different needs people have for changeset extraction we
currently think that for each individual tuple extracted from wal we need to
provide the following information:
a) one or more pieces of tuple data (HeapTupleData structs)
* INSERT: full new tuple
* UPDATE: full new tuple, old pkey (optionally the full old tuple)
* DELETE: old pkey (optionally the full old tuple)
b) the action performed (INSERT|UPDATE|DELETE)
c) the table on which the action was performed
d) access to the table structure (names, column types, ...) procs (*_out
functions for the individual columns)
The problem with getting that data is that at the point were decoding the wal
the catalog may have evolved significantly from the state it was in when the
tuple was put into the wal.
We can extract a) and b) without any problems (lets not talk about it here)
but we don't necessarily know how to make sense of the data because a
HeapTuple cannot be properly interpreted without the knowledge of c) and d).
I am of the opinion that c) is basically equivalent to solving d) because the
wal only contains the tuple (pg_database.oid, pg_tablespace.oid,
pg_class.relfilenode) of the table and not the 'pg_class.oid'. The relfilenode
is changed by operations that rewrite the table like ALTER TABLE ADD COLUMN
... DEFAULT ...; TRUNCATE; CLUSTER and some others.
A single transaction can contain tuples for different relfilenodes and with
different columns:
CREATE TABLE foo(id serial primary key, data text);
BEGIN;
INSERT INTO foo ...;
TRUNCATE foo;
INSERT INTO foo ...; -- same structure, different relfilenode
ALTER TABLE foo ADD COLUMN bar text;
INSERT INTO foo ...; -- same relfilenode, different table structure
ALTER TABLE foo ADD COLUMN zaphod text DEFAULT '';
INSERT INTO foo ...; -- different relfilenode, different table structure
COMMIT;
There are several additional complex scenarios.
In http://archives.postgresql.org/message-
id/201206192023(dot)20589(dot)andres(at)2ndquadrant(dot)com I listed which options I see for
reaching that goal.
A refined version of that list:
1.)
Decode on a different, possibly catalog-only, pg instance kept in sync using
the command trigger infrastructure (but not necessarily user-level defined
command triggers)
If the command/event trigger infrastructure logs into a system-catalog table
keeping the catalog in the correct state is relatively easy. When
replaying/converting a reassembled transaction everytime an INSERT into that
system table happens the contained DDL gets performed.
The locking on the generating side takes care of the concurrency aspects.
Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid
allocation
* easy transactionally correct catalog behaviour behaviour
* the decoding instance can be used to store all data in a highly efficient
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way
Disadvantes:
* the table structure of replicated tables needs to be exactly the same
* the type definition + support procs needs to be similar enough to read the
data
* error checking of the above isn't easy but probably possible
* full version/architecture compatibility required
* a proxy instance required even if you want to replicate into some other
system/architecture/version
2.)
Keep the decoding site up2date by replicating the catalog via normal HS
recovery
mechanisms.
Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in sync,
including oids
Disadvantages:
* driving the catalog recovery that way requires some somewhat intricate code
as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information
to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will always ever be only readable
3)
Multi-Versioned catalog
Below are two possible implementation strategies for that concept
Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back
to textual versions on mismatch
Disadvantages:
* decoding probably has to happen on the master which might not be what people
want performancewise
3a)
Change the system catalogs to be versioned
Advantages.
* catalog access is easy
* might be interesting for other users
Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off
3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as the
tuple in question has seen it.
Advantages:
* should be implementable with low impact to general code
Disadvantages:
* requires some complex code for assembling snapshots
* it might be hard to guarantee that we always have enough information to
reassemble a snapshot (subxid overflows ...)
* impacts vacuum if replication to some site is slow
4.)
Log enough information in the walstream to make decoding possible using only
the walstream.
Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* its possible to make this architecture independent
Disadvantage:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information
needs to be logged in a transactional manner repeatedly
* misuses wal far more than other methods
* significant new complexity in somewhat cricital code paths (heapam.c)
* insanely high space overhead if the decoding should be possible architecture
independent
5.)
The actually good idea. Yours?
-----
I think 3a) is not likely to fly and I think 4) is too complex although Robert
isn't convinced of the latter argument.
In my opinion either 3b) or 1) are our best options because they seem to
support most of the usecases without huge costs in complexity and
runtime/space for users not using changeset extraction. 3b) seems to be
preferrable because imo its the most flexible choice and doesn't require a
second instance.
Any arguments against/for those?
I am going to talk about implementing 3b) in a separate email in a bit.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Kyotaro HORIGUCHI | 2012-06-21 12:02:58 | Re: pl/perl and utf-8 in sql_ascii databases |
Previous Message | Kyotaro HORIGUCHI | 2012-06-21 11:22:43 | Re: pl/perl and utf-8 in sql_ascii databases |