From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-22 06:48:41
Message-ID: CA+U5nM+0FKYk=tOB+gwNx6YU4tqqxQZzb3QCHLvs0kgv=RVNzw@mail.gmail.com
Lists: pgsql-hackers
On 21 June 2012 12:41, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> 2.)
> Keep the decoding site up2date by replicating the catalog via normal HS
> recovery
> mechanisms.
>
> Advantages:
> * most of the technology is already there
> * minimal overhead (space, performance)
> * no danger of out of sync catalogs
> * no support for command triggers required that can keep a catalog in sync,
> including oids
>
> Disadvantages:
> * driving the catalog recovery that way requires some somewhat intricate code
> as it needs to be done in lockstep with decoding the wal-stream
> * requires an additional feature to guarantee HS always has enough information
> to be queryable after a crash/shutdown
> * some complex logic/low-level fudging required to keep the transactional
> behaviour sensible when querying the catalog
> * full version/architecture compatibility required
> * the decoding site will only ever be readable
My initial reaction was "this won't work", but that requires
qualification since this is a complex topic: You can use this
approach as long as you realise that the catalog it gives can never be
rewound.
So the generic approach to "construct me a catalog as of this LSN"
would need to start with a base backup of the catalog and then roll
forward to the appropriate LSN. Which means a generic user of this
approach would need to be able to construct an initial catalog using a
PITR.
Constructing a decoding site requires you to
a) take a partial base backup of the catalog
b) apply WAL records to bring that forward to the correct LSN, which
would require some alteration of the recovery code to skip the files
missing in a)
So taking the approach of a decoding site means we have to modify
recovery code, and even when we do that we still end up with a
difficult-to-deploy option in the real world. Difficult to deploy
because we need a whole new instance of Postgres, plus we need all of
the WAL files, which could easily be impractical.
The overall approach is good, but the architecture is wrong. What we
need is a "catalog base backup" and a means of rolling forward to the
appropriate LSN. Rolling forward using WAL is too bulky, so we need a
separate log of DDL changes to the catalog. So what we need is a
"catalog base backup" plus a "ddl-log".
And we need to be able to reconstruct the correct catalog on the
target server directly.
To translate the WAL we maintain a secondary set of catalog tables,
which only exist for logical replication. These are normal,
non-versioned tables, but held in a new schema pg_logical or similar.
One reason why this must be a secondary catalog is to allow the
translation to take place on the target server, and to allow
translation of WAL from a prior version of the catalog - so we can
allow online upgrades across different catalog versions (and possibly
major versions).
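To make that concrete, a rough sketch of what one such secondary table
might look like; the schema name follows the suggestion above, but the
table and column names are purely illustrative, not a proposal of the
exact layout:

  CREATE SCHEMA pg_logical;

  -- shadow copy of the subset of pg_class needed for WAL translation;
  -- an ordinary, non-versioned user table, so it can hold catalog rows
  -- coming from an older catalog version (or major version) than the
  -- server it lives on
  CREATE TABLE pg_logical.class_shadow (
      reloid       oid   NOT NULL PRIMARY KEY,
      relname      name  NOT NULL,
      relnamespace oid   NOT NULL,
      reltype      oid   NOT NULL
  );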
The secondary catalog is used in place of the normal catalog during
InitCatalogCache() in the apply process. All the normal caches exist,
they just point to secondary relations rather than the normal ones.
When we initialise replication we take a copy of the appropriate
tables, columns and rows in a catalog-base-backup, using something
like pg_dump. Overall, this is much smaller than the normal catalog since
it avoids temp tables and anything not related to WAL translation.
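As an illustration only, taking that copy could be as simple as copying
the interesting catalog rows into the secondary tables (using the
illustrative pg_logical.class_shadow table from above); the WHERE clause
is just an example of the kind of filtering meant by "anything not
related to WAL translation":

  -- take the catalog-base-backup: copy only the rows/columns we care about
  INSERT INTO pg_logical.class_shadow (reloid, relname, relnamespace, reltype)
  SELECT oid, relname, relnamespace, reltype
  FROM   pg_catalog.pg_class
  WHERE  relpersistence <> 't';   -- skip temp tables, they don't need translation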
On each non-temp change to the database we record changes as SQL in
the ddl-log, together with the LSN of the change.
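A minimal sketch of the ddl-log as a plain table; the LSN is stored here
as a bigint purely for illustration, any sortable representation of the
WAL position would do:

  CREATE TABLE pg_logical.ddl_log (
      lsn   bigint  NOT NULL,   -- WAL position at which the catalog change took effect
      stmt  text    NOT NULL,   -- the change, recorded as re-executable SQL
      PRIMARY KEY (lsn)
  );

  -- e.g. recorded when a user runs ALTER TABLE ... ADD COLUMN
  INSERT INTO pg_logical.ddl_log (lsn, stmt)
  VALUES (123456789, 'ALTER TABLE public.t ADD COLUMN c integer');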
When the number of changes in the ddl-log hits a limit we take a new
catalog-base-backup. This process is similar to a checkpoint, but much
less frequent; let's call it a ddl-checkpoint.
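For illustration, the decision could be as simple as the sketch below;
the threshold and the two helper functions are made up for the example:

  -- hypothetical maintenance step, run occasionally by the replication code
  DO $$
  BEGIN
      IF (SELECT count(*) FROM pg_logical.ddl_log) > 10000 THEN
          PERFORM pg_logical.take_catalog_base_backup();   -- hypothetical helper
          -- entries older than the new backup are no longer needed
          DELETE FROM pg_logical.ddl_log
          WHERE lsn < pg_logical.last_backup_lsn();         -- hypothetical helper
      END IF;
  END;
  $$;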
When we start to read WAL logs to translate them, we start by
truncating/re-bootstrapping and reloading the secondary catalog from
the base backup. We then apply all changes from the ddl-log (which is
just a sequence of SQL statements) up until the LSN at the start of
WAL. The secondary catalog is then an exact copy of the catalog as of
that LSN.
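A sketch of that replay step, again with the illustrative bigint LSN and
a made-up restore helper; the point is simply that it is a sequential
EXECUTE of the logged SQL in LSN order:

  CREATE FUNCTION pg_logical.rebuild_catalog(upto_lsn bigint)
  RETURNS void LANGUAGE plpgsql AS $$
  DECLARE
      entry record;
  BEGIN
      -- start from the catalog-base-backup (hypothetical helper that
      -- truncates the secondary catalog and reloads it from the backup)
      PERFORM pg_logical.restore_catalog_base_backup();

      -- then replay every logged DDL statement up to the target LSN,
      -- leaving the secondary catalog exactly as of that point in the WAL
      FOR entry IN
          SELECT stmt FROM pg_logical.ddl_log
          WHERE  lsn <= upto_lsn
          ORDER  BY lsn
      LOOP
          EXECUTE entry.stmt;
      END LOOP;
  END;
  $$;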
As we read through WAL we apply further changes to the secondary catalog
so it stays in lockstep with the WAL we are currently reading.
Having the catalog-base-backup and ddl-log allows reconstruction of the
catalog without needing to put the whole catalog into WAL at each checkpoint.
We can truncate old WAL segments and yet still recreate the DDL needed
to translate current WAL data. As a result, ddl-checkpoints are much
less frequent, perhaps weekly or monthly, rather than every few
minutes.
The whole process is similar in algorithm to recovery, but is just
normal userspace tables and SQL.
Constructing the correct catalog seems to be the heart of this
problem, so it is likely to take a while and look complex. Getting the
HS initial state was around 50% of the effort in making it all work,
so I guess it's similar here.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services