Re: Catalog/Metadata consistency during changeset extraction from wal

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Florian Pflug <fgp(at)phlo(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Catalog/Metadata consistency during changeset extraction from wal
Date: 2012-06-25 15:34:13
Message-ID: CA+TgmobL-GY7pXou-9huHiyEU3so1YYXBocXMcUR1OeoY3rMhQ@mail.gmail.com

On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > The only theoretical way I see around that problem would be to postpone
>> > all relation unlinks until everything that could possibly read them has
>> > finished. Doesn't seem too alluring, although it would be needed if we
>> > ever move more things off SnapshotNow.
>> >
>> > Input/Ideas/Opinions?
>>
>> Yeah, this is slightly nasty.  I'm not sure whether or not there's a
>> way to make it work.
> Postponing all non-rollback unlinks to the next "logical checkpoint" is the
> only thing I can think of...

There are a number of cool things we could do if we postponed unlinks.
Like, why can't we allow concurrent read-only queries while a CLUSTER
operation is in progress? Well, two reasons. The first is that we
currently can't do ANY DDL with less than a full table lock because of
SnapshotNow-related race conditions. The second is that people might
still need to look at the old heap after the CLUSTER transaction
commits. Some kind of delayed unlink facility where we
garbage-collect relation backing files when their refcount falls to
zero would solve the second problem - not that that's any help by
itself without a solution to the first one, but hey.
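
For illustration, here's the status quo that a delayed-unlink facility
would improve on (a sketch only; big_table and big_table_pkey are
invented names):

    -- Session 1: CLUSTER holds an ACCESS EXCLUSIVE lock for its entire
    -- duration, and the old heap file is unlinked at commit.
    CLUSTER big_table USING big_table_pkey;

    -- Session 2: even a plain read-only query blocks until session 1
    -- commits.
    SELECT count(*) FROM big_table;

With refcounted backing files, session 2 could keep scanning the old
heap, and the file would only be unlinked once the last reader dropped
its reference.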

>> I had another idea.  Suppose decoding happens directly on the primary,
>> because I'm still hoping there's a way to swing that.  Suppose further
>> that we handle DDL by insisting that (1) any backend which wants to
>> add columns or change the types of existing columns must first wait
>> for logical replication to catch up and (2) if a backend which has
>> added columns or changed the types of existing columns then writes to
>> the modified table, decoding of those writes will be postponed until
>> transaction commit.  I think that's enough to guarantee that the
>> decoding process can just use the catalogs as they stand, with plain
>> old SnapshotNow.
> I don't think it's that easy. If you e.g. have multiple ALTERs in the same
> transaction interspersed with inserted rows, they will all have different
> TupleDescs.

If new columns were added, then tuples created with those older
tuple-descriptors can still be interpreted with the latest
tuple-descriptor.
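
For example, consider a transaction like this (a hypothetical sketch;
the table and column names are invented):

    BEGIN;
    INSERT INTO t (a) VALUES (1);            -- tuple has 1 attribute
    ALTER TABLE t ADD COLUMN b integer;
    INSERT INTO t (a, b) VALUES (2, 3);      -- tuple has 2 attributes
    ALTER TABLE t ADD COLUMN c text;
    INSERT INTO t (a, b, c) VALUES (4, 5, 'x');
    COMMIT;

Decoding all three inserts with the final three-column TupleDesc still
works, because attributes beyond a stored tuple's natts simply read out
as NULL.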

Columns that are dropped or retyped are a little trickier, but
honestly... how much do we care about those cases? How practical is
it to suppose we're going to be able to handle them sanely anyway?
Suppose that the user defines a type which works just like int4 except
that the output function writes out each number in pig latin (and the
input function parses pig latin). The user defines the types as
binary coercible to each other and then does ALTER TABLE on a large
table with an int4 column, transforming it into an int4piglatin
column. Due to Noah Misch's fine work, we will conclude that no table
rewrite is needed. But if logical replication is in use, then in
theory we should scan the whole table and generate an LCR for each row
saying "the row with primary key X was updated, and column Y, which
used to contain 42, now contains ourty-two-fay". Otherwise, if we're
doing heterogeneous replication into a system that just stores that
column as text, it'll end up with the wrong contents. On the other
hand, if we're trying to ship data to another PostgreSQL instance
where the column hasn't yet been updated, then all of those LCRs are
just going to error out when we try to apply them.
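
Spelled out, the scenario looks something like this (a sketch only; the
type's I/O functions are assumed to already exist, and all names are
invented):

    -- int4piglatin_in/int4piglatin_out are hypothetical I/O functions;
    -- LIKE = int4 copies int4's on-disk representation.
    CREATE TYPE int4piglatin (
        INPUT  = int4piglatin_in,
        OUTPUT = int4piglatin_out,
        LIKE   = int4
    );

    -- Declare the types binary coercible in both directions.
    CREATE CAST (int4 AS int4piglatin) WITHOUT FUNCTION AS ASSIGNMENT;
    CREATE CAST (int4piglatin AS int4) WITHOUT FUNCTION AS ASSIGNMENT;

    -- Since only the type's "skin" changes, no table rewrite happens,
    -- and therefore no WAL is generated for the table's rows.
    ALTER TABLE big_table ALTER COLUMN y TYPE int4piglatin;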

A more realistic scenario where you have the same problem is with
something like ALTER TABLE .. ADD COLUMN .. DEFAULT. If you add a
column with a default in a single step (as opposed to first adding the
column and then setting its default), we rewrite the table and set
every row to the default value. Should that generate LCRs showing
every row being updated to add that new value, or should we generate
no LCRs and assume that the DBA will independently do the same
operation on the remote side? Either answer could be correct,
depending on how the LCRs are being used. If you're just rewriting
with a constant default, then perhaps the sensible thing is to
generate no LCRs, since it will be more efficient to mimic the
operation on the remote side than to replay the changes row-by-row.
But what if the default isn't a constant, like maybe it's
nextval('new_synthetic_pkey_seq') or even something like now()? In
those cases, it seems quite likely that if you don't generate LCRs,
manual user intervention will be required to get things back on track.
On the other hand, if you do generate LCRs, the remote side will
become horribly bloated on replay, unless the LCRs also instruct the
far side that they should be applied via a full-table rewrite.
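
To make the cases concrete (a sketch; the table and column names are
invented, and note that the single-step form rewrites the table):

    -- Single step: rewrites the table, stamping 42 into every row.
    ALTER TABLE t ADD COLUMN c integer DEFAULT 42;

    -- Two steps: no rewrite; existing rows read c as NULL, and only
    -- newly inserted rows pick up the default.
    ALTER TABLE t ADD COLUMN c integer;
    ALTER TABLE t ALTER COLUMN c SET DEFAULT 42;

    -- The problematic non-constant defaults: nextval() stamps a distinct
    -- value into each row, and now() stamps this transaction's timestamp,
    -- so a remote side running the same DDL later gets different values.
    ALTER TABLE t ADD COLUMN id bigint
        DEFAULT nextval('new_synthetic_pkey_seq');
    ALTER TABLE t ADD COLUMN created timestamptz DEFAULT now();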

Can we just agree to punt all this complexity for version 1 (and maybe
versions 2, 3, and 4)? I'm not sure what Slony does in situations
like this, but I bet for a lot of replication systems the answer is
"do a full resync". In other words, we either forbid the operation
outright when the table is enabled for logical replication, or else we
emit an LCR that says, in effect, "transaction 12345 monkeyed with the
table, please resync". It strikes me that it's really the job of some
higher-level control logic to decide what the "correct" behavior is in
these cases; the decoding process doesn't really have enough
information about what the user is trying to do to make a sensible
decision anyway. It would be nice to be able to support some simple
cases like "adding a column that has no default" or "dropping a
column" without punting, but going much further than that seems like
it will require embedding policy decisions that should really be
happening at a higher level.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
