From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Changeset Extraction Interfaces
Date: 2013-12-12 18:52:34
Message-ID: 20131212185234.GA18665@awork2.anarazel.de
Lists: pgsql-hackers
On 2013-12-12 12:13:24 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > If we were to start out streaming changes before the last running
> > transaction has finished, they would be visible in that exported
> > snapshot and you couldn't use it to roll forward from anymore.
>
> Actually, you could. You'd just have to throw away any transactions
> whose XIDs are visible to the exported snapshot. In other words, you
> begin replication at time T0, and all transactions which begin after
> that time are included in the change stream. At some later time T1,
> all transactions in progress at time T0 have ended, and now you can
> export a snapshot at that time, or any later time, from which you can
> roll forward. Any change-stream entries for XIDs which would be
> visible to that snapshot shouldn't be replayed when rolling forward
> from it, though.
But that would make for too complex an interface, imo without a
corresponding benefit. If you skip the changes when rolling forward,
there's no point in streaming them out in the first place.
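Just to spell out what a consumer would have to do under your scheme (a
purely hypothetical sketch, none of these names are from the patch): it
would need to test every decoded transaction's xid against the exported
snapshot and drop the ones already contained in it, roughly like this:

    #include "postgres.h"
    #include "access/transam.h"
    #include "utils/snapshot.h"

    /*
     * Hypothetical sketch, not patch code: would the effects of the
     * committed transaction 'xid' already be visible in the exported
     * snapshot the consumer copied its initial data with?  If so, its
     * decoded changes would have to be thrown away instead of replayed.
     */
    static bool
    change_already_in_base_snapshot(Snapshot exported_snap, TransactionId xid)
    {
        uint32      i;

        /* completed before the snapshot's xmin => contained in the copy */
        if (TransactionIdPrecedes(xid, exported_snap->xmin))
            return true;

        /* at or beyond xmax => started too late to be visible */
        if (TransactionIdFollowsOrEquals(xid, exported_snap->xmax))
            return false;

        /* still running when the snapshot was taken => not visible */
        for (i = 0; i < exported_snap->xcnt; i++)
        {
            if (TransactionIdEquals(xid, exported_snap->xip[i]))
                return false;
        }

        /* completed before the snapshot was taken => contained in the copy */
        return true;
    }

That's the kind of bookkeeping I'd rather not push onto every consumer.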
> I think it sucks (that's the technical term) to have to wait for all
> currently-running transactions to terminate before being able to begin
> streaming changes, because that could take a long time.
I don't think there's much of an alternative for replication solutions;
for other use cases we may want to add an option to skip the wait. It's
not as though this is something you do all the time: as soon as a slot
has been acquired, there are no further waits.
> And you might
> well know that the long-running transaction which is rolling up
> enormous table A that you don't care about is never going to touch
> table B which you actually want to replicate. Now, ideally, the DBA
> would have a way to ignore that long-running transaction and force
> replication to start, perhaps with the caveat that if that
> long-running transaction actually does touch B after all then we have
> to resync.
Puh. I honestly have zero confidence in DBAs making an informed decision
about something like this. And really, for a replication solution, how
often do you think this will be an issue?
> So imagine this. After initiating logical replication, a replication
> solution either briefly x-locks a table it wants to replicate, so that
> there can't be anyone else touching it, or it observes who has a lock
> >= RowExclusiveLock and waits for all of those locks to drop away. At
> that point, it knows that no currently-in-progress transaction can
> have modified the table prior to the start of replication, and begins
> copying the table. If a transaction that began before the start of
> replication subsequently modifies the table, a WAL record will be
> written, and the core logical decoding support could let the plugin
> know by means of an optional callback (hey, btw, a change I can't
> decode just hit table XYZ). The plugin will need to respond by
> recopying the table, which sucks, but it was the plugin's decision to
> be optimistic in the first place, and that will in many cases be a
> valid policy decision. If no such callback arrives before the
> safe-snapshot point, then the plugin made the right bet and will reap
> the just rewards of its optimism.
Sure, all that's possible. But hell, it's complicated to use. If reality
proves people want this, let's go there, but let's get the basics right
and committed first.
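For concreteness, the optional callback you describe might end up looking
roughly like this (hypothetical sketch; the names are illustrative, not
the patch's API):

    #include "postgres.h"
    #include "access/xlogdefs.h"

    struct LogicalDecodingContext;      /* opaque here */

    /* "a change I can't decode just hit table 'relid'" */
    typedef void (*UndecodableChangeCB) (struct LogicalDecodingContext *ctx,
                                         Oid relid,
                                         XLogRecPtr change_lsn);

    extern void schedule_table_resync(Oid relid);   /* hypothetical helper */

    /*
     * A plugin that took the optimistic bet would respond by scheduling a
     * fresh base copy of the affected table.
     */
    static void
    optimistic_undecodable_change_cb(struct LogicalDecodingContext *ctx,
                                     Oid relid, XLogRecPtr change_lsn)
    {
        schedule_table_resync(relid);
    }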
All the logic around whether to decode a transaction is:
void
SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
...
    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* ensure that only commits after this are getting replayed */
        if (builder->transactions_after < lsn)
            builder->transactions_after = lsn;
and then
/*
 * Should the contents of a transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr <= builder->transactions_after;
}
so it's not as though it would require all that many changes.
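And the caller side is just as small; simplified (not the exact patch
code), the commit decoding path does something like:

    /*
     * When decoding a commit record, ask the snapshot builder whether the
     * transaction ended before the point we're allowed to replay from,
     * and if so forget its queued changes instead of handing them to the
     * output plugin.
     */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr))
    {
        ReorderBufferForget(ctx->reorder, xid, buf->origptr);
        return;
    }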
What I can see as possibly getting into 9.4 is a FASTSTART option that
doesn't support exporting a snapshot, but in return doesn't have to wait
for the SNAPBUILD_CONSISTENT state. That's fine for some use cases,
although I don't think for any of the major ones.
> > It's not too difficult to provide an option to do that. What I've been
> > thinking of was to correlate the confirmation of consumption with the
> > transaction the SRF is running in. So, confirm the data as consumed if
> > it commits, and don't if not. I think we could do that relatively easily
> > by registering a XACT_EVENT_COMMIT.
>
> That's a bit too accident-prone for my taste. I'd rather the DBA had
> some equivalent of peek_at_replication(nchanges int).
One point in favour of my suggested behaviour is that it closes a bigger
race condition. Currently, as soon as start_logical_replication() has
finished building the tuplestore, it marks the end position as received.
But we can very well fail before the user has received all those changes.
The only other idea I have for closing that window is to add an explicit
function to confirm receipt of the changes, but that sounds icky for
something exposed at the SQL level.
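For reference, tying the confirmation to the SRF's transaction really
only needs the existing xact callback machinery from access/xact.h; a
rough sketch (the helper names are made up):

    #include "postgres.h"
    #include "access/xact.h"

    extern void confirm_changes_received(void);     /* hypothetical helper */

    /* only mark the streamed changes as consumed if the consuming xact commits */
    static void
    consume_confirm_xact_cb(XactEvent event, void *arg)
    {
        if (event == XACT_EVENT_COMMIT)
            confirm_changes_received();
    }

    /* registered once, e.g. when start_logical_replication() is called */
    static void
    register_consume_confirm(void)
    {
        RegisterXactCallback(consume_confirm_xact_cb, NULL);
    }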
> >> Sounds about right, but I think we need to get religion about figuring
> >> out what terminology to use. At the moment it seems to vary quite a
> >> bit between "logical", "logical decoding", and "decoding". Not sure
> >> how to nail that down.
> >
> > Agreed. Perhaps we should just avoid both logical and decoding entirely
> > and go for "changestream" or similar?
>
> So wal_level=changestream? Not feeling it. Of course we don't have
> to be 100% rigid about this but we should try to make our terminology
> correspond to natural semantic boundaries. Maybe we should call
> the process logical decoding, and the results logical streams, or
> something like that.
I am fine with that, but I wouldn't mind some opinions from people who
know less about the implementation than you and I do.
> > For me "logical decoding" can be the basis of "logical replication", but
> > also for other features.
>
> Such as?
* auditing
* cache invalidation
* concurrent, rewriting ALTER TABLE
* concurrent VACUUM/CLUSTER similar to pg_reorg
> > I don't really see what the usage of a special type has to do with this,
> > but I think that's besides your main point. What you're saying is that
> > the output plugin is just defined by a function name, possibly schema
> > prefixed. That has an elegance to it. +1
>
> Well, file_fdw_handler returns type fdw_handler. That's nice, because
> we can validate that we've got the right sort of object when what we
> want is an FDW handler. If it just returned type internal, it would
> be too easy to mix it up with something unrelated that passed back
> some other kind of binary goop.
My (badly expressed) point is that returning INTERNAL or a specialized
type seems orthogonal to the choice between using a special catalog that
maps the output plugin name to a function returning callbacks and simply
using the function's name as the output plugin name.
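To illustrate what I mean (hypothetical sketch; the struct and function
names are invented, this isn't the patch's API): in the "function name is
the plugin name" variant, the named function just hands back a struct of
callbacks, and whether its declared return type is INTERNAL or a
dedicated type (like fdw_handler for FDWs) is a separate decision:

    #include "postgres.h"
    #include "fmgr.h"

    /* invented names, for illustration only */
    typedef struct DecodingCallbacks
    {
        void        (*begin_cb) (void *ctx);
        void        (*change_cb) (void *ctx);
        void        (*commit_cb) (void *ctx);
    } DecodingCallbacks;

    PG_FUNCTION_INFO_V1(my_output_plugin);

    Datum
    my_output_plugin(PG_FUNCTION_ARGS)
    {
        DecodingCallbacks *cb = palloc0(sizeof(DecodingCallbacks));

        /* fill in the callbacks the decoding machinery should invoke */
        cb->begin_cb = NULL;
        cb->change_cb = NULL;
        cb->commit_cb = NULL;

        PG_RETURN_POINTER(cb);
    }

Giving that function a dedicated return type instead of INTERNAL would
let us validate, as with fdw_handler, that a given function really is an
output plugin; but that point applies either way, with or without a
catalog in between.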
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services