From: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, boris(dot)novikov(at)acm(dot)org, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>
Subject: Re: Logical archiving
Date: 2020-12-07 03:05:35
Message-ID: CAGRY4nw1MzOYn=zx+kZ43-jWUr5=30UkHi-LTbS2omwsTwVuRA@mail.gmail.com
Lists: pgsql-hackers
Actually CC'd Petr this time.
On Mon, 7 Dec 2020 at 11:05, Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
wrote:
> Reply follows inline. I addressed your last point first, so it's out of
> order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> > If the OLAP side cannot consume data fast enough, we run out of space
> > due to the replication slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the
> restore_command to fetch archived WALs for decoding.
>
> Replication slots already have a WAL retention limit, but right now when
> that limit is reached the slot is invalidated and becomes useless; it's
> effectively dropped. Instead, if WAL archiving is enabled, we should leave
> the slot valid. If a consumer of the slot needs WAL that no longer
> exists in pg_wal, we should have the walsender invoke the restore_command
> to fetch the missing WAL segment, decode it, and remove it again.
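>
> To sketch it (untested, and assuming roughly the v13 xlogreader interface):
> the walsender's segment-open callback could fall back to
> RestoreArchivedFile(), which today is only invoked during recovery and
> would need to be made callable here. Error handling and cleanup of the
> restored segment are omitted:
>
>     /*
>      * Hypothetical segment_open callback for the walsender's logical
>      * decoding xlogreader: if the segment is no longer in pg_wal, try
>      * to fetch it from the WAL archive via restore_command instead of
>      * failing (and invalidating the slot).
>      */
>     static void
>     logical_read_segment_open(XLogReaderState *state,
>                               XLogSegNo nextSegNo, TimeLineID *tli_p)
>     {
>         char        path[MAXPGPATH];
>         char        xlogfname[MAXFNAMELEN];
>
>         XLogFilePath(path, *tli_p, nextSegNo, wal_segment_size);
>         state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
>         if (state->seg.ws_file >= 0)
>             return;             /* segment still present in pg_wal */
>
>         /* Gone from pg_wal: ask the archive, as recovery does. */
>         XLogFileName(xlogfname, *tli_p, nextSegNo, wal_segment_size);
>         if (RestoreArchivedFile(path, xlogfname, "RECOVERYXLOG",
>                                 wal_segment_size, false))
>             state->seg.ws_file = BasicOpenFile(path, O_RDONLY | PG_BINARY);
>
>         if (state->seg.ws_file < 0)
>             ereport(ERROR,
>                     (errcode_for_file_access(),
>                      errmsg("could not open WAL segment %s", xlogfname)));
>     }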
>
> This would not be a technically difficult patch, and it's IMO one of the
> more important ones for improving logical replication.
>
> > I was discussing problems of CDC with the scientific community and they
> > asked this simple question: "So you have an efficient WAL archive on very
> > cheap storage; why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR are very much desirable, but we're not
> nearly ready for them yet, and we're missing a lot of the foundations
> needed to make them really useful.
>
> IMO the strongest prerequisite is integrated DDL capture and replication
> in Pg. While this could be implemented in the publisher/subscriber logic
> for logical replication, it would make much more sense (IMO) to make it
> easier to feed DDL events into any logical replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication
> support. Doing it is not simple though - there are plenty of complexities
> (a sketch of the ProcessUtility_hook capture guard follows this list):
>
> * Reliably identifying the target objects and mapping them to replication
>   set memberships for DML replication
> * Capturing, replicating and managing the search_path and other DDL
>   execution context (DateStyle and much more) reliably
> * Each statement type needs specific logic to indicate whether it needs
>   DDL replication (and often filter functions, since we have lots of
>   sub-types where some need replication and some don't)
> * Handling DDL affecting global objects in pg_global correctly, like
>   statements affecting roles, grants, database security labels etc.
>   There's no one right answer for this; it depends on the deployment and
>   requires the user to cooperate.
> * Correct handling of transactions that mix DDL and DML (mostly only an
>   issue for multimaster)
> * Identifying statements that target a mix of replicated and
>   non-replicated objects and handling them appropriately, including for
>   CASCADEs
> * Gracefully handling DDL statements that mix TEMPORARY and persistent
>   targets. We can do this OK for DROPs, but it still requires care;
>   anything else gets messier.
> * The lack of hooks into table rewrite operations, and the extremely
>   clumsy and inefficient way logical decoding currently exposes the
>   temp-table data during decoding of rewrites, makes table-rewriting DDL
>   difficult and impractical to handle correctly. In pglogical we punt on
>   it entirely and refuse to permit DDL that would rewrite a table, except
>   where we can prove it relies only on immutable inputs so we can discard
>   the upstream rewrite and rely on statement replication.
> * As a consequence of the above, reliably determining whether a given
>   statement will cause a table rewrite
> * Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
> * Handling TRUNCATE's half-DDL, half-DML state, and doing something
>   sensible for TRUNCATE ... CASCADE
> * Probably more I've forgotten
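>
> As promised above, a minimal sketch of the capture side, assuming the v13
> ProcessUtility_hook signature. capture_ddl() is a hypothetical stand-in
> for whatever queues the statement into the replication stream:
>
>     /* Hypothetical: queue a DDL statement for replication. */
>     extern void capture_ddl(Node *stmt, const char *queryString);
>
>     static ProcessUtility_hook_type prev_ProcessUtility = NULL;
>     static int  ddl_capture_depth = 0;
>
>     static void
>     ddlrep_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
>                           ProcessUtilityContext context,
>                           ParamListInfo params, QueryEnvironment *queryEnv,
>                           DestReceiver *dest, QueryCompletion *qc)
>     {
>         /*
>          * Capture only top-level statements; re-entrant calls made by
>          * ALTER TABLE sub-commands etc. must be skipped or we'd
>          * replicate the same DDL twice.
>          */
>         if (ddl_capture_depth == 0 && context == PROCESS_UTILITY_TOPLEVEL)
>             capture_ddl(pstmt->utilityStmt, queryString);
>
>         ddl_capture_depth++;
>         PG_TRY();
>         {
>             if (prev_ProcessUtility)
>                 prev_ProcessUtility(pstmt, queryString, context, params,
>                                     queryEnv, dest, qc);
>             else
>                 standard_ProcessUtility(pstmt, queryString, context, params,
>                                         queryEnv, dest, qc);
>         }
>         PG_FINALLY();
>         {
>             ddl_capture_depth--;
>         }
>         PG_END_TRY();
>     }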
>
>
> If we don't handle these, then any logical change-log archives will become
> largely useless as soon as there's any schema change.
>
> So we kind of have to solve DDL replication first IMO.
>
> Some consideration is also required for metadata management. Right now
> relation and type metadata have session lifetime, but you'd want to be able
> to discard old logical change-stream archives and have the later ones still
> be usable. So we'd need to define some kind of restartpoint where we repeat
> the metadata, or we'd have to support externalizing the metadata so it can
> be retained when the main change archives get aged out.
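>
> To illustrate the restartpoint approach - all names here
> (RelMetaCacheEntry, relmeta_cache_lookup, emit_relation_metadata,
> last_restartpoint_lsn) are hypothetical, not any existing plugin's API:
>
>     typedef struct RelMetaCacheEntry
>     {
>         Oid         relid;
>         bool        sent;
>         XLogRecPtr  sent_lsn;
>     } RelMetaCacheEntry;
>
>     /*
>      * Before emitting a row change, repeat the relation metadata if it
>      * hasn't been sent since the last restartpoint, so a reader that
>      * starts at the newest archive chunk can decode it without needing
>      * any of the older, possibly aged-out chunks.
>      */
>     static void
>     maybe_emit_metadata(LogicalDecodingContext *ctx, Relation rel)
>     {
>         RelMetaCacheEntry *entry = relmeta_cache_lookup(rel);
>
>         if (!entry->sent || entry->sent_lsn < last_restartpoint_lsn)
>         {
>             emit_relation_metadata(ctx, rel);
>             entry->sent = true;
>             entry->sent_lsn = ctx->write_location;
>         }
>     }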
>
> We'd also need to separate the existing apply worker into a "receiver" and
> an "apply/writer" part, so the wire-protocol handling isn't tightly coupled
> with the change-apply code; that's what makes it possible to actually
> consume those archives and apply them to the database. In pglogical3 we did
> that by splitting them into two processes connected by a shm_mq.
> Originally the process split was optional and you could run a combined
> receiver/writer process without the shm_mq if you wanted, but we quickly
> found it difficult to reliably handle locking issues etc. that way, so the
> writers all moved out-of-process.
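>
> The receiver/writer boundary is then just a stream of protocol messages
> over the shm_mq. Roughly like the below (queue setup via DSM and most
> error paths omitted; apply_message() is a hypothetical dispatch function):
>
>     /* Hypothetical: decode and apply one protocol message. */
>     extern void apply_message(void *data, Size len);
>
>     /*
>      * Receiver side: forward each decoded protocol message to the
>      * out-of-process writer. shm_mq_send() blocks when the ring buffer
>      * is full, giving natural flow control.
>      */
>     static void
>     forward_to_writer(shm_mq_handle *mqh, const void *msg, Size len)
>     {
>         if (shm_mq_send(mqh, len, msg, false) != SHM_MQ_SUCCESS)
>             ereport(ERROR,
>                     (errmsg("logical writer exited unexpectedly")));
>     }
>
>     /*
>      * Writer side: apply messages with no knowledge of the wire
>      * protocol or of where the stream came from - a network socket
>      * today, an archive reader later.
>      */
>     static void
>     writer_main_loop(shm_mq_handle *mqh)
>     {
>         for (;;)
>         {
>             Size    len;
>             void   *data;
>
>             if (shm_mq_receive(mqh, &len, &data, false) != SHM_MQ_SUCCESS)
>                 break;          /* sender detached; shut down */
>             apply_message(data, len);
>         }
>     }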
>
> That was done mainly to make it possible to support parallelism in logical
> decoding apply. But we also have the intention of supporting an alternative
> reader process that can ingest "logical archives" and send them to the
> writer to apply them, as if they'd been received from the on-wire stream.
> That's not implemented at this time though. It'd be useful for a number of
> things:
>
> * PITR-style logical replay and recovery
> * Ability to pre-decode a txn once on the upstream, then send the buffered
>   protocol stream to multiple subscribers, saving on logical decoding and
>   reorder buffering overheads and write-multiplication costs
> * Ability to ingest change-streams generated by non-Postgres sources, so
>   we could support streaming foreign-data ingestion, streaming OLAP and
>   data warehousing, etc.
>
> To make logical PITR more useful we'd also want to be a bit more tolerant
> of schema divergence, though that's not overly hard to do:
>
> * Fill defaults for downstream columns if no value is present for the
>   column in the upstream row and the downstream column is nullable or has
>   a default (I think built-in logical rep does this one already)
> * Ignore values for columns in the upstream data if the downstream table
>   lacks the column and the upstream value is null
> * Optionally allow apply to be configured to ignore non-null data in
>   upstream columns that are missing on the downstream
> * Optionally allow apply to be configured to drop rows on the floor if
>   the downstream table is missing
> * Policies for handling data conflicts like duplicate PKs
>
> and we'd probably want ways to filter the apply data-stream to apply
> changes for only a subset of tables, rows, etc., at least in a later
> version.
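>
> A sketch of the first couple of fallbacks on the apply side. UpstreamRow,
> UpstreamValue and find_upstream_value() are hypothetical; real code would
> also consult the configured policies above:
>
>     typedef struct UpstreamRow UpstreamRow;    /* decoded incoming row */
>
>     typedef struct UpstreamValue
>     {
>         Datum       datum;
>         bool        isnull;
>     } UpstreamValue;
>
>     /* Hypothetical: NULL if the upstream row lacks the column. */
>     extern UpstreamValue *find_upstream_value(UpstreamRow *row,
>                                               const char *colname);
>
>     /*
>      * Build the local tuple for an incoming row, tolerating schema
>      * divergence: downstream-only columns become NULL (or the default,
>      * filled in by a later step), and upstream-only columns are simply
>      * never looked up.
>      */
>     static void
>     build_local_tuple(TupleTableSlot *slot, Relation rel, UpstreamRow *row)
>     {
>         TupleDesc   desc = RelationGetDescr(rel);
>
>         for (int i = 0; i < desc->natts; i++)
>         {
>             Form_pg_attribute att = TupleDescAttr(desc, i);
>             UpstreamValue *val;
>
>             if (att->attisdropped)
>             {
>                 slot->tts_isnull[i] = true;
>                 continue;
>             }
>
>             val = find_upstream_value(row, NameStr(att->attname));
>             if (val != NULL)
>             {
>                 slot->tts_values[i] = val->datum;
>                 slot->tts_isnull[i] = val->isnull;
>             }
>             else
>                 slot->tts_isnull[i] = true;
>         }
>     }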
>
> None of this is insurmountable. Most or all of the DDL replication support
> and divergence-tolerance stuff is already done in production deployments
> using pglogical3 and bdr3.
>
> While I can't share the code, I am happy to share the experience I have
> gained from my part in working on these things, as you've probably
> recently seen with the wiki article I wrote on physical/logical failover
> interop.
>
> You're free to take information like this and use it in wiki articles too.
>
> Right now I won't be able to launch into writing big patches for these
> things, but I'll do my best to share what I can and review things.
>
> > This seems like a wild idea. But really, we have super expensive NVMe
> > drives for the OLTP workload, and use these devices to store a buffer
> > for data to be dumped into a MapReduce\YT analytical system.
>
> It's not a wild idea at all, as noted above.
>
> In pglogical3 we already support streaming decoded WAL data to alternative
> writer downstreams including RabbitMQ and Kafka via writer plugins.
>