From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <peter(at)2ndquadrant(dot)com>, hlinnakangas(at)vmware(dot)com |
Subject: | Re: First draft of snapshot snapshot building design document |
Date: | 2012-10-18 15:20:27 |
Message-ID: | 201210181720.27616.andres@2ndquadrant.com |
Lists: | pgsql-hackers |
On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote:
> On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:
> >> The design document [2] really just explains the problem (which is the
> >> need for catalog metadata at a point in time to make sense of heap
> >> tuples), without describing the solution that this patch offers with
> >> any degree of detail. Rather, [2] says "How we build snapshots is
> >> somewhat intricate and complicated and seems to be out of scope for
> >> this document", which is unsatisfactory. I look forward to reading the
> >> promised document that describes this mechanism in more detail.
> >
> > Here's the first version of the promised document. I hope it answers most
> > of the questions.
> >
> > Input welcome!
>
> I haven't grokked all of this in its entirety, but I'm kind of
> uncomfortable with the relfilenode -> OID mapping stuff. I'm
> wondering if we should, when logical replication is enabled, find a
> way to cram the table OID into the XLOG record. It seems like that
> would simplify things.
>
> If we don't choose to do that, it's worth noting that you actually
> need 16 bytes of data to generate a unique identifier for a relation,
> as in database OID + tablespace OID + relfilenode# + backend ID.
> Backend ID might be ignorable because WAL-based logical replication is
> going to ignore temporary relations anyway, but you definitely need
> the other two. ...
Hm. I should take a look at the way temporary tables are represented. As you
say, it is not going to matter for WAL decoding, but still...
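For reference, the 16-byte figure quoted above is simply four 4-byte fields. A minimal sketch, assuming 4-byte OIDs and a 4-byte backend ID; the type and field names here are hypothetical and are not PostgreSQL's own definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;   /* assumption: OIDs are 4 bytes wide */

/* Hypothetical illustration of a fully qualified relation identifier. */
typedef struct ToyRelationId
{
    Oid     database;     /* database OID */
    Oid     tablespace;   /* tablespace OID */
    Oid     relfilenode;  /* relfilenode number */
    int32_t backend;      /* backend ID, only meaningful for temporary relations */
} ToyRelationId;

/* Four 4-byte fields, no padding expected on common platforms. */
static_assert(sizeof(ToyRelationId) == 16, "identifier should be 16 bytes");

static size_t
toy_relation_id_size(void)
{
    return sizeof(ToyRelationId);
}
```

Dropping the backend field, as discussed for WAL decoding, would leave 12 bytes.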
> Another thing to think about is that, like catalog snapshots,
> relfilenode mappings have to be time-relativized; that is, you need to
> know what the mapping was at the proper point in the WAL sequence, not
> what it is now. In practice, the risk here seems to be minimal,
> because it takes a while to churn through 4 billion OIDs. However, I
> suspect it pays to think about this fairly carefully because if we do
> ever run into a situation where the OID counter wraps during a time
> period comparable to the replication lag, the bugs will be extremely
> difficult to debug.
I think with rollbacks + restarts we might even be able to see the same
relfilenode again earlier than that.
> Anyhow, adding the table OID to the WAL header would chew up a few
> more bytes of WAL space, but it seems like it might be worth it to
> avoid having to think very hard about all of these issues.
I don't think it's necessary to change WAL logging here. The relfilenode mapping
is now looked up using the timetravel snapshot we've built, with (spcNode,
relNode) as the key, so the time-relativized lookup is "builtin". If we screw
that up, much more is broken anyway.
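To illustrate why a lookup keyed on (spcNode, relNode) becomes time-relativized once it runs under a historical snapshot, here is a minimal toy sketch. It is not code from the patch: the row layout, the `valid_from`/`valid_until` positions (standing in for real MVCC visibility), the function name, and the sample values are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for one version of a pg_class row. */
typedef struct ToyClassRow
{
    Oid      oid;            /* table OID */
    Oid      reltablespace;  /* plays the role of spcNode */
    Oid      relfilenode;    /* plays the role of relNode */
    uint32_t valid_from;     /* toy position at which this version appeared */
    uint32_t valid_until;    /* toy position at which it stopped being visible */
} ToyClassRow;

/* Arbitrary sample history: the same relfilenode is reused by a later table. */
static const ToyClassRow toy_history[] = {
    {16384, 1663, 50000, 10, 20},
    {16999, 1663, 50000, 30, UINT32_MAX},
};

/*
 * Return the OID of the row matching (spcNode, relNode) that was visible at
 * position as_of, or InvalidOid if no such row version existed then.
 */
static Oid
toy_lookup_relation(const ToyClassRow *rows, size_t nrows,
                    Oid spcNode, Oid relNode, uint32_t as_of)
{
    assert(rows != NULL || nrows == 0);

    for (size_t i = 0; i < nrows; i++)
    {
        const ToyClassRow *r = &rows[i];

        if (r->reltablespace == spcNode && r->relfilenode == relNode &&
            r->valid_from <= as_of && as_of < r->valid_until)
            return r->oid;
    }
    return InvalidOid;
}
```

With the sample history, the same key resolves to a different table depending on the position the lookup is performed at, which is the property the timetravel snapshot provides.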
Two problems are left:
1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is
stored for relfilenode if it's a shared or nailed table. That's not a problem
for the lookup because we've already checked the relmapper before that, so we
never look those up anyway. But it violates documented requirements of
syscache.c. Even after some looking I haven't found any problem that this could
cause.
2) We need to decide whether a HEAP[1-2]_* record did catalog changes when
building/updating snapshots. Unfortunately we also need to do this *before* we
have built the first snapshot. For now, treating all tables as catalog-modifying
before we have built the snapshot seems to work fine.
I think encoding the OID in the xlog header wouldn't help all that much here,
because I am pretty sure we want the set of "catalog tables" to be
extensible at some point...
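The conservative rule from problem 2, combined with an extensible set of catalog tables, could be sketched roughly as follows. This is only an illustration under stated assumptions: the state struct, the function name, and the sample relfilenode values are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

/* Hypothetical decoder state; not a structure from the patch. */
typedef struct ToyDecoderState
{
    bool       have_snapshot;        /* has the first snapshot been built? */
    const Oid *catalog_relfilenodes; /* extensible set of "catalog tables" */
    size_t     ncatalog;
} ToyDecoderState;

/* Arbitrary sample set of relfilenodes treated as catalog tables. */
static const Oid toy_catalog_filenodes[] = {11000, 11001};

/*
 * Decide whether a heap record touching relfilenode should be treated as a
 * catalog change. Before the first snapshot exists the set cannot be looked
 * up reliably, so every table is conservatively treated as catalog-modifying.
 */
static bool
toy_record_changes_catalog(const ToyDecoderState *state, Oid relfilenode)
{
    assert(state != NULL);

    if (!state->have_snapshot)
        return true;

    for (size_t i = 0; i < state->ncatalog; i++)
    {
        if (state->catalog_relfilenodes[i] == relfilenode)
            return true;
    }
    return false;
}
```

Keeping the set as data rather than as a fixed check is what would let it grow later without changing the record format.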
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services