adding and upgrading metapages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: adding and upgrading metapages
Date: 2012-05-23 19:20:45
Message-ID: CA+TgmobwLd3O77isSM=_9Dh_aTXpMuCCYoe-qyUVJGXM6PNGwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 23, 2012 at 2:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> In any case, I would've expected we'd end up changing all these page
> formats if we go over to your concept of a metapage for everything.
> Fixing indexes that already have some kind of metapage will surely be
> trivial compared to fixing heaps.

And on that note... let's talk (more) about how we do that. As far
as I can see, there are basically three possible ways we could add or
upgrade metapages: (1) put the logic in the old server, so that you
convert everything to the new format first and then upgrade; (2) put
the logic in pg_upgrade, so that the format conversion happens while
pg_upgrade is running; or (3) put the logic in the new server, so that
you upgrade first and then convert. Various hybrid options are also
possible. I am inclined to reject (1) out of hand because it would
require back-patching the format conversion code to every server
version from which we wish to support upgrades, which doesn't sound
prudent, and it would require people to be running a sufficiently-new
minor release of those server versions in order to upgrade, which is
inconvenient.

(2) is appealing by virtue of avoiding the need to simultaneously
understand multiple page formats. The major disadvantages are (a) we
can't do anything very complicated without a running server, and (b)
the database is probably toast if we crash midway through the
operation, since pg_upgrade can't write WAL. The latter problem
likely isn't really an issue in copy mode, but in link mode it seems
pretty serious: a badly timed crash can eat all of your data. We
could get around this, maybe, by having pg_upgrade copy files rather
than link them if a metapage must be added (but that'll be slower), or
we could invent some kind of WAL specifically for pg_upgrade (write a
record of the pre and post images of all the blocks we're changing
somewhere and fsync it before beginning to really make the changes; if
the server crashes midway through or the writes fail, you can run
pg_upgrade --undo to try to revert the changes, or something like
that).

(3) lets you make use of the server infrastructure (WAL, error
recovery, etc.) and allows mode complex changes to be made, but adds
complexity to the server.

It seems pretty clear that for the index types that already have
metapages (btree, gin, hash, spgist), things are not too bad. We have
to decide exactly where to rewrite the first block of the relation,
but that single-block change is all we need to do, so it's relatively
isolated. So the hard cases are gist and heaps. I asked Heikki for
suggestions on those cases and he proposed the following ideas:

- For GIST, Heikki proposed that we relocate the root page (which is
currently at page 0) to a new page at the end of the relation (or
maybe to a free page), overwrite the root page with a metapage, and
put a pointer to the root into the metapage; this would mimic what we
already do for btree. I think this is a relatively clean solution,
and it's clearly simple enough to be done inside pg_upgrade if we want
to avoid having the server know about it.

- For the heap, Heikki proposed that we do, well, basically the same
thing. Relocate block 0 to the end of the relation or to a free
block, store the block to which it was moved in the metapage, and
teach the index-scan machinery that whenever it sees a TID with block
0 it should replace the block number with the block number to which
the original block 0 was relocated. That's kind of a hack, but Heikki
likes it because it potentially lets most of the work happen inside
pg_upgrade. (We could make ALL the work happen inside pg_upgrade by
letting it scan the indexes and fix up the TIDs, but that's not a lot
more work that has to be done while the server is off-line.) The only
real downside of this method is that the check for block 0 has to be
done every time we fetch a TID from the index, which is a little
annoying, but then again if page 0 is going to be reserved for a
metapage we might want to guard against finding that value in an index
TID anyway.

- Alternatively, we could handle the heap case by taking an
AccessExclusiveLock and reinserting all the tuples in block 0. This
clearly isn't something pg_upgrade would do, so presumably the new
server would have to do it. One argument in favor of this approach is
that a number of the things for which we might really want a metapage
(like, the state we need to track to make an unlogged table logged or
the other way around) aren't going to be needed until somebody runs a
DDL command on the relation anyway. So even if a lot of time goes by
before the metapage gets added, maybe it doesn't really matter; and
when it does get added, it'll be at a time when we had to lock the
relation anyway for other reasons. However, this argument kind of
goes out the window if you want to use the metapage for stuff like
relfrozenxid or the initial segment of the visibility map, which I do.
So I'm inclined to think this is a dead end unless someone has an
idea for how to rescue it.

- A third possibility is that, when pg_upgrade is used, we could
arrange to write the metadata for non-system-catalog tables into a
separate relation fork. Any new relations created after the upgrade
would store the metadata in block 0 of the main fork, and running
CLUSTER, VACUUM FULL, or a rewriting ALTER TABLE command on the
relation would incorporate the metapage into the main fork as well.
So the extra metadata fork would only exist for relations brought over
via pg_upgrade that had not subsequently been subjected to a rewriting
operation. Under this design, the metapage always exists, but in some
cases it may be located in a different relation fork. The advantage
of this idea is just that index scans need not have any kind of
special case for block 0: but on the flip side, everyone who uses the
metapage for anything needs to be able to find it in one of two
possible locations. That might not be too bad, though: there
shouldn't be that many people who need direct access to the metapage
information; mostly, we should probably be keeping a copy in each
backend's relcache.

Thoughts, preferences, other ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Browse pgsql-hackers by date

  From Date Subject
Next Message Alastair Turner 2012-05-23 19:21:11 Re: [RFC] Interface of Row Level Security
Previous Message Robert Haas 2012-05-23 19:03:45 Re: Readme of Buffer Management seems to have wrong sentence