Re: heap metapages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: heap metapages
Date: 2012-05-22 01:50:37
Message-ID: CA+TgmoZXT=Yw1qvhdFhVcBkgo3LJfWyeu=x2DG7Vd626T52fPQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 21, 2012 at 3:15 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I very much like the idea of a common framework to support multiple
> requirements. If we can view a couple of other designs as well it may
> quickly become clear this is the right way. In any case, the topics
> discussed here are important ones, so thanks for covering them.

I considered a couple of other possibilities:

- We could split pg_class into pg_class and pg_class_nt
(non-transactional). This would solve problem #1 (allowing
pg_class/pg_attribute entries for system catalogs to be shared across
all databases) but it doesn't do anything for problem #3 (excessive
inode consumption) or problem #4 (watermarking for crash recovery) and
isn't very good for problem #2 (maintenance of non-transactional
state) either, since part of the hope here is that we'd be able to get
at this state during recovery even when HS is not used.

- In lieu of adding an entire meta-page, we could just add some
special space to the first page, or maybe to every N'th page. Adding
space to every N'th page would be the best solution to problem #4
(watermarking), and adding even a small amount of state to the first
page would be enough for problems #1 and #2. However, I don't think
it would work for problem #3 (reducing inode consumption) because even
if the special space is pretty big, you won't really be able to mix
tuples and visibility map information (for example) on the same page
without complicating the buffer locking regimen unbearably. The dance
we have to do to make the visibility map crash-safe is already a lot
hairier than I'd really prefer. Also, I think we really need a lot of
this info for both tables and indexes, and I think it will be simpler
to decide that everything has a metapage rather than to decide that
some things have a metapage and some things just have a little extra
stuff crammed into the special space.

- I considered the idea of designing a crash-safe persistent hash
table, that would be sort of like a table but really more like a
key-value store with keys and values being C structs. This would be
similar to the pg_class/pg_class_nt split idea, except that
pg_class_nt would be one of these new crash-safe persistent hash table
objects, rather than a normal table; and there's a decent possibility
we'd find other applications for such a beast. However, it wouldn't
help with problem #3 or problem #4; and Tom seemed to be gravitating
toward the design in my OP rather than this idea. One point that was
raised is that btree and hash indexes already have a metapage, so
sticking a little more data into it doesn't really cost anything; and
heap relations are pretty much going to end up nailing the visibility
map and free space map pages in cache, so it's not clear that this is
any less cache-efficient in those cases either. For all that, I kind
of like the idea of a persistent hash table object, which I suspect
could be used to solve some problems not on the list in my OP as well
as some of the ones that are there, but I don't feel too bad laying
that idea aside for now. If it's really a good idea, it'll come up
again.
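
To make the watermarking comparison in the second item concrete, here is a
minimal sketch (not taken from any patch) that works out which 1 GB segment
files of a relation would contain at least one page carrying identifying
information. It assumes the default build settings of 8 kB blocks and
131072 blocks per segment; the function names and the value of N are
hypothetical.

```python
RELSEG_SIZE = 131072  # blocks per segment file: 1 GB / 8 kB at default settings


def total_segments(nblocks):
    """Number of segment files needed to hold nblocks blocks."""
    return (nblocks + RELSEG_SIZE - 1) // RELSEG_SIZE


def segments_with_identity(nblocks, every_n=None):
    """Segment numbers that contain at least one identifying page.

    every_n=None models a single metapage in block 0.
    Otherwise blocks 0, N, 2N, ... are assumed to carry a watermark.
    """
    if nblocks <= 0:
        return set()
    if every_n is None:
        return {0}
    return {block // RELSEG_SIZE for block in range(0, nblocks, every_n)}


# A relation a little over 3 GB spans four segment files.
nblocks = 3 * RELSEG_SIZE + 10
metapage_only = segments_with_identity(nblocks)
watermarked = segments_with_identity(nblocks, every_n=1024)
```

With a single metapage only segment 0 can be identified, while a watermark
every 1024 blocks leaves something to find in all four segment files.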

> What springs immediately to mind is why this would not be just another fork.

This was pretty much the first thing I considered, but it makes
problem #3 worse, and I really don't want to do that. I think 3 inodes
per table is already too many, and I expect the problem to get worse.
I feel like every third crazy feature idea I come up with involves
creating yet another relation fork, and I'm pretty sure I won't be the
last person to think about such things, and so we're probably headed
that way, but I think we'd better try to hold the line as much as is
reasonably possible.
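
For scale, a minimal sketch of the files (and hence inodes) one relation
occupies today. It assumes the usual on-disk naming: the main fork is named
after the relfilenode, the free space map and visibility map forks add
`_fsm` and `_vm` suffixes, and each additional 1 GB segment adds a `.N`
suffix. The helper name is hypothetical and the segment accounting is
simplified.

```python
RELSEG_SIZE = 131072  # blocks per segment file at default build settings


def relation_files(relfilenode, fork_blocks):
    """List the file names for one relation.

    fork_blocks maps a fork name ("main", "fsm", "vm") to its size in
    blocks; a fork that is present is assumed to have at least one file.
    """
    files = []
    for fork, nblocks in fork_blocks.items():
        base = str(relfilenode) if fork == "main" else f"{relfilenode}_{fork}"
        nsegs = max(1, -(-nblocks // RELSEG_SIZE))  # ceiling division
        files.append(base)
        files.extend(f"{base}.{seg}" for seg in range(1, nsegs))
    return files


small_table = relation_files(16384, {"main": 10, "fsm": 3, "vm": 1})
```

Even a ten-block table costs three inodes here, and every new fork would
add one more per relation.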

One random idea would be to have pg_upgrade create a special one-block
relation fork for the heap metapage that would get folded into the
main fork the first time the table gets rewritten. So we'd add
another fork, but only as a hack to facilitate in-place upgrade.

> This is important. I like the idea of breaking down the barriers
> between databases to allow it to be an option for one backend to
> access tables in multiple databases. The current mechanism doesn't
> actually prevent looking at data from other databases using internal
> APIs, so full security doesn't exist. It's a very common user
> requirement to wish to join tables stored in different databases,
> which ought to be possible more cleanly with correct privileges.

As Stephen says, this would require a lot more than just making
pg_class_shared/pg_attribute_shared work, and I don't particularly
believe it's a good idea anyway. That having been said, if we decided
we wanted to go this way in some future release, having done this
first couldn't but help.

> I thought there was a patch that put that info in a separate table 1:1
> with pg_class.
>
> Not very sure why a metapage is better than a catalog table.

Mostly because there's no chance of the startup process accessing a
catalog table during recovery, but it can read a metapage.

> We would
> still want a view that allows us to access that data as if it were a
> catalog table.

Agreed. Tom said the same.

> Again, there are other ways to optimise the FSM for small tables.

True, but that doesn't make this a bad one.

>> 4. Every once in a while, somebody's database ends up in pieces in
>> lost+found.  We could make this a bit easier to recover from by
>> including the database OID, relfilenode, and table OID in the
>> metapage.  This wouldn't be perfect, since a relation over one GB
>> would still only have one metapage, so additional relation segments
>> would still be a problem.  But it would still be a huge improvement
>> over the status quo: some very large percentage of the work of putting
>> everything back where it goes could probably be done by a Perl script
>> that read all the metapages, and if you needed to know, say, which
>> file contained pg_class, that would be a whole lot easier, too.
>
> That sounds like the requirement that is driving this idea.

No, I listed it fourth because I think it's the least interesting
benefit. It IS a benefit, but if this were the primary goal it would
be a LOT simpler to shove a few bytes into every N'th heap special
space. I coded up a patch for that on my other laptop, and then
reformatted the hard drive without saving the patch (brilliant!), so I
no longer have working code for this. But it's not that hard. I am
much more interested in benefit #2, the ability to maintain
non-transactional state that can be read by the startup process during
recovery, than I am in this goal. Unfortunately that's harder, but I
think it's worth the effort.
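
As an illustration of the recovery script mentioned in the quoted text, here
is a minimal sketch of reading identity information back out of the first
block of an orphaned file. Everything about the layout is an assumption made
for the example: the magic number, the field order, the little-endian
encoding, and placing the fields directly after a 24-byte page header. It is
not the lost patch and not a proposed on-disk format.

```python
import struct

BLCKSZ = 8192
PAGE_HEADER_SIZE = 24     # assumed offset: directly after the page header
META_MAGIC = 0x4D455441   # hypothetical magic number
META_FORMAT = "<IIII"     # hypothetical: magic, database OID, relfilenode, table OID


def make_metapage(db_oid, relfilenode, table_oid):
    """Build a fake first block carrying the hypothetical identity fields."""
    page = bytearray(BLCKSZ)
    struct.pack_into(META_FORMAT, page, PAGE_HEADER_SIZE,
                     META_MAGIC, db_oid, relfilenode, table_oid)
    return bytes(page)


def read_identity(first_block):
    """Return the identity fields as a dict, or None if the magic is absent."""
    if len(first_block) < PAGE_HEADER_SIZE + struct.calcsize(META_FORMAT):
        return None
    magic, db_oid, relfilenode, table_oid = struct.unpack_from(
        META_FORMAT, first_block, PAGE_HEADER_SIZE)
    if magic != META_MAGIC:
        return None
    return {"database_oid": db_oid,
            "relfilenode": relfilenode,
            "table_oid": table_oid}


def restore_path(identity):
    """Where the file belongs, assuming the default tablespace."""
    return f"base/{identity['database_oid']}/{identity['relfilenode']}"


# Table OID 1259 is pg_class; the other two numbers are made up.
page = make_metapage(db_oid=16384, relfilenode=12345, table_oid=1259)
identity = read_identity(page)
```

A script along these lines would read the first block of each file in
lost+found, call something like read_identity(), and move the file to
restore_path(). Segment files after the first would still need to be
matched up some other way.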

> You don't have to rewrite the table, you just need to update the rows
> so they migrate to another block.

True.

> That seems easy enough, but still not sure why you wouldn't just use
> another fork. Or another idea would be to have the first page have a
> non-zero pd_special.

See above for a discussion of these points.

> I know you were recording what was discussed as an initial starting
> point. Looks like a good set of problems to solve.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
