Reducing the size of BufferTag & remodeling forks

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Reducing the size of BufferTag & remodeling forks
Date:	2015-07-02 13:36:19
Message-ID:	20150702133619.GB16267@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

I've complained a number of times that our BufferTag is ridiculously
large:
typedef struct buftag
{
    RelFileNode rnode;      /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;   /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
    Oid spcNode;            /* tablespace */
    Oid dbNode;             /* database */
    Oid relNode;            /* relation */
} RelFileNode;

That amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20 bytes is rather
branch intensive, and shows up noticeably on profiles. It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.
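
To make the 20 bytes concrete, here is a minimal standalone sketch. The
typedefs are stand-ins chosen for illustration (ForkNumber is an enum in
the tree, modelled here as int32_t), and buffer_tags_equal() is a
hypothetical helper name, not the comparison code used in the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs; each is assumed to be 4 bytes. */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t ForkNumber;     /* an enum in PostgreSQL, int32_t here */

typedef struct RelFileNode
{
    Oid spcNode;                /* tablespace */
    Oid dbNode;                 /* database */
    Oid relNode;                /* relation */
} RelFileNode;

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;

/* Five 4-byte fields with 4-byte alignment: no padding, 20 bytes total. */
static_assert(sizeof(BufferTag) == 20, "BufferTag should be 20 bytes");

/* Hypothetical helper: field-by-field equality, up to five comparisons. */
bool
buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum == b->forkNum &&
           a->blockNum == b->blockNum;
}
```

Every lookup in the buffer mapping has to hash all five fields and then
confirm a match with a comparison along these lines.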

The buffer tag is currently used in two situations:

1) Dealing with the buffer mapping, we need to identify the underlying
file uniquely and we need the block number (8 bytes).

2) When writing out a block we need, in addition to 1), to have
information about where to store the file. That requires the
tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.

Additionally you might have noticed that the above description also
disregards relation forks.

I think we should work towards 1) being sufficient for its purpose. My
suggestion to get there is twofold:

1) Introduce a shared pg_relfilenode table. Every table, even
shared/nailed ones, gets an entry therein. It's there to make it
possible to uniquely allocate relfilenodes across databases &
tablespaces.

2) Replace relation forks, with the exception of the init fork which is
special anyway, with separate relfilenodes. These would be stored in
separate columns in pg_class.

This scheme has a number of advantages: We don't need to look at the
filesystem anymore to find out whether a relfilenode exists. The buffer
tags are 8 bytes. The number of stats doesn't scale O(#forks *
#relations) anymore, allowing us to add additional forks more easily.
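
A rough sketch of what an 8 byte tag could look like under this scheme.
This assumes relfilenodes are unique cluster-wide and every former fork
has its own relfilenode; CompactBufferTag, compact_tag_key() and
compact_tags_equal() are names made up for illustration, not existing
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs; each is assumed to be 4 bytes. */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/*
 * Hypothetical compact tag. It only works if relNode alone identifies
 * the file, i.e. no tablespace, database or fork number is needed.
 */
typedef struct CompactBufferTag
{
    Oid         relNode;        /* cluster-wide unique relfilenode */
    BlockNumber blockNum;       /* block number within that file */
} CompactBufferTag;

static_assert(sizeof(CompactBufferTag) == 8, "tag should be 8 bytes");

/* Pack both fields into one 64-bit key usable for hashing or ordering. */
uint64_t
compact_tag_key(CompactBufferTag tag)
{
    return ((uint64_t) tag.relNode << 32) | tag.blockNum;
}

/* Equality becomes a single 64-bit comparison. */
bool
compact_tags_equal(CompactBufferTag a, CompactBufferTag b)
{
    return compact_tag_key(a) == compact_tag_key(b);
}
```

The tablespace and database needed when writing a block out would then
have to be looked up from the relfilenode instead of being carried in
the tag.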

I think something akin to init forks is going to survive because they
have to be copied without access to the catalogs - but that's fine, they just
aren't allowed to go through shared buffers. Afaics that's not a
problem.

Obviously this is a rather high-level description, but right now this
sounds doable to me.

Thoughts?

- Andres

Responses

Re: Reducing the size of BufferTag & remodeling forks at 2015-07-02 13:51:59 from Tom Lane
Re: Reducing the size of BufferTag & remodeling forks at 2015-07-03 16:59:07 from Alvaro Herrera
Re: Reducing the size of BufferTag & remodeling forks at 2015-09-12 12:12:26 from Simon Riggs
