From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Reducing the size of BufferTag & remodeling forks |
Date: | 2015-07-02 13:36:19 |
Message-ID: | 20150702133619.GB16267@alap3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
I've complained a number of times that our BufferTag is ridiculously
large:
typedef struct buftag
{
    RelFileNode rnode;      /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;   /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace */
    Oid dbNode;     /* database */
    Oid relNode;    /* relation */
} RelFileNode;
That amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20 bytes is rather
branch intensive, and shows up noticeably on profiles. It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.
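To make the size and comparison cost concrete, here is a minimal standalone sketch. The typedefs are simplified stand-ins for the real headers (it assumes Oid and BlockNumber are 32 bits wide and that the enum takes 4 bytes, as on common platforms), and buffer_tags_equal is a hypothetical helper written for illustration, not the macro from the tree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the real definitions (assumption: 32-bit each). */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef enum ForkNumber
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM
} ForkNumber;

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace */
    Oid dbNode;     /* database */
    Oid relNode;    /* relation */
} RelFileNode;

typedef struct buftag
{
    RelFileNode rnode;      /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;   /* blknum relative to begin of reln */
} BufferTag;

/* Five 4-byte fields: 12 + 4 + 4 = 20 bytes, no padding expected. */
static_assert(sizeof(BufferTag) == 20, "BufferTag is expected to be 20 bytes");

/*
 * Hypothetical helper: field-by-field equality. Each && is a potential
 * branch, which is the cost complained about above.
 */
static bool
buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    return a->rnode.spcNode == b->rnode.spcNode &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.relNode == b->rnode.relNode &&
           a->forkNum == b->forkNum &&
           a->blockNum == b->blockNum;
}
```

A compiler may merge some of these comparisons, but 20 bytes does not fit in a single machine word, so at least three loads and compares remain on 64-bit hardware.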
The buffer tag is currently used in two situations:
1) Dealing with the buffer mapping, we need to identify the underlying
file uniquely and we need the block number (8 bytes).
2) When writing out a block we need, in addition to 1), to have
information about where to store the file. That requires the
tablespace and database.
You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.
Additionally you might have noticed that the above description also
disregards relation forks.
I think we should work towards 1) being sufficient for its purpose. My
suggestion to get there is twofold:
1) Introduce a shared pg_relfilenode table. Every table, even
shared/nailed ones, gets an entry therein. It's there to make it
possible to uniquely allocate relfilenodes across databases &
tablespaces.
2) Replace relation forks, with the exception of the init fork which is
special anyway, with separate relfilenodes, stored in separate
columns in pg_class.
This scheme has a number of advantages: We don't need to look at the
filesystem anymore to find out whether a relfilenode exists. The buffer
tags are 8 bytes. The number of stats doesn't scale O(#forks *
#relations) anymore, allowing us to add additional forks more easily.
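As a rough sketch of what an 8 byte tag could look like under this scheme: SmallBufferTag, small_tag_as_u64 and small_tags_equal are hypothetical names used only for illustration, and the sketch assumes a globally unique 32-bit relfilenode plus a 32-bit block number, with no fork number because every former fork has its own relfilenode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the real definitions (assumption: 32-bit each). */
typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/* Hypothetical 8-byte tag: globally unique relfilenode + block number. */
typedef struct SmallBufferTag
{
    Oid         relNode;    /* unique across databases & tablespaces */
    BlockNumber blockNum;   /* blknum relative to begin of reln */
} SmallBufferTag;

static_assert(sizeof(SmallBufferTag) == 8, "tag is expected to be 8 bytes");

/* Reinterpret the tag as one 64-bit word; memcpy avoids aliasing issues. */
static uint64_t
small_tag_as_u64(const SmallBufferTag *tag)
{
    uint64_t v;

    memcpy(&v, tag, sizeof(v));
    return v;
}

/* Equality becomes a single 64-bit comparison instead of five field checks. */
static bool
small_tags_equal(const SmallBufferTag *a, const SmallBufferTag *b)
{
    return small_tag_as_u64(a) == small_tag_as_u64(b);
}
```

The same 64-bit value could also be fed directly to a hash function or used as a key in a tree or radix structure, which is where the smaller tag would help the buffer mapping.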
I think something akin to init forks is going to survive because they
have to be copied without access to the catalogs - but that's fine, they just
aren't allowed to go through shared buffers. Afaics that's not a
problem.
Obviously this is a rather high-level description, but right now this
sounds doable to me.
Thoughts?
- Andres