Quick Links

Re: making relfilenodes 56 bits

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: making relfilenodes 56 bits
Date:	2022-06-30 19:24:04
Message-ID:	CA+TgmoaffGaEmEGjx1QQjL5c69iMpB4ytXsPAQ33XxtpgBfvWA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> PFA, the remaining set of patches. It might need to fix some
> indentation but lets first see how is the overall idea then we can
> work on it

So just playing around with this patch set, and also looking at the
code a bit, here are a few random observations:

- The patch assigns relfilenumbers starting with 1. I don't see any
specific problem with that, but I wonder if it would be a good idea to
start with a random larger value just in case we ever need some fixed
values for some purpose or other. Maybe we should start with 100000 or
something?

- If I use ALTER TABLE .. SET TABLESPACE to move a table around, then
the relfilenode changes each time, but if I use ALTER DATABASE .. SET
TABLESPACE to move a database around, the relfilenodes don't change.
So, what this guarantees is that if the same filename is used twice,
it will be for the same relation and not some unrelated relation.
That's enough to avoid the hazard described in the comments for
mdunlink(), because that scenario intrinsically involves confusion
caused by two relations using the same filename after an OID
wraparound. And it also means that if we pursue the idea of using an
end-of-recovery record in all cases, we don't need to start creating
tombstones during crash recovery. The forced checkpoint at the end of
crash recovery means we don't currently need to do that, but if we
change that, then the same hazard would exist there as we already have
in normal running, and this fixes it. However, I don't find it
entirely obvious that there are no hazards of any kind stemming from
repeated use of ALTER DATABASE .. SET TABLESPACE resulting in
filenames getting reused. On the other hand avoiding filename reuse
completely would be more work, not closely related to what the rest of
the patch set does, probably somewhat controversial in terms of what
it would have to do, and I'm not sure that we really need it. It does
seem like it would be quite a bit easier to reason about, though,
because the current guarantee is suspiciously similar to "we don't do
X, except when we do." This is not really so much a review comment for
Dilip as a request for input from others ... thoughts?

- Again, not a review comment for this patch specifically, but I'm
wondering if we could use this as infrastructure for a tool to clean
orphaned files out of the data directory. Suppose we create a file for
a new relation and then crash, leaving a potentially large file on
disk that will never be removed. Well, if the relfilenumber as it
exists on disk is not in pg_class and old enough that a transaction
inserting into pg_class can't still be running, then it must be safe
to remove that file. Maybe that's safe even today, but it's a little
hard to reason about it in the face of a possible OID wraparound that
might result in reusing the same numbers over again. It feels like
this makes easier to identify which files are old stuff that can never
again be touched.

- I might be missing something here, but this isn't actually making
the relfilenode 56 bits, is it? The reason to do that is to make the
BufferTag smaller, so I expected to see that BufferTag either used
bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
else that it just declared a single field for both as uint64 and used
accessor macros or static inlines to separate them out. But it doesn't
seem to do either of those things, which seems like it can't be right.
On a related note, I think it would be better to declare RelFileNumber
as an unsigned type even though we have no use for the high bit; we
have, equally, no use for negative values. It's easier to reason about
bit-shifting operations with unsigned types.

- I also think that the cross-version compatibility stuff in
pg_buffercache isn't quite right. It does values[1] =
ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it
ought to do is dependent on the output type. If the output type is
int8, then it ought to do values[1] = Int64GetDatum((int64)
fctx->record[i].relfilenumber), and if it's OID, then it ought to do
values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)).
The macro that you use needs to be based on the output SQL type, not
the C data type.

- I think it might be a good idea to allocate RelFileNumbers in much
smaller batches than we do OIDs. 8192 feels wasteful to me. It
shouldn't practically matter, because if we have 56 bits of bit space
and so even if we repeatedly allocate 2^13 RelFileNumbers and then
crash, we can still crash 2^41 times before we completely run out of
numbers, and 2 trillion crashes ought to be enough for anyone. But I
see little benefit from being so profligate. You can allocate an OID
as an identifier for a catalog tuple or a TOAST chunk, but a
RelFileNumber requires a filesystem operation, so the amount of work
that is needed to use up 8192 RelFileNumbers is a lot bigger than the
amount of work required to use up 8192 OIDs. If we dropped this down
to 128, or 64, or 256, would anything bad happen?

- Do we really want GetNewRelFileNumber() to call access() just for a
can't-happen scenario? Can't we catch this problem later when we
actually go to create the files on disk?

- The patch updates the comments in XLogPrefetcherNextBlock to talk
about relfilenumbers being reused rather than relfilenodes being
reused, which is fine except that we're sorta kinda not doing that any
more as noted above. I don't really know what these comments ought to
say instead but perhaps more than a mechanical update is in order.
This applies, even more, to the comments above mdunlink(). Apart from
updating the existing comments, I think that the patch needs a good
explanation of the new scheme someplace, and what it does and doesn't
guarantee, which relates to the point above about making sure we know
exactly what we're guaranteeing and why. I don't know where exactly
this text should be positioned yet, or what it should say, but it
needs to go someplace. This is a fairly significant change and needs
to be talked about somewhere.

- I think there's still a bit of a terminology problem here. With the
patch set, we use RelFileNumber to refer to a single, 56-bit integer
and RelFileLocator to refer to that integer combined with the DB and
TS OIDs. But sometimes in the comments we want to talk about the
logical sequence of files that is identified by a RelFileLocator, and
that's not quite the same as either of those things. For example, in
tableam.h we currently say "This callback needs to create a new
relation filenode for `rel`" and how should that be changed in this
new naming? We're not creating a new RelFileNumber - those would need
to be allocated, not created, as all the numbers in the universe exist
already. Neither are we creating a new locator; that sounds like it
means assembling it from pieces. What we're doing is creating the
first of what may end up being a series of similarly-named files on
disk. I'm not exactly sure how we can refer to that in a way that is
clear, but it's a problem that arises here and here throughout the
patch.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Re: making relfilenodes 56 bits at 2022-06-29 09:15:09 from Dilip Kumar

Responses

Re: making relfilenodes 56 bits at 2022-07-01 10:42:01 from Dilip Kumar

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alexander Korotkov	2022-06-30 19:34:38	Re: Add red-black tree missing comparison searches
Previous Message	Jacob Champion	2022-06-30 18:50:41	Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit