| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: making relfilenodes 56 bits | 
| Date: | 2022-07-01 10:42:01 | 
| Message-ID: | CAFiTN-sRqmyBVVVWXYcvnXST8Ps_jbS-icxKOxrm5KtvFyrTvw@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Fri, Jul 1, 2022 at 12:54 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Jun 29, 2022 at 5:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > PFA, the remaining set of patches.   It might need to fix some
> > indentation but lets first see how is the overall idea then we can
> > work on it
>
> So just playing around with this patch set, and also looking at the
> code a bit, here are a few random observations:
>
> - The patch assigns relfilenumbers starting with 1. I don't see any
> specific problem with that, but I wonder if it would be a good idea to
> start with a random larger value just in case we ever need some fixed
> values for some purpose or other. Maybe we should start with 100000 or
> something?
Yeah we can do that, I have changed to 100000.
> - If I use ALTER TABLE .. SET TABLESPACE to move a table around, then
> the relfilenode changes each time, but if I use ALTER DATABASE .. SET
> TABLESPACE to move a database around, the relfilenodes don't change.
> So, what this guarantees is that if the same filename is used twice,
> it will be for the same relation and not some unrelated relation.
> That's enough to avoid the hazard described in the comments for
> mdunlink(), because that scenario intrinsically involves confusion
> caused by two relations using the same filename after an OID
> wraparound. And it also means that if we pursue the idea of using an
> end-of-recovery record in all cases, we don't need to start creating
> tombstones during crash recovery. The forced checkpoint at the end of
> crash recovery means we don't currently need to do that, but if we
> change that, then the same hazard would exist there as we already have
> in normal running, and this fixes it. However, I don't find it
> entirely obvious that there are no hazards of any kind stemming from
> repeated use of ALTER DATABASE .. SET TABLESPACE resulting in
> filenames getting reused. On the other hand avoiding filename reuse
> completely would be more work, not closely related to what the rest of
> the patch set does, probably somewhat controversial in terms of what
> it would have to do, and I'm not sure that we really need it. It does
> seem like it would be quite a bit easier to reason about, though,
> because the current guarantee is suspiciously similar to "we don't do
> X, except when we do." This is not really so much a review comment for
> Dilip as a request for input from others ... thoughts?
Yeah that can be done, but maybe as a separate patch.  One option is
that when we will support the WAL method for the ALTER TABLE .. SET
TABLESPACE like we did for CREATE DATABASE, as part of that we will
generate the new relfilenumber.
> - Again, not a review comment for this patch specifically, but I'm
> wondering if we could use this as infrastructure for a tool to clean
> orphaned files out of the data directory. Suppose we create a file for
> a new relation and then crash, leaving a potentially large file on
> disk that will never be removed. Well, if the relfilenumber as it
> exists on disk is not in pg_class and old enough that a transaction
> inserting into pg_class can't still be running, then it must be safe
> to remove that file. Maybe that's safe even today, but it's a little
> hard to reason about it in the face of a possible OID wraparound that
> might result in reusing the same numbers over again. It feels like
> this makes easier to identify which files are old stuff that can never
> again be touched.
Correct.
> - I might be missing something here, but this isn't actually making
> the relfilenode 56 bits, is it? The reason to do that is to make the
> BufferTag smaller, so I expected to see that BufferTag either used
> bitfields like RelFileNumber relNumber:56 and ForkNumber forkNum:8, or
> else that it just declared a single field for both as uint64 and used
> accessor macros or static inlines to separate them out. But it doesn't
> seem to do either of those things, which seems like it can't be right.
> On a related note, I think it would be better to declare RelFileNumber
> as an unsigned type even though we have no use for the high bit; we
> have, equally, no use for negative values. It's easier to reason about
> bit-shifting operations with unsigned types.
Opps, somehow missed to merge that change in the patch.  Changed that
like below and adjusted the macros.
typedef struct buftag
{
Oid spcOid; /* tablespace oid. */
Oid dbOid; /* database oid. */
uint32 relNumber_low; /* relfilenumber 32 lower bits */
uint32 relNumber_hi:24; /* relfilenumber 24 high bits */
uint32 forkNum:8; /* fork number */
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
I think we need to break like this to keep the BufferTag 4 byte
aligned otherwise the size of the structure will be increased.
> - I also think that the cross-version compatibility stuff in
> pg_buffercache isn't quite right. It does values[1] =
> ObjectIdGetDatum(fctx->record[i].relfilenumber). But I think what it
> ought to do is dependent on the output type. If the output type is
> int8, then it ought to do values[1] = Int64GetDatum((int64)
> fctx->record[i].relfilenumber), and if it's OID, then it ought to do
> values[1] = ObjectIdGetDatum((Oid) fctx->record[i].relfilenumber)).
> The  macro that you use needs to be based on the output SQL type, not
> the C data type.
Fixed
> - I think it might be a good idea to allocate RelFileNumbers in much
> smaller batches than we do OIDs. 8192 feels wasteful to me. It
> shouldn't practically matter, because if we have 56 bits of bit space
> and so even if we repeatedly allocate 2^13 RelFileNumbers and then
> crash, we can still crash 2^41 times before we completely run out of
> numbers, and 2 trillion crashes ought to be enough for anyone. But I
> see little benefit from being so profligate. You can allocate an OID
> as an identifier for a catalog tuple or a TOAST chunk, but a
> RelFileNumber requires a filesystem operation, so the amount of work
> that is needed to use up 8192 RelFileNumbers is a lot bigger than the
> amount of work required to use up 8192 OIDs. If we dropped this down
> to 128, or 64, or 256, would anything bad happen?
This makes sense so I have changed to 64.
> - Do we really want GetNewRelFileNumber() to call access() just for a
> can't-happen scenario? Can't we catch this problem later when we
> actually go to create the files on disk?
Yeah we don't need to, actually we can completely get rid of
GetNewRelFileNumber() function and we can directly call
GenerateNewRelFileNumber() and in fact we can rename
GenerateNewRelFileNumber() to GetNewRelFileNumber().  So I have done
these changes.
> - The patch updates the comments in XLogPrefetcherNextBlock to talk
> about relfilenumbers being reused rather than relfilenodes being
> reused, which is fine except that we're sorta kinda not doing that any
> more as noted above. I don't really know what these comments ought to
> say instead but perhaps more than a mechanical update is in order.
Changed
> This applies, even more, to the comments above mdunlink(). Apart from
> updating the existing comments, I think that the patch needs a good
> explanation of the new scheme someplace, and what it does and doesn't
> guarantee, which relates to the point above about making sure we know
> exactly what we're guaranteeing and why. I don't know where exactly
> this text should be positioned yet, or what it should say, but it
> needs to go someplace. This is a fairly significant change and needs
> to be talked about somewhere.
For now, in v4_0004**, I have removed the comment which is explaining
why we need to keep the Tombstone file and added some note that why we
do not need to keep those files from PG16 onwards.
> - I think there's still a bit of a terminology problem here. With the
> patch set, we use RelFileNumber to refer to a single, 56-bit integer
> and RelFileLocator to refer to that integer combined with the DB and
> TS OIDs. But sometimes in the comments we want to talk about the
> logical sequence of files that is identified by a RelFileLocator, and
> that's not quite the same as either of those things. For example, in
> tableam.h we currently say "This callback needs to create a new
> relation filenode for `rel`" and how should that be changed in this
> new naming? We're not creating a new RelFileNumber - those would need
> to be allocated, not created, as all the numbers in the universe exist
> already. Neither are we creating a new locator; that sounds like it
> means assembling it from pieces. What we're doing is creating the
> first of what may end up being a series of similarly-named files on
> disk. I'm not exactly sure how we can refer to that in a way that is
> clear, but it's a problem that arises here and here throughout the
> patch.
I think the comment can say
"This callback needs to create a new relnumber file for 'rel' " ?
I have not modified this yet, I will check other places where we have
such terminology issues.
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
| Attachment | Content-Type | Size | 
|---|---|---|
| v4-0003-Use-56-bits-for-relfilenumber-to-avoid-wraparound.patch | text/x-patch | 74.1 KB | 
| v4-0004-Don-t-delay-removing-Tombstone-file-until-next-ch.patch | text/x-patch | 11.3 KB | 
| v4-0002-Preliminary-refactoring-for-supporting-larger-rel.patch | text/x-patch | 20.2 KB | 
| v4-0001-Rename-RelFileNode-to-RelFileLocator-and-relNode-.patch | text/x-patch | 412.2 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Dagfinn Ilmari Mannsåker | 2022-07-01 10:43:21 | Re: Refactor construct_array() and deconstruct_array() for built-in types | 
| Previous Message | Peter Eisentraut | 2022-07-01 10:28:49 | Re: [PATCH] Add result_types column to pg_prepared_statements view |