Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas
Date: 2022-04-07 18:42:55
Message-ID: CAH2-WzkqGMbc2bbm2zwoSpy2RpH0KSvhMcyD6qWewPUbBy8gdg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 7, 2022 at 7:01 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Because there's no place to put them in the existing page format. We
> jammed checksums into the 2-byte field that had previously been set
> aside for the TLI, but that wasn't really an ideal solution because it
> meant we ended up with a checksum that is only 16 bits wide. However,
> the 2 bytes set aside for the TLI weren't really being used
> effectively anyway, so repurposing them was relatively easy, and a
> 16-bit checksum is better than nothing.

But if we were in a green-field situation we'd probably not want to
use up several bytes for a nonce anyway. You said so yourself.

> I do understand that there are significant challenges and performance
> concerns around having these kinds of initdb-controlled page layout
> changes, so the future of that patch is unclear.

Why does it need to be at initdb time?

Though I cannot prove it, I suspect that the original intent of the
special area was to support an additional (though typically small)
variable length array, that works a little like the current line
pointer array. This array would have to grow backwards (newer items
get appended at earlier physical offsets), unlike our line pointer
array (which gets appended to at the end, in the simple and obvious
way). Growing backwards like this happens in DB systems that store
their line pointer array at the end of the page (the traditional
approach from the System R days, I believe).

Supporting a variable-length special area array like this would mean
that any time you add a new item to the variable-sized array in the
special area, the page's entire tuple space has to be memmove()'d
backwards by a couple of bytes to create the required space. And so
the relevant bufpage.c routine would have to adjust the whole line
pointer array such that each lp_off received a compensating
adjustment. The array might only be for some kind of page-level
transaction metadata, something like that -- shifting it around is
pretty expensive (reusing existing slots isn't too expensive, though).
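
To make the cost concrete, here is a minimal sketch of that append
operation on a toy page. Everything in it (ToyPage, the toy_* functions,
the 256-byte page size) is hypothetical and only loosely modelled on the
pd_lower/pd_upper/pd_special layout; it is not bufpage.c code. It
assumes the fixed-size special struct sits at the very end of the page
and the variable-length array grows toward lower offsets in front of it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 256
#define TOY_MAX_ITEMS 8

/* Hypothetical, simplified page; NOT PostgreSQL's PageHeaderData. */
typedef struct ToyPage
{
    uint16_t lower;                 /* end of the header / line pointer area */
    uint16_t upper;                 /* start of tuple space */
    uint16_t special;               /* start of special area */
    uint16_t nitems;                /* number of line pointers in use */
    uint16_t lp_off[TOY_MAX_ITEMS]; /* offset of each tuple in bytes[] */
    uint8_t  bytes[TOY_PAGE_SIZE];
} ToyPage;

/* Reserve a fixed-size special struct at the very end of the page. */
static void
toy_page_init(ToyPage *page, uint16_t fixed_special_size)
{
    assert(fixed_special_size <= TOY_PAGE_SIZE);
    memset(page, 0, sizeof(*page));
    page->special = TOY_PAGE_SIZE - fixed_special_size;
    page->upper = page->special;
}

/* Tuples are added from the end of tuple space toward lower offsets. */
static int
toy_page_add_tuple(ToyPage *page, const void *tuple, uint16_t len)
{
    if (page->nitems >= TOY_MAX_ITEMS || page->upper - page->lower < len)
        return -1;
    page->upper -= len;
    memcpy(page->bytes + page->upper, tuple, len);
    page->lp_off[page->nitems] = page->upper;
    return page->nitems++;
}

/*
 * Append one slot to the special area array, which grows backwards.
 * The whole tuple space is shifted down by slot_size bytes and every
 * lp_off receives the same compensating adjustment.
 */
static int
toy_special_array_append(ToyPage *page, const void *slot, uint16_t slot_size)
{
    uint16_t tuple_bytes = page->special - page->upper;
    uint16_t i;

    if (page->upper - page->lower < slot_size)
        return -1;              /* not enough free space */

    /* Regions overlap, so memmove() rather than memcpy(). */
    memmove(page->bytes + page->upper - slot_size,
            page->bytes + page->upper, tuple_bytes);
    page->upper -= slot_size;
    page->special -= slot_size;
    for (i = 0; i < page->nitems; i++)
        page->lp_off[i] -= slot_size;

    /* The new slot lands at the new, earlier start of the special area. */
    memcpy(page->bytes + page->special, slot, slot_size);
    return 0;
}
```

Note that the fixed-size struct at the end of the page never moves in
this sketch; only the tuple space and the line pointer offsets do.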

Why can't it work like that? You don't really need to build the full
set of bufpage.c facilities (though it might not be a bad idea to
fully support these variable-length arrays, which seem like they might
come in handy). That seems perfectly compatible with what Matthias
wants to do, provided we're willing to deem the special area struct
(e.g. BTOpaque) as always coming "first" (which is essentially the
same as his current proposal anyway). You can even do the same thing
yourself for the nonce (use a fixed, known offset), with relatively
modest effort. You'd need to have AM-specific knowledge (it would
stack right on top of Matthias's technique), but that doesn't seem all
that hard. There are plenty of remaining status bits in BTOpaque, and
probably all other index AM special areas.
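
One way to read the "fixed, known offset" idea is sketched below. All
names here (ToyHeader, ToyOpaque, TOY_HAS_NONCE, the 8-byte nonce size,
the toy_* functions) are assumptions made for illustration, not
PostgreSQL definitions and not the contents of any posted patch. The
opaque struct is addressed from the end of the page without reading the
header, a hypothetical status bit in it says whether a nonce is present,
and the nonce sits at a second compile-time offset directly in front of
it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BLCKSZ        8192
#define TOY_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)
#define TOY_NONCE_SIZE    8         /* assumed size, for illustration only */
#define TOY_HAS_NONCE     0x0001    /* hypothetical spare status bit */

/* Hypothetical stand-ins for the page header and an AM's opaque struct. */
typedef struct ToyHeader { uint16_t special; } ToyHeader;
typedef struct ToyOpaque { uint32_t prev; uint32_t next; uint16_t flags; } ToyOpaque;

/* The opaque struct always comes "first", i.e. last on the page. */
#define TOY_OPAQUE_OFF (TOY_BLCKSZ - TOY_MAXALIGN(sizeof(ToyOpaque)))
/* The nonce, when present, is stacked directly in front of it. */
#define TOY_NONCE_OFF  (TOY_OPAQUE_OFF - TOY_MAXALIGN(TOY_NONCE_SIZE))

/* Allocate a zeroed page; the caller frees it with free(). */
static uint8_t *
toy_page_create(int with_nonce)
{
    uint8_t    *page = calloc(1, TOY_BLCKSZ);
    ToyHeader   hdr;
    ToyOpaque   opaque = {0, 0, 0};

    if (page == NULL)
        return NULL;
    hdr.special = (uint16_t) (with_nonce ? TOY_NONCE_OFF : TOY_OPAQUE_OFF);
    if (with_nonce)
        opaque.flags |= TOY_HAS_NONCE;
    memcpy(page, &hdr, sizeof(hdr));
    memcpy(page + TOY_OPAQUE_OFF, &opaque, sizeof(opaque));
    return page;
}

/* Indirect lookup: read the special offset from the header first. */
static uint8_t *
toy_special_start(uint8_t *page)
{
    ToyHeader   hdr;

    memcpy(&hdr, page, sizeof(hdr));
    return page + hdr.special;
}

/* Direct lookup: the offset is a compile-time constant. */
static ToyOpaque *
toy_get_opaque_fixed(uint8_t *page)
{
    return (ToyOpaque *) (page + TOY_OPAQUE_OFF);
}

/* AM-specific knowledge: the status bit says whether a nonce exists. */
static uint8_t *
toy_get_nonce(uint8_t *page)
{
    ToyOpaque  *opaque = toy_get_opaque_fixed(page);

    return (opaque->flags & TOY_HAS_NONCE) ? page + TOY_NONCE_OFF : NULL;
}
```

On a page without a nonce, the header's special offset and the fixed
offset agree. On a page with one, the special area starts earlier, but
the opaque struct is still found at the same constant offset.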

--
Peter Geoghegan
