Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Date: 2019-04-22 17:32:04
Message-ID: 20190422173204.GK6197@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Peter Geoghegan (pg(at)bowt(dot)ie) wrote:
> On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > This seems like it would be helpful for global indexes as well, wouldn't
> > it?
>
> Yes, though that should probably work by reusing what we already do
> with heap TID (use standard IndexTuple fields on the leaf level for
> heap TID), plus an additional identifier for the partition number that
> is located at the physical end of the tuple. IOW, I think that this
> might benefit from a design that is half way between what we already
> do with heap TIDs and what we would be required to do to make varwidth
> logical row identifiers in tables work -- the partition number is
> varwidth, though often only a single byte.

Yes, global indexes for partitioned tables could potentially be simpler
than the logical row identifiers, but maybe it'd be useful to just have
one implementation based around logical row identifiers which ends up
working for global indexes as well as the other types of indexes and
table access methods.

If we thought that the 'single-byte' partition number covered enough
use-cases, then we could possibly consider supporting global indexes for partitions
by just 'stealing' a byte from BlockIdData and having the per-partition
size be limited to 128GB (24 bits of block number at the default 8kB block
size) when a global index exists on the partitioned
table. That's certainly not an ideal limitation but it might be
appealing to some users who really would like global indexes, and it could
possibly require less to implement, though there are a lot of other things
that would have to be done to have global indexes. Anyhow, just some
random thoughts that I figured I'd share in case there might be
something there worth thinking about.
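
A rough sketch of the sizing arithmetic behind the 'stolen byte' idea follows. It assumes the 32-bit block number held in BlockIdData is split into an 8-bit partition number and a 24-bit block number, and it uses the default 8kB block size. The names PART_BITS, BLOCK_BITS, pack_block() and friends are hypothetical, made up for illustration; nothing like them exists in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: top 8 bits = partition number, low 24 bits = block. */
#define PART_BITS   8u
#define BLOCK_BITS  (32u - PART_BITS)
#define BLOCK_MASK  ((UINT32_C(1) << BLOCK_BITS) - 1u)

/* Largest size addressable with block_bits bits of block number. */
uint64_t
max_addressable_bytes(unsigned block_bits, uint64_t block_size)
{
    assert(block_bits <= 32);
    return (UINT64_C(1) << block_bits) * block_size;
}

/* Combine a partition number and a 24-bit block number into 32 bits. */
uint32_t
pack_block(uint8_t partition, uint32_t block)
{
    assert(block <= BLOCK_MASK);
    return ((uint32_t) partition << BLOCK_BITS) | block;
}

/* Recover the partition number from the packed value. */
uint8_t
unpack_partition(uint32_t packed)
{
    return (uint8_t) (packed >> BLOCK_BITS);
}

/* Recover the block number from the packed value. */
uint32_t
unpack_block(uint32_t packed)
{
    return packed & BLOCK_MASK;
}
```

With 8192-byte blocks, max_addressable_bytes(32, 8192) is 32TB (today's per-relation limit), while max_addressable_bytes(24, 8192) is 128GB, which would be the per-partition ceiling under this layout, with at most 256 partitions.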

> > I agree with trying to avoid having padding 'in the wrong place' and if
> > it makes some indexes smaller, great, even if they're unlikely to be
> > interesting in the vast majority of cases, they may still exist out
> > there. Of course, this is provided that it doesn't overly complicate
> > the code, but it sounds like it wouldn't be too bad in this case.
>
> Here is what it took:
>
> * Removed the "conservative" MAXALIGN() within index_form_tuple(),
> bringing it in line with heap_form_tuple(), which only MAXALIGN()s so
> that the first attribute in tuple's data area can safely be accessed
> on alignment-picky platforms, but doesn't do the same with data_len.
>
> * Removed most of the MAXALIGN()s from nbtinsert.c, except one that
> considers if a page split is required.
>
> * Didn't change the nbtsplitloc.c code, because we need to assume
> MAXALIGN()'d space quantities there. We continue to not trust the
> reported tuple length to be MAXALIGN()'d, which is now essential
> rather than just defensive.
>
> * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the
> whole tuple size in the case where the new pivot tuple requires a heap TID
> representation. We access TIDs as three 2-byte integers, so this is
> necessary for alignment-picky platforms.
>
> I will pursue this as a project for PostgreSQL 13. It doesn't affect
> on-disk compatibility, because BTreeTupleGetHeapTID() works just as
> well with either the existing scheme, or this new one. Having the
> "real" tuple length available will make it easier to implement "true"
> suffix truncation, where we truncate *within* a text attribute (i.e.
> generate a new, shorter value using new opclass infrastructure).

This sounds pretty good to me, though I'm not nearly as close to the
code there as you are.
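
For anyone following along, the alignment arithmetic in the quoted list can be sketched roughly as below. This is only an illustration, not the actual nbtree code: it assumes a maximum alignment of 8 bytes (typical on 64-bit platforms), a 2-byte short alignment, and a 6-byte heap TID (three 2-byte integers). align_up() and the two pivot_size_*() helpers are hypothetical names standing in for MAXALIGN()/SHORTALIGN()-style rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones are platform-dependent. */
#define ASSUMED_MAXIMUM_ALIGNOF 8
#define ASSUMED_ALIGNOF_SHORT   2
#define ASSUMED_TID_SIZE        6   /* three 2-byte integers */

/* Round len up to the next multiple of alignval (a power of two). */
size_t
align_up(size_t len, size_t alignval)
{
    assert(alignval != 0 && (alignval & (alignval - 1)) == 0);
    return (len + alignval - 1) & ~(alignval - 1);
}

/* Conservative variant: both the tuple and the appended TID are max-aligned. */
size_t
pivot_size_maxaligned(size_t raw_len)
{
    return align_up(raw_len, ASSUMED_MAXIMUM_ALIGNOF) +
           align_up(ASSUMED_TID_SIZE, ASSUMED_MAXIMUM_ALIGNOF);
}

/* Tighter variant: only short-align, so the TID's 2-byte fields stay aligned. */
size_t
pivot_size_shortaligned(size_t raw_len)
{
    return align_up(raw_len, ASSUMED_ALIGNOF_SHORT) + ASSUMED_TID_SIZE;
}
```

Under these assumptions, a pivot tuple with 19 bytes of header and key data would take 32 bytes in the conservative variant and 26 bytes in the short-aligned one. Page-level free space accounting would presumably still round up to the maximum alignment, as the nbtsplitloc.c item above notes.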

Thanks!

Stephen
