Re: Making strxfrm() blobs in indexes work

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making strxfrm() blobs in indexes work
Date: 2014-02-03 01:09:06
Message-ID: CAM3SWZS7wewrBmRGCi9_yCX49Ug6UgqN2xNGJG3Zq5v8LbDU4g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 30, 2014 at 8:51 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I've done some more digging. It turns out that the 1977 paper "An
> Encoding Method for Multifield Sorting and Indexing" describes a
> technique that involves concatenating multiple column values and
> comparing them using a simple strcmp().

So I thought about it again, and realized that this probably can't be
made to work given our present assumptions. Commit 656beff59 added the
varstr_cmp() strcmp() tie-breaker. If we cannot trust strcoll() to
indicate equality, why should we be able to trust strxfrm() + strcmp()
to do so, without an equivalent tie-breaker on the original string?
Now, I guess it might be good enough that the comparison in the inner
page indicated "equivalence" provided the comparison on leaf pages
indicated "equality proper". That seems like the kind of distinction
that I'd rather not have to make, though. You could end up with two
non-equal but equivalent tuples in an inner page. Seems pretty woolly
to me.

However, it also occurs to me that strxfrm() blobs have another useful
property: We (as, say, the author of an equality operator on text, an
operator intended for a btree operator class) *can* trust a strcmp()'s
result on blobs, provided the result isn't 0/equal, *even if the blobs
are truncated*. So maybe a better scheme, and certainly a simpler one
would be to have a pseudo-attribute in inner pages only with, say, the
first 8 bytes of a strxfrm() blob formed from the logically-leading
text attribute of the same indexTuple. Because we're only doing this
on inner pages, there is a very good chance that that will be good
enough most of the time. This also allows us to reduce bloat very
effectively.

As I touched on before, other schemes for other types do seem to
suggest themselves. We could support a pseudo leading attribute for
numeric too, for example, where we use a 64-bit integer as a proxy for
the whole number part (with the implementation having special handling
of "out of range" values by means of an internal magic number - as
always with the scheme, equality means "inconclusive, ask logically
leading attribute"). That could win just by mostly avoiding the
serialization overhead.

Now, the obvious counter-argument here is to point out that there are
worst cases where none of this helps. I suspect that those are so rare
in the real world, and the cost of doing all this is so low, that it's
still likely to be well worth it for any given use-case at the macro
level.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2014-02-03 01:09:11 Multiple calls to json_array_elements slow down nonlinearly
Previous Message Tomas Vondra 2014-02-02 23:13:00 Re: GIN improvements part2: fast scan