Quick Links

Re: Making strxfrm() blobs in indexes work

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Making strxfrm() blobs in indexes work
Date:	2014-01-31 04:51:18
Message-ID:	CAM3SWZQozR1R8qAgF+bkdbXf_4=gJw6cSQnMHa4r=pHbSeYKOg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Jan 30, 2014 at 5:05 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> That's not hard to prevent. If that should happen, we don't go with
>>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>>> mark when the optimization was applied.
>>
>> You'd need a bit per column, no?
>
> I don't think so. It would be no big deal if it was all or nothing.

I've done some more digging. It turns out that the 1977 paper "An
Encoding Method for Multifield Sorting and Indexing" describes a
technique that involves concatenating multiple column values and
comparing them using a simple strcmp(). Apparently this is something
that was in system R back in the 1970s. Here is a description of that
paper:

"Sequences of character strings with an order relation imposed between
sequences are considered. An encoding scheme is described which
produces a single, order-preserving string from a sequence of strings.
The original sequence can be recovered from the encoded string, and
one sequence of strings precedes another if and only if the encoding
of the first precedes the encoding of the second. The strings may be
variable length, without a maximum length restriction, and no symbols
need be reserved for control purposes. Hence any symbol may occur in
any string. The scheme is useful for multifield sorting, multifield
indexing, and other applications where ordering on more than one field
is important."

I think we should implement the scheme in this paper, for inner B-Tree
pages. The paper describes how lexicographic sort order must be
preserved, so it's pretty much identical to what I've described,
except it doesn't say anything about inner pages presumably because
there was no Unicode back then, and I don't think that B+Trees/B-link
trees had fully caught on yet. We're already suffering from the fact
that strcoll() mandates a NULL-terminated string, since we have to
copy text datums to a buffer to deal with the impedance mismatch.
Furthermore, multi-column comparisons probably have a lot of overhead
at present from all the indirection alone.

--
Peter Geoghegan

In response to

Re: Making strxfrm() blobs in indexes work at 2014-01-31 01:05:22 from Peter Geoghegan

Responses

Re: Making strxfrm() blobs in indexes work at 2014-02-03 01:09:06 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Claudio Freire	2014-01-31 05:26:18	Prefix compression of B-Tree keys
Previous Message	Tom Lane	2014-01-31 04:44:31	Re: pgindent behavior we could do without