From: | Peter Geoghegan <pg(at)heroku(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Making strxfrm() blobs in indexes work |
Date: | 2014-01-31 04:51:18 |
Message-ID: | CAM3SWZQozR1R8qAgF+bkdbXf_4=gJw6cSQnMHa4r=pHbSeYKOg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jan 30, 2014 at 5:05 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Thu, Jan 30, 2014 at 5:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> That's not hard to prevent. If that should happen, we don't go with
>>> the strxfrm() datum. We have a spare IndexTuple bit we could use to
>>> mark when the optimization was applied.
>>
>> You'd need a bit per column, no?
>
> I don't think so. It would be no big deal if it was all or nothing.
I've done some more digging. It turns out that the 1977 paper "An
Encoding Method for Multifield Sorting and Indexing" describes a
technique that involves concatenating multiple column values and
comparing them using a simple strcmp(). Apparently this is something
that was in system R back in the 1970s. Here is a description of that
paper:
"Sequences of character strings with an order relation imposed between
sequences are considered. An encoding scheme is described which
produces a single, order-preserving string from a sequence of strings.
The original sequence can be recovered from the encoded string, and
one sequence of strings precedes another if and only if the encoding
of the first precedes the encoding of the second. The strings may be
variable length, without a maximum length restriction, and no symbols
need be reserved for control purposes. Hence any symbol may occur in
any string. The scheme is useful for multifield sorting, multifield
indexing, and other applications where ordering on more than one field
is important."
I think we should implement the scheme in this paper, for inner B-Tree
pages. The paper describes how lexicographic sort order must be
preserved, so it's pretty much identical to what I've described,
except it doesn't say anything about inner pages presumably because
there was no Unicode back then, and I don't think that B+Trees/B-link
trees had fully caught on yet. We're already suffering from the fact
that strcoll() mandates a NULL-terminated string, since we have to
copy text datums to a buffer to deal with the impedance mismatch.
Furthermore, multi-column comparisons probably have a lot of overhead
at present from all the indirection alone.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Claudio Freire | 2014-01-31 05:26:18 | Prefix compression of B-Tree keys |
Previous Message | Tom Lane | 2014-01-31 04:44:31 | Re: pgindent behavior we could do without |